
feat: add pretokenizer parameter to Tokenizer, update README #26

Merged
AmitMY merged 1 commit into main from feat/pretokenizer-param
Apr 8, 2026

Conversation

Contributor

@AmitMY AmitMY commented Apr 8, 2026

Summary

  • Tokenizer(pretokenizer=...) accepts a regex string, callable, or None (GPT pattern default)
  • All subclasses (BPETokenizer, BNETokenizer, etc.) pass it through
  • SuperBPETokenizer passes pretokenizer to phase 1
  • Extract regex_pretokenizer() factory in words.py
  • Rewrite README with current API; remove stale WordsSegmentationTokenizer reference

Usage

# Default GPT pretokenization
tok = BPETokenizer()

# Custom regex
tok = BPETokenizer(pretokenizer=r"\w+|\S")

# Any callable str -> list[str]
tok = BPETokenizer(pretokenizer=str.split)
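The three accepted forms above could be normalized internally along these lines. This is a sketch, not the repository's code: `resolve_pretokenizer` and the simplified `GPT_PATTERN` constant are illustrative names (the actual factory is `regex_pretokenizer()` in words.py, and the real default pattern is far richer).

```python
import re
from typing import Callable

# Simplified GPT-2-style pattern, for illustration only.
GPT_PATTERN = r"'s|'t|'re|'ve|'m|'ll|'d| ?\w+| ?[^\s\w]+|\s+"

def regex_pretokenizer(pattern: str) -> Callable[[str], list[str]]:
    """Build a splitter returning all non-overlapping matches of `pattern`."""
    compiled = re.compile(pattern)
    return lambda text: compiled.findall(text)

def resolve_pretokenizer(
    pretokenizer: "str | Callable[[str], list[str]] | None",
) -> Callable[[str], list[str]]:
    if pretokenizer is None:           # default: GPT-style regex
        return regex_pretokenizer(GPT_PATTERN)
    if isinstance(pretokenizer, str):  # regex string
        return regex_pretokenizer(pretokenizer)
    return pretokenizer                # assume a callable str -> list[str]
```

Dispatching on type here keeps the common cases (None, regex string) one-liners for callers, while any `str -> list[str]` callable passes through untouched.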

Test plan

  • 127 tests pass (3 new pretokenizer tests)
  • `ruff check .` passes

🤖 Generated with Claude Code

"[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+(?i:'s|'t|'re|'ve|'m|'ll|'d)?"
"|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*(?i:'s|'t|'re|'ve|'m|'ll|'d)?"
"|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
)
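One practical detail about this pattern: it leans on Unicode property classes (`\p{L}`, `\p{N}`, `\p{Lu}`, ...), which Python's stdlib `re` module rejects, while the third-party `regex` package (and the oniguruma engine behind Hugging Face tokenizers) accepts them. A minimal probe, using only the stdlib:

```python
import re

def supports_unicode_properties(engine) -> bool:
    """Return True if `engine` (an `re`-compatible module) compiles \\p{L}."""
    try:
        engine.compile(r"\p{L}+")
        return True
    except engine.error:
        return False

print(supports_unicode_properties(re))  # stdlib re: False
```

Swapping in `import regex` for the stdlib module makes the probe return True (assuming the `regex` package is installed), which is why regex-based pretokenizers with patterns like the one above cannot be compiled with `re` directly.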
Contributor Author


Let's create a GPTPretokenizer = Hugging Face tokenizers Split(pattern=...).

That would then mean we always get a Hugging Face "Tokenizer" class (words-segmentation is a tokenizer class as well).

We might need to create a RegexTokenizer(Tokenizer), but that's fine as well.
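The suggestion above could look roughly like this. It is a sketch under assumptions: it requires the `tokenizers` package, `GPT_PATTERN` is a simplified stand-in for the full pattern, and `GPTPretokenizer` is the reviewer's proposed name, not existing code.

```python
from tokenizers import Regex
from tokenizers.pre_tokenizers import Split

# Simplified stand-in for the full GPT pretokenization pattern.
GPT_PATTERN = r"'(?:s|t|re|ve|m|ll|d)| ?\w+| ?[^\s\w]+|\s+"

# invert=True makes the pattern describe the pieces to *keep* rather than
# the delimiters; behavior="isolated" emits each match as its own piece.
gpt_pretokenizer = Split(pattern=Regex(GPT_PATTERN), behavior="isolated", invert=True)

pieces = [piece for piece, _span in gpt_pretokenizer.pre_tokenize_str("Hello, world!")]
```

Because "isolated" drops nothing, the pieces concatenate back to the original string, which is the property byte-level BPE training relies on.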

Comment thread on complex_tokenization/tokenizer.py (outdated diff context):
units: str | Callable[[str], GraphVertex] = "utf8_clusters",
merge_size: int = 2,
connected: bool = False,
pretokenizer: str | Callable[[str], list[str]] | None = None,
Contributor Author


This should always be the Hugging Face tokenizer class type, defaulting to GPTPretokenizer.

Commit message:

- Tokenizer accepts pretokenizer as regex string, callable, or None
  (defaults to GPT pretokenization pattern)
- Extract regex_pretokenizer() factory for creating regex-based splitters
- Pass pretokenizer through to words() and SuperBPETokenizer phase 1
- Rewrite README with current API, remove stale WordsSegmentationTokenizer
- 3 new pretokenizer tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@AmitMY force-pushed the feat/pretokenizer-param branch from 4922510 to 85b9e4f on April 8, 2026 20:58
@AmitMY merged commit 75f9b1f into main, Apr 8, 2026
2 checks passed
