
feat: add pretokenizer parameter to Tokenizer, update README #26

Merged
AmitMY merged 1 commit into main from feat/pretokenizer-param
Apr 8, 2026

Conversation

Contributor

@AmitMY AmitMY commented Apr 8, 2026

Summary

  • Tokenizer(pretokenizer=...) accepts a regex string, callable, or None (GPT pattern default)
  • All subclasses (BPETokenizer, BNETokenizer, etc.) pass it through
  • SuperBPETokenizer passes pretokenizer to phase 1
  • Extract regex_pretokenizer() factory in words.py
  • Rewrite README with current API; remove stale WordsSegmentationTokenizer reference

Usage

# Default GPT pretokenization
tok = BPETokenizer()

# Custom regex
tok = BPETokenizer(pretokenizer=r"\w+|\S")

# Any callable str -> list[str]
tok = BPETokenizer(pretokenizer=str.split)
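The three accepted forms above could be normalized internally along these lines. This is a sketch, not the repository's code: `resolve_pretokenizer` and the simplified `GPT_PATTERN` constant are illustrative names (the actual factory is `regex_pretokenizer()` in words.py, and the real default pattern is far richer).

```python
import re
from typing import Callable

# Simplified GPT-2-style pattern, for illustration only.
GPT_PATTERN = r"'s|'t|'re|'ve|'m|'ll|'d| ?\w+| ?[^\s\w]+|\s+"

def regex_pretokenizer(pattern: str) -> Callable[[str], list[str]]:
    """Build a splitter returning all non-overlapping matches of `pattern`."""
    compiled = re.compile(pattern)
    return lambda text: compiled.findall(text)

def resolve_pretokenizer(
    pretokenizer: "str | Callable[[str], list[str]] | None",
) -> Callable[[str], list[str]]:
    if pretokenizer is None:           # default: GPT-style regex
        return regex_pretokenizer(GPT_PATTERN)
    if isinstance(pretokenizer, str):  # regex string
        return regex_pretokenizer(pretokenizer)
    return pretokenizer                # assume a callable str -> list[str]
```

Dispatching on type here keeps the common cases (None, regex string) one-liners for callers, while any `str -> list[str]` callable passes through untouched.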

Test plan

  • 127 tests pass (3 new pretokenizer tests)
  • `ruff check .` passes

🤖 Generated with Claude Code

"[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+(?i:'s|'t|'re|'ve|'m|'ll|'d)?"
"|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*(?i:'s|'t|'re|'ve|'m|'ll|'d)?"
"|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
)
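One practical detail about this pattern: it leans on Unicode property classes (`\p{L}`, `\p{N}`, `\p{Lu}`, ...), which Python's stdlib `re` module rejects, while the third-party `regex` package (and the oniguruma engine behind Hugging Face tokenizers) accepts them. A minimal probe, using only the stdlib:

```python
import re

def supports_unicode_properties(engine) -> bool:
    """Return True if `engine` (an `re`-compatible module) compiles \\p{L}."""
    try:
        engine.compile(r"\p{L}+")
        return True
    except engine.error:
        return False

print(supports_unicode_properties(re))  # stdlib re: False
```

Swapping in `import regex` for the stdlib module makes the probe return True (assuming the `regex` package is installed), which is why regex-based pretokenizers with patterns like the one above cannot be compiled with `re` directly.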
Contributor Author


Let's create a GPTPretokenizer = Hugging Face tokenizers Split(pattern=...).

That would then mean we always get a Hugging Face "Tokenizer" class (words-segmentation is a tokenizer class as well).

We might need to create a RegexTokenizer(Tokenizer), but that's fine as well.
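The suggestion above could look roughly like this. It is a sketch under assumptions: it requires the `tokenizers` package, `GPT_PATTERN` is a simplified stand-in for the full pattern, and `GPTPretokenizer` is the reviewer's proposed name, not existing code.

```python
from tokenizers import Regex
from tokenizers.pre_tokenizers import Split

# Simplified stand-in for the full GPT pretokenization pattern.
GPT_PATTERN = r"'(?:s|t|re|ve|m|ll|d)| ?\w+| ?[^\s\w]+|\s+"

# invert=True makes the pattern describe the pieces to *keep* rather than
# the delimiters; behavior="isolated" emits each match as its own piece.
gpt_pretokenizer = Split(pattern=Regex(GPT_PATTERN), behavior="isolated", invert=True)

pieces = [piece for piece, _span in gpt_pretokenizer.pre_tokenize_str("Hello, world!")]
```

Because "isolated" drops nothing, the pieces concatenate back to the original string, which is the property byte-level BPE training relies on.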

Comment thread on complex_tokenization/tokenizer.py (outdated diff context):
units: str | Callable[[str], GraphVertex] = "utf8_clusters",
merge_size: int = 2,
connected: bool = False,
pretokenizer: str | Callable[[str], list[str]] | None = None,
Contributor Author


This should always be the Hugging Face tokenizer class type, defaulting to GPTPretokenizer.

Commit message:

- Tokenizer accepts pretokenizer as regex string, callable, or None
  (defaults to GPT pretokenization pattern)
- Extract regex_pretokenizer() factory for creating regex-based splitters
- Pass pretokenizer through to words() and SuperBPETokenizer phase 1
- Rewrite README with current API, remove stale WordsSegmentationTokenizer
- 3 new pretokenizer tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@AmitMY force-pushed the feat/pretokenizer-param branch from 4922510 to 85b9e4f on April 8, 2026 20:58
@AmitMY merged commit 75f9b1f into main, Apr 8, 2026
2 checks passed
