Pull requests: huggingface/tokenizers
Pull requests list
perf: skip alignment tracking in encode_fast normalization
#2022
opened Apr 10, 2026 by
ArthurZucker
Collaborator
feat: Normalizer::normalize_str — skip NormalizedString allocation
#2020
opened Apr 10, 2026 by
ArthurZucker
Collaborator
feat: open-addressing merge table with cache-line-local linear probing for BPE
#2012
opened Apr 8, 2026 by
ArthurZucker
Collaborator
feat: compact vocabulary — single-allocation id→token store for BPE
#2011
opened Apr 8, 2026 by
ArthurZucker
Collaborator
feat(pattern): parallel regex find_matches for large inputs
#2003
opened Mar 31, 2026 by
McPatate
Member
fix: skip serializing ByteLevel fields at their default value
#2001
opened Mar 30, 2026 by
ArthurZucker
Collaborator
feat: performance, adding pcre2 backend + regex-shards (5-15% speedup)
#1968
opened Mar 19, 2026 by
michaelfeil
Contributor
feat: Optimize BPE tokenization: sharded cache, packed merge keys, FxHash (10-15% speedup)
#1967
opened Mar 19, 2026 by
michaelfeil
Contributor
Fix type_ids not applied to overflow encodings
#1965
opened Mar 17, 2026 by
joaquinhuigomez
Add get_special_tokens and is_special_token methods
#1945
opened Feb 5, 2026 by
ArthurZucker
Collaborator
2 tasks done
Add post_process_tokens and post_process_ids methods
#1944
opened Feb 5, 2026 by
ArthurZucker
Collaborator
3 tasks done
feat: add unk_token property to Unigram model
#1943
opened Feb 5, 2026 by
ArthurZucker
Collaborator
4 tasks done
🚨 feat: add role_to_token field for special token metadata
#1942
opened Feb 5, 2026 by
ArthurZucker
Collaborator
Use unicode-normalization instead of unicode-normalization-alignments
#1912
opened Dec 14, 2025 by
IvanIsCoding