perf: skip alignment tracking in encode_fast normalization by ArthurZucker · Pull Request #2022 · huggingface/tokenizers

ArthurZucker · 2026-04-10T15:09:45Z

Summary

When encode_fast is called, the normalization step now skips per-byte alignment tracking — the main allocation overhead in the normalize path.

Problem

extract_and_normalize calls normalizer.normalize(&mut NormalizedString) which rebuilds alignment vectors (one (usize, usize) per byte) on every normalization. For encode_fast where offsets are explicitly not needed (OffsetType::None), this is wasted work.
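To make the cost concrete, here is a minimal sketch of what a per-byte alignment vector looks like. `char_alignments` and `alignment_bytes` are hypothetical helpers written for illustration, not functions from the crate, and the rule that every byte of a multi-byte character maps to the whole character's byte range is an assumption.

```rust
use std::mem::size_of;

/// Hypothetical helper: builds one `(start, end)` entry per byte of `s`,
/// where the pair is the byte range of the character that byte belongs to.
fn char_alignments(s: &str) -> Vec<(usize, usize)> {
    let mut out = Vec::with_capacity(s.len());
    for (start, ch) in s.char_indices() {
        let width = ch.len_utf8();
        // Assumption: every byte of a multi-byte char maps to the whole char.
        out.extend(std::iter::repeat((start, start + width)).take(width));
    }
    out
}

/// Heap bytes taken by the alignment entries alone for a given input.
fn alignment_bytes(s: &str) -> usize {
    char_alignments(s).len() * size_of::<(usize, usize)>()
}
```

On a 64-bit target each entry is 16 bytes, so under these assumptions the alignment vector is 16 times the size of the text it describes, and it is rebuilt on every normalization.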

Solution

NormalizedString::set_normalized(String) — replaces normalized content with trivial 1:1 alignments. Enough for the split/slice machinery to work, but no real offset mapping is preserved.
AddedVocabulary::extract_and_normalize_fast() — uses Normalizer::normalize_str() (no NormalizedString allocation in the normalizer) + set_normalized() to avoid alignment tracking.
encode_single_sequence — automatically picks the fast path when offsets_type == OffsetType::None.
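The three pieces above can be sketched roughly as follows. This is a simplified model, not the code in this PR: `NormalizedSketch`, `normalize_fast`, and `use_fast_path` are hypothetical names, the real `NormalizedString` has more state, and reading "trivial 1:1 alignments" as `(i, i + 1)` per byte is an assumption.

```rust
/// Hypothetical, simplified stand-in for `NormalizedString`.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
struct NormalizedSketch {
    original: String,
    normalized: String,
    alignments: Vec<(usize, usize)>,
}

/// Assumed meaning of "trivial 1:1 alignments": byte i maps to i..i+1.
fn trivial_alignments(len: usize) -> Vec<(usize, usize)> {
    (0..len).map(|i| (i, i + 1)).collect()
}

impl NormalizedSketch {
    fn new(s: &str) -> Self {
        Self {
            original: s.to_owned(),
            normalized: s.to_owned(),
            alignments: trivial_alignments(s.len()),
        }
    }

    /// Swap in already-normalized text without computing a real offset mapping.
    fn set_normalized(&mut self, normalized: String) {
        self.alignments = trivial_alignments(normalized.len());
        self.normalized = normalized;
    }
}

/// Mirrors the three offset modes named in the PR description.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum OffsetType {
    Byte,
    Char,
    None,
}

/// The fast path is only valid when the caller asked for no offsets.
fn use_fast_path(offsets: OffsetType) -> bool {
    matches!(offsets, OffsetType::None)
}

/// Fast path: normalize as a plain string, then install it in one step.
fn normalize_fast(input: &str, normalize_str: impl Fn(&str) -> String) -> NormalizedSketch {
    let mut n = NormalizedSketch::new(input);
    n.set_normalized(normalize_str(input));
    n
}
```

For example, `normalize_fast("HeLLo", |s| s.to_lowercase())` yields normalized text `hello` with five placeholder alignment entries. The alignments keep slicing by byte index working, but they should not be read back as offsets into the original input.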

What this skips per normalization call

Before (normalize):  original clone + normalized clone + Vec<(usize,usize)> per byte
After  (fast path):  normalize_str() → String + trivial alignment fill

For a normalizer like Lowercase, normalize_str is just s.to_lowercase() — one allocation, zero alignment work.
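A minimal sketch of a string-only trait method with a default fallback is shown below. The trait and types here (`NormalizerSketch`, `StripSketch`, and the `normalize` signature taking a `String` plus an alignment vector) are hypothetical simplifications; the actual `Normalizer` trait operates on `NormalizedString`, and the real `normalize_str` signature may differ.

```rust
/// Hypothetical, simplified normalizer trait.
trait NormalizerSketch {
    /// Full path: rewrites the text and keeps the alignments in sync.
    fn normalize(&self, text: &mut String, alignments: &mut Vec<(usize, usize)>);

    /// String-only path. The default runs the full path and throws the
    /// alignments away, so every normalizer works without an override.
    fn normalize_str(&self, s: &str) -> String {
        let mut text = s.to_owned();
        let mut alignments: Vec<(usize, usize)> =
            (0..s.len()).map(|i| (i, i + 1)).collect();
        self.normalize(&mut text, &mut alignments);
        text
    }
}

struct Lowercase;

impl NormalizerSketch for Lowercase {
    fn normalize(&self, text: &mut String, alignments: &mut Vec<(usize, usize)>) {
        *text = text.to_lowercase();
        // Simplification: refill with placeholder entries for the new length.
        *alignments = (0..text.len()).map(|i| (i, i + 1)).collect();
    }

    /// Override: one allocation, no alignment work.
    fn normalize_str(&self, s: &str) -> String {
        s.to_lowercase()
    }
}

/// Trims surrounding whitespace; relies on the default `normalize_str`.
struct StripSketch;

impl NormalizerSketch for StripSketch {
    fn normalize(&self, text: &mut String, alignments: &mut Vec<(usize, usize)>) {
        // Byte offset of the first kept character in the untrimmed text.
        let start = text.len() - text.trim_start().len();
        let trimmed = text.trim().to_owned();
        *alignments = (start..start + trimmed.len()).map(|i| (i, i + 1)).collect();
        *text = trimmed;
    }
}
```

`Lowercase.normalize_str("HeLLo")` takes the overridden path, while `StripSketch.normalize_str("  hi  ")` goes through the default and still returns the correct string, only without the saving.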

Depends on

feat: Normalizer::normalize_str — skip NormalizedString allocation #2020 for the full normalize_str implementations (this PR includes the trait method with default fallback, so it works standalone too)

…_fast

When encode_fast is called (OffsetType::None), the normalization step now uses normalize_str + set_normalized instead of the full NormalizedString::normalize, which builds per-byte alignment vectors.

Changes:
- NormalizedString::set_normalized(): replace normalized content with trivial 1:1 alignments (enough for splitting, no real offset mapping)
- AddedVocabulary::extract_and_normalize_fast(): uses normalize_str for the normalization step, avoiding O(n) alignment allocations
- encode_single_sequence: automatically picks the fast path when offsets_type is None (i.e. encode_fast)
- Normalizer::normalize_str trait method added (default falls back to NormalizedString)

HuggingFaceDocBuilderDev · 2026-04-10T15:12:49Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

ArthurZucker · 2026-04-10T15:24:24Z

/benchmark

github-actions · 2026-04-10T15:30:02Z

Python Benchmark Results

Commit: 0e4957c5f25d95ee9e8f4106138ac42fac52726f

github-actions · 2026-04-10T15:32:00Z

Rust Benchmark Results

Commit: 0e4957c5f25d95ee9e8f4106138ac42fac52726f

Stresses the normalize path during deserialization: 100k added tokens with NFKC normalizer, saved to a temp file and loaded back. This reflects real-world tokenizers with large added vocabularies.

ArthurZucker · 2026-04-10T15:38:12Z

/benchmark

Merge branch 'main' into fast-extract-normalize

40a7eed

bench: add deserialize-100k-nfkc to ci_benchmark

0e4957c

ArthurZucker wants to merge 3 commits into main from fast-extract-normalize
