
perf: skip alignment tracking in encode_fast normalization#2022

Open
ArthurZucker wants to merge 3 commits into main from fast-extract-normalize

Conversation

@ArthurZucker
Collaborator

Summary

When encode_fast is called, the normalization step now skips per-byte alignment tracking — the main allocation overhead in the normalize path.

Problem

extract_and_normalize calls normalizer.normalize(&mut NormalizedString) which rebuilds alignment vectors (one (usize, usize) per byte) on every normalization. For encode_fast where offsets are explicitly not needed (OffsetType::None), this is wasted work.
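To make the overhead concrete, here is a minimal std-only sketch (not the crate's actual code) of the per-byte alignment vector the full normalize path has to rebuild on every call; `build_alignments` is a hypothetical stand-in:

```rust
/// Hypothetical illustration: one (start, end) byte range per byte of the
/// normalized text, as the full normalize path maintains.
fn build_alignments(s: &str) -> Vec<(usize, usize)> {
    (0..s.len()).map(|i| (i, i + 1)).collect()
}

fn main() {
    let s = "hello world";
    let aligns = build_alignments(s);
    // One entry per byte; each entry is two usize, i.e. 16 bytes on
    // 64-bit targets. For encode_fast this allocation buys nothing.
    assert_eq!(aligns.len(), s.len());
    println!("{} alignment entries", aligns.len());
}
```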

Solution

  • NormalizedString::set_normalized(String) — replaces normalized content with trivial 1:1 alignments. Enough for the split/slice machinery to work, but no real offset mapping is preserved.
  • AddedVocabulary::extract_and_normalize_fast() — uses Normalizer::normalize_str() (no NormalizedString allocation in the normalizer) + set_normalized() to avoid alignment tracking.
  • encode_single_sequence — automatically picks the fast path when offsets_type == OffsetType::None.
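A rough sketch of what `set_normalized` does conceptually, using a reduced stand-in for `NormalizedString` (the real type carries more state and methods; field names here are illustrative):

```rust
/// Reduced, hypothetical stand-in for tokenizers' NormalizedString.
struct NormalizedString {
    original: String,
    normalized: String,
    /// One (start, end) byte range into `original` per byte of `normalized`.
    alignments: Vec<(usize, usize)>,
}

impl NormalizedString {
    fn new(original: &str) -> Self {
        let alignments = (0..original.len()).map(|i| (i, i + 1)).collect();
        Self {
            original: original.to_string(),
            normalized: original.to_string(),
            alignments,
        }
    }

    /// Replace the normalized content, filling trivial 1:1 alignments.
    /// Splitting/slicing still works, but offsets back into `original`
    /// are no longer meaningful.
    fn set_normalized(&mut self, normalized: String) {
        self.alignments = (0..normalized.len()).map(|i| (i, i + 1)).collect();
        self.normalized = normalized;
    }
}

fn main() {
    let mut n = NormalizedString::new("HELLO");
    n.set_normalized("hello".to_string());
    assert_eq!(n.normalized, "hello");
    // Trivial alignments: byte i maps to (i, i + 1).
    assert_eq!(n.alignments[0], (0, 1));
    println!("ok");
}
```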

What this skips per normalization call

Before (normalize):  original clone + normalized clone + Vec<(usize,usize)> per byte
After  (fast path):  normalize_str() → String + trivial alignment fill

For a normalizer like Lowercase, normalize_str is just s.to_lowercase() — one allocation, zero alignment work.
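The trait shape this implies can be sketched as follows; `Normalizer` and `normalize_str` follow the PR description, while `normalize_full` is a hypothetical stand-in for the full `NormalizedString`-based path:

```rust
trait Normalizer {
    /// Stand-in for the full path, which in the real crate mutates a
    /// NormalizedString and maintains per-byte alignments.
    fn normalize_full(&self, s: &str) -> String;

    /// Fast path: default falls back to the full path, so existing
    /// normalizers keep working; cheap ones override it.
    fn normalize_str(&self, s: &str) -> String {
        self.normalize_full(s)
    }
}

struct Lowercase;

impl Normalizer for Lowercase {
    fn normalize_full(&self, s: &str) -> String {
        s.to_lowercase()
    }

    /// One allocation, zero alignment work.
    fn normalize_str(&self, s: &str) -> String {
        s.to_lowercase()
    }
}

fn main() {
    assert_eq!(Lowercase.normalize_str("HeLLo"), "hello");
    println!("ok");
}
```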

Depends on

…_fast

When encode_fast is called (OffsetType::None), the normalization step
now uses normalize_str + set_normalized instead of the full
NormalizedString::normalize which builds per-byte alignment vectors.

Changes:
- NormalizedString::set_normalized(): replace normalized content with
  trivial 1:1 alignments (enough for splitting, no real offset mapping)
- AddedVocabulary::extract_and_normalize_fast(): uses normalize_str
  for the normalization step, avoiding O(n) alignment allocations
- encode_single_sequence: automatically picks the fast path when
  offsets_type is None (i.e. encode_fast)
- Normalizer::normalize_str trait method added (default falls back to
  NormalizedString)
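The dispatch described in the change list can be sketched like this (hypothetical, std-only; the real `encode_single_sequence` works on richer types, and `to_lowercase` stands in for an arbitrary normalizer):

```rust
#[derive(PartialEq)]
enum OffsetType {
    Byte,
    Char,
    None,
}

/// Full path: normalized text plus one alignment entry per byte.
fn full_normalize(s: &str) -> (String, Vec<(usize, usize)>) {
    let out = s.to_lowercase();
    let aligns = (0..out.len()).map(|i| (i, i + 1)).collect();
    (out, aligns)
}

/// Hypothetical dispatch mirroring the change list: encode_fast implies
/// OffsetType::None, which selects the allocation-free path.
fn normalize_for_encode(s: &str, offsets_type: OffsetType) -> String {
    if offsets_type == OffsetType::None {
        // Fast path: String in, String out, no alignment vector.
        s.to_lowercase()
    } else {
        full_normalize(s).0
    }
}

fn main() {
    // Both paths produce the same text; only the bookkeeping differs.
    assert_eq!(
        normalize_for_encode("HeLLo", OffsetType::None),
        normalize_for_encode("HeLLo", OffsetType::Byte),
    );
    println!("ok");
}
```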
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker
Collaborator Author

/benchmark

@github-actions

github-actions bot commented Apr 10, 2026

Python Benchmark Results

Commit: 0e4957c5f25d95ee9e8f4106138ac42fac52726f

Python Benchmarks

@github-actions

github-actions bot commented Apr 10, 2026

Rust Benchmark Results

Commit: 0e4957c5f25d95ee9e8f4106138ac42fac52726f

Rust Benchmarks

Stresses the normalize path during deserialization: 100k added tokens
with NFKC normalizer, saved to a temp file and loaded back. This
reflects real-world tokenizers with large added vocabularies.
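The shape of that workload can be mimicked with a std-only sketch (this does not use the tokenizers crate; the real benchmark goes through `Tokenizer` serialization with an NFKC normalizer, so this only mirrors the 100k-token round trip):

```rust
use std::fs;
use std::io::{BufRead, BufReader, Write};
use std::path::Path;

/// Write `n` placeholder added-token entries to `path` and read them back.
fn round_trip(path: &Path, n: usize) -> Vec<String> {
    {
        let mut f = fs::File::create(path).unwrap();
        for i in 0..n {
            writeln!(f, "<tok_{i}>").unwrap();
        }
    }
    let tokens: Vec<String> = BufReader::new(fs::File::open(path).unwrap())
        .lines()
        .map(|l| l.unwrap())
        .collect();
    fs::remove_file(path).unwrap();
    tokens
}

fn main() {
    let path = std::env::temp_dir().join("added_tokens_bench.txt");
    // 100k tokens, matching the scale of the benchmark described above.
    let tokens = round_trip(&path, 100_000);
    assert_eq!(tokens.len(), 100_000);
    println!("loaded {} tokens", tokens.len());
}
```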
@ArthurZucker
Collaborator Author

/benchmark
