System Info
- transformers version: 4.57.6
- Platform: Linux-6.8.0-90-generic-x86_64-with-glibc2.35
- Python version: 3.12.13
- Huggingface_hub version: 0.36.2
- Safetensors version: 0.7.0
- Accelerate version: 1.13.0
- Accelerate config: - compute_environment: LOCAL_MACHINE
- distributed_type: MULTI_GPU
- mixed_precision: bf16
- use_cpu: False
- debug: False
- num_processes: 8
- machine_rank: 0
- num_machines: 1
- gpu_ids: all
- rdzv_backend: static
- same_network: True
- main_training_function: main
- enable_cpu_affinity: False
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
- DeepSpeed version: not installed
- PyTorch version (accelerator?): 2.10.0+cu129 (CUDA)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
Who can help?
Tagging the tokenizers maintainers: @ArthurZucker @itazap
Information
Tasks
Reproduction
Bug Description:
When using a Fast Tokenizer with a ByteLevel BPE base (such as gpt2, qwen, roberta), adding new tokens (added_tokens) that contain specific Unicode characters results in corrupted decoding.
Specifically, characters whose low byte falls in the control range (0-32) have their high byte stripped during the fast decoding process.
For example:
- The character č (U+010D) is incorrectly decoded as \r (0x0D / Carriage Return).
- The character ć (U+0107) is incorrectly decoded as \x07 (0x07 / Bell).
- The character đ (U+0111) is incorrectly decoded as \x11 (0x11 / Device Control 1).
This only happens when use_fast=True. The slow Python tokenizer (use_fast=False) handles these added tokens perfectly. The bug seems to lie in how the Rust backend's ByteLevel decoder processes added_tokens_decoder items, likely applying a bitwise & 0xFF operation (or similar) that strips the U+0100 offset.
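The arithmetic is consistent with GPT-2's ByteLevel alphabet, which shifts the non-printable bytes (including controls 0-32) up into the U+0100 range so the vocabulary stays printable: č (U+010D) is exactly the printable stand-in for byte 0x0D. A decoder that blindly inverts that table on an added token's literal characters would reproduce the corruption. A minimal sketch reconstructing the table (mirroring the well-known bytes_to_unicode helper from the GPT-2 codebase):

```python
# Sketch of GPT-2's ByteLevel byte<->unicode table
# (mirrors the bytes_to_unicode helper from the GPT-2 codebase).
def bytes_to_unicode():
    # Printable bytes map to themselves; every other byte is shifted
    # into the U+0100+ range so the vocabulary stays printable.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

table = bytes_to_unicode()
inverse = {v: k for k, v in table.items()}

# Byte 0x0D (Carriage Return) is represented as 'č' (U+010D) in ByteLevel
# space, so inverting the table turns a literal 'č' back into \r.
print(repr(table[0x0D]))   # → 'č'
print(hex(inverse["č"]))   # → 0xd
print(hex(inverse["ć"]))   # → 0x7
print(hex(inverse["đ"]))   # → 0x11
```

Since the inverse maps č→0x0D, ć→0x07 and đ→0x11, this matches all three corrupted outputs above, supporting the theory that the fast path runs added tokens through the ByteLevel inverse mapping instead of emitting them verbatim.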
Minimal Reproducible Example (MRE):
from transformers import AutoTokenizer

# Load a standard ByteLevel BPE tokenizer in both fast and slow variants
tokenizer_fast = AutoTokenizer.from_pretrained("gpt2", use_fast=True)
tokenizer_slow = AutoTokenizer.from_pretrained("gpt2", use_fast=False)

# Add words containing the problematic characters
new_tokens = ["Začnimo", "kuća", "međa"]
tokenizer_fast.add_tokens(new_tokens)
tokenizer_slow.add_tokens(new_tokens)

print("--- FAST TOKENIZER (BUGGY) ---")
for word in new_tokens:
    ids = tokenizer_fast.encode(word)
    decoded = tokenizer_fast.decode(ids)
    print(f"Original: {word} -> Decoded: {repr(decoded)}")

print("\n--- SLOW TOKENIZER (CORRECT) ---")
for word in new_tokens:
    ids = tokenizer_slow.encode(word)
    decoded = tokenizer_slow.decode(ids)
    print(f"Original: {word} -> Decoded: {repr(decoded)}")
Expected Output:
The Fast Tokenizer should output the original words ("Začnimo", "kuća", "međa").
Actual Output:
--- FAST TOKENIZER (BUGGY) ---
Original: Začnimo -> Decoded: 'Za\rnimo' <-- 'č' is corrupted into \r
Original: kuća -> Decoded: 'ku\x07a'
Original: međa -> Decoded: 'me\x11a'
--- SLOW TOKENIZER (CORRECT) ---
Original: Začnimo -> Decoded: 'Začnimo'
Original: kuća -> Decoded: 'kuća'
Original: međa -> Decoded: 'međa'
Expected behavior
Added tokens containing any valid Unicode characters should be safely decoded by the Fast Tokenizer without their byte-values being stripped or downcast to control characters.
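Until this is fixed upstream, one possible workaround is to emit added-token ids verbatim and only route runs of ordinary ids through the fast decoder. A sketch of the splitting logic (safe_decode is a hypothetical helper; base_decode and added are stand-ins for tokenizer.decode and the id-to-token mapping derived from tokenizer.get_added_vocab()):

```python
# Workaround sketch: decode added-token ids verbatim and send only runs of
# ordinary ids through the (buggy) ByteLevel decoder. `base_decode` and
# `added` stand in for tokenizer.decode and the inverse of
# tokenizer.get_added_vocab().
def safe_decode(ids, base_decode, added):
    out, run = [], []
    for i in ids:
        if i in added:            # added token: bypass the ByteLevel decoder
            if run:
                out.append(base_decode(run))
                run = []
            out.append(added[i])  # emit the stored token string verbatim
        else:
            run.append(i)
    if run:
        out.append(base_decode(run))
    return "".join(out)

# Toy demonstration with a fake vocabulary: id 50257 is the added token "kuća".
added = {50257: "kuća"}
toy = {1: "a ", 2: "nice "}
decoded = safe_decode([1, 2, 50257],
                      lambda run: "".join(toy[i] for i in run),
                      added)
print(decoded)  # → 'a nice kuća'
```

With a real tokenizer the same function would be called with tokenizer_fast.decode as base_decode and {v: k for k, v in tokenizer_fast.get_added_vocab().items()} as added.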