
[Bug] Fast Tokenizer (ByteLevel BPE) incorrectly decodes added_tokens containing specific Unicode characters (e.g., 'č' becomes '\r') #1996

@zidsi

Description

System Info

  • transformers version: 4.57.6
  • Platform: Linux-6.8.0-90-generic-x86_64-with-glibc2.35
  • Python version: 3.12.13
  • Huggingface_hub version: 0.36.2
  • Safetensors version: 0.7.0
  • Accelerate version: 1.13.0
  • Accelerate config: - compute_environment: LOCAL_MACHINE
    - distributed_type: MULTI_GPU
    - mixed_precision: bf16
    - use_cpu: False
    - debug: False
    - num_processes: 8
    - machine_rank: 0
    - num_machines: 1
    - gpu_ids: all
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - enable_cpu_affinity: False
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []
  • DeepSpeed version: not installed
  • PyTorch version (accelerator?): 2.10.0+cu129 (CUDA)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed

Who can help?

Tagging the tokenizers maintainers: @ArthurZucker @itazap

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder
  • My own task or dataset (Tokenizer Expansion / Added Tokens)

Reproduction

Bug Description:
When using a fast tokenizer with a ByteLevel BPE base (such as gpt2, qwen, or roberta), adding new tokens (added_tokens) that contain specific Unicode characters results in corrupted decoding.

Specifically, characters in the range U+0100-U+0120 (the code points the ByteLevel alphabet uses to represent the raw control bytes 0-32) lose their 0x100 offset during fast decoding.
For example:

  • The character č (U+010D) is incorrectly decoded as \r (0x0D / Carriage Return).
  • The character ć (U+0107) is incorrectly decoded as \x07 (0x07 / Bell).
  • The character đ (U+0111) is incorrectly decoded as \x11 (0x11 / Device Control 1).

This only happens when use_fast=True. The slow Python tokenizer (use_fast=False) handles these added tokens correctly. The bug appears to lie in how the Rust backend's ByteLevel decoder processes added_tokens_decoder items: it likely applies the byte-to-unicode inverse mapping (effectively subtracting the 0x100 offset) to added tokens as well, even though added tokens are plain text and should be passed through verbatim.
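For reference, the ByteLevel mapping can be reproduced in pure Python. This is a reimplementation of GPT-2's bytes_to_unicode table, shown only to illustrate the hypothesis above; it is not the tokenizers backend's actual code:

```python
# Reimplementation of GPT-2's byte-to-unicode table, the mapping that the
# ByteLevel decoder inverts. It shows why 'č' turns into '\r': raw byte
# 0x0D is represented in the vocabulary as chr(0x0D + 0x100) = 'č'.
def bytes_to_unicode():
    # Printable Latin-1 bytes keep their own code point...
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # ...while control/whitespace bytes are shifted up by 256.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

byte_to_char = bytes_to_unicode()
char_to_byte = {c: b for b, c in byte_to_char.items()}

# Inverting the table on a literal added-token character strips the offset:
print(hex(char_to_byte["č"]))  # 0xd  (carriage return)
print(hex(char_to_byte["ć"]))  # 0x7  (bell)
print(hex(char_to_byte["đ"]))  # 0x11 (device control 1)
```

Applying this inverse map is correct for tokens drawn from the BPE vocabulary, which really are byte-level strings, but it corrupts added tokens, which never went through the forward mapping in the first place.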

Minimal Reproducible Example (MRE):

from transformers import AutoTokenizer

# Load standard ByteLevel BPE tokenizer
tokenizer_fast = AutoTokenizer.from_pretrained("gpt2", use_fast=True)
tokenizer_slow = AutoTokenizer.from_pretrained("gpt2", use_fast=False)

# Add words containing the problematic characters
new_tokens = ["Začnimo", "kuća", "međa"]

tokenizer_fast.add_tokens(new_tokens)
tokenizer_slow.add_tokens(new_tokens)

print("--- FAST TOKENIZER (BUGGY) ---")
for word in new_tokens:
    ids = tokenizer_fast.encode(word)
    decoded = tokenizer_fast.decode(ids)
    print(f"Original: {word} -> Decoded: {repr(decoded)}")

print("\n--- SLOW TOKENIZER (CORRECT) ---")
for word in new_tokens:
    ids = tokenizer_slow.encode(word)
    decoded = tokenizer_slow.decode(ids)
    print(f"Original: {word} -> Decoded: {repr(decoded)}")

Expected Output:

The Fast Tokenizer should output the original words ("Začnimo", "kuća", "međa").

Actual Output:

--- FAST TOKENIZER (BUGGY) ---
Original: Začnimo -> Decoded: 'Za\rnimo'  <-- 'č' becomes '\r'
Original: kuća -> Decoded: 'ku\x07a'
Original: međa -> Decoded: 'me\x11a'

--- SLOW TOKENIZER (CORRECT) ---
Original: Začnimo -> Decoded: 'Začnimo'
Original: kuća -> Decoded: 'kuća'
Original: međa -> Decoded: 'međa'

Expected behavior

Added tokens containing any valid Unicode character should round-trip through the fast tokenizer's decode without their code points being stripped or downcast to control characters, matching the slow tokenizer's behavior.
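A sketch of the expected behavior in plain Python (this is not transformers' or tokenizers' actual code; decode_tokens, byte_decoder, and skip_added are illustrative names): vocabulary tokens go through the ByteLevel inverse mapping, while added tokens are already plain text and are concatenated verbatim.

```python
# Illustrative sketch, not the real tokenizers implementation.
# Minimal inverse map: printable ASCII maps to itself, plus the one
# surrogate needed for this example ('ć' represents raw byte 0x07).
byte_decoder = {chr(b): b for b in range(ord("!"), ord("~") + 1)}
byte_decoder["ć"] = 0x07

def decode_tokens(tokens, added_tokens, skip_added=True):
    out = []
    for tok in tokens:
        if skip_added and tok in added_tokens:
            out.append(tok)  # added token: plain text, no byte mapping
        else:
            # vocabulary token: map each surrogate char back to its raw byte
            raw = bytes(byte_decoder[c] for c in tok)
            out.append(raw.decode("latin-1"))
    return "".join(out)

added = {"kuća"}
print(decode_tokens(["kuća"], added, skip_added=True))   # expected: 'kuća'
print(decode_tokens(["kuća"], added, skip_added=False))  # the bug: 'ku\x07a'
```

The skip_added=False branch reproduces the corruption reported above: the byte map is blindly applied to a token that was never byte-level encoded.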

Labels: bug