System Info
- transformers version: 4.57.6
- Platform: Linux-6.8.0-90-generic-x86_64-with-glibc2.35
- Python version: 3.12.13
- Huggingface_hub version: 0.36.2
- Safetensors version: 0.7.0
- Accelerate version: 1.13.0
- Accelerate config: - compute_environment: LOCAL_MACHINE
- distributed_type: MULTI_GPU
- mixed_precision: bf16
- use_cpu: False
- debug: False
- num_processes: 8
- machine_rank: 0
- num_machines: 1
- gpu_ids: all
- rdzv_backend: static
- same_network: True
- main_training_function: main
- enable_cpu_affinity: False
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
- DeepSpeed version: not installed
- PyTorch version (accelerator?): 2.10.0+cu129 (CUDA)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
Who can help?
Tagging the tokenizers maintainers: @ArthurZucker @itazap
Information
Tasks
Reproduction
Bug Description:
When using a Fast Tokenizer with a ByteLevel BPE base (such as gpt2, qwen, roberta), adding new tokens (added_tokens) that contain specific Unicode characters results in corrupted decoding.
Specifically, characters whose low byte falls in the control range (0-32) have their high byte stripped during the fast decoding process.
For example:
- The character č (U+010D) is incorrectly decoded as \r (0x0D / Carriage Return).
- The character ć (U+0107) is incorrectly decoded as \x07 (0x07 / Bell).
- The character đ (U+0111) is incorrectly decoded as \x11 (0x11 / Device Control 1).
This only happens when use_fast=True. The slow Python tokenizer (use_fast=False) handles these added tokens perfectly. The bug seems to lie in how the Rust backend's ByteLevel decoder processes added_tokens_decoder items, likely applying a bitwise & 0xFF operation (or similar) that strips the U+0100 offset.
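The arithmetic is consistent with GPT-2's ByteLevel alphabet, which shifts the non-printable bytes (including controls 0-32) up into the U+0100 range so the vocabulary stays printable: č (U+010D) is exactly the printable stand-in for byte 0x0D. A decoder that blindly inverts that table on an added token's literal characters would reproduce the corruption. A minimal sketch reconstructing the table (mirroring the well-known bytes_to_unicode helper from the GPT-2 codebase):

```python
# Sketch of GPT-2's ByteLevel byte<->unicode table
# (mirrors the bytes_to_unicode helper from the GPT-2 codebase).
def bytes_to_unicode():
    # Printable bytes map to themselves; every other byte is shifted
    # into the U+0100+ range so the vocabulary stays printable.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

table = bytes_to_unicode()
inverse = {v: k for k, v in table.items()}

# Byte 0x0D (Carriage Return) is represented as 'č' (U+010D) in ByteLevel
# space, so inverting the table turns a literal 'č' back into \r.
print(repr(table[0x0D]))   # → 'č'
print(hex(inverse["č"]))   # → 0xd
print(hex(inverse["ć"]))   # → 0x7
print(hex(inverse["đ"]))   # → 0x11
```

Since the inverse maps č→0x0D, ć→0x07 and đ→0x11, this matches all three corrupted outputs above, supporting the theory that the fast path runs added tokens through the ByteLevel inverse mapping instead of emitting them verbatim.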
Minimal Reproducible Example (MRE):
from transformers import AutoTokenizer

# Load a standard ByteLevel BPE tokenizer in both fast and slow variants
tokenizer_fast = AutoTokenizer.from_pretrained("gpt2", use_fast=True)
tokenizer_slow = AutoTokenizer.from_pretrained("gpt2", use_fast=False)

# Add words containing the problematic characters
new_tokens = ["Začnimo", "kuća", "međa"]
tokenizer_fast.add_tokens(new_tokens)
tokenizer_slow.add_tokens(new_tokens)

print("--- FAST TOKENIZER (BUGGY) ---")
for word in new_tokens:
    ids = tokenizer_fast.encode(word)
    decoded = tokenizer_fast.decode(ids)
    print(f"Original: {word} -> Decoded: {repr(decoded)}")

print("\n--- SLOW TOKENIZER (CORRECT) ---")
for word in new_tokens:
    ids = tokenizer_slow.encode(word)
    decoded = tokenizer_slow.decode(ids)
    print(f"Original: {word} -> Decoded: {repr(decoded)}")
Expected Output:
The Fast Tokenizer should output the original words ("Začnimo", "kuća", "međa").
Actual Output:
--- FAST TOKENIZER (BUGGY) ---
Original: Začnimo -> Decoded: 'Za\rnimo' <-- 'č' is corrupted into \r
Original: kuća -> Decoded: 'ku\x07a'
Original: međa -> Decoded: 'me\x11a'
--- SLOW TOKENIZER (CORRECT) ---
Original: Začnimo -> Decoded: 'Začnimo'
Original: kuća -> Decoded: 'kuća'
Original: međa -> Decoded: 'međa'
Expected behavior
Added tokens containing any valid Unicode characters should be safely decoded by the Fast Tokenizer without their byte-values being stripped or downcast to control characters.
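Until this is fixed upstream, one possible workaround is to emit added-token ids verbatim and only route runs of ordinary ids through the fast decoder. A sketch of the splitting logic (safe_decode is a hypothetical helper; base_decode and added are stand-ins for tokenizer.decode and the id-to-token mapping derived from tokenizer.get_added_vocab()):

```python
# Workaround sketch: decode added-token ids verbatim and send only runs of
# ordinary ids through the (buggy) ByteLevel decoder. `base_decode` and
# `added` stand in for tokenizer.decode and the inverse of
# tokenizer.get_added_vocab().
def safe_decode(ids, base_decode, added):
    out, run = [], []
    for i in ids:
        if i in added:            # added token: bypass the ByteLevel decoder
            if run:
                out.append(base_decode(run))
                run = []
            out.append(added[i])  # emit the stored token string verbatim
        else:
            run.append(i)
    if run:
        out.append(base_decode(run))
    return "".join(out)

# Toy demonstration with a fake vocabulary: id 50257 is the added token "kuća".
added = {50257: "kuća"}
toy = {1: "a ", 2: "nice "}
decoded = safe_decode([1, 2, 50257],
                      lambda run: "".join(toy[i] for i in run),
                      added)
print(decoded)  # → 'a nice kuća'
```

With a real tokenizer the same function would be called with tokenizer_fast.decode as base_decode and {v: k for k, v in tokenizer_fast.get_added_vocab().items()} as added.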