Fix float16 memory leak during 4-bit quantized model loading #44728
Closed
ajmeese7 wants to merge 4 commits into huggingface:main from
Conversation
`Bnb4bitQuantize.convert()` holds a local reference (`value`) to the float16 source tensor while calling `Params4bit(...).to(device)`. Even though `_quantize()` replaces the parameter's storage with 4-bit data, the original float16 allocation stays pinned on GPU because `Params4bit.__new__` shares storage via `_make_subclass` and `value` keeps a second reference alive until `convert()` returns. Because `convert()` is called once per weight in a sequential loop, float16 data accumulates linearly. A 35B model (~67 GB float16) OOMs a 32 GB GPU at ~48% loaded, despite 4-bit reducing the final size to ~17 GB. Confirmed via instrumentation: `_quantize()` runs 300+ times with `memory_allocated()` delta of zero — the float16 is never freed. Fix: capture `target_device`, then `del value` and `input_dict.clear()` before calling `.to()`. This drops all external references so `_quantize()` can release the float16 storage when it replaces it with 4-bit data.
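A minimal, torch-free sketch of the ordering the fix imposes (all names below are stand-ins, not the real transformers/bitsandbytes APIs): a mock `Params4bit` keeps a reference to the source data the way `_make_subclass` shares storage, and a `weakref` shows the source only becomes collectible because the local `value` and the caller's `input_dict` entry are dropped before `.to()` runs.

```python
import weakref

class Source:
    """Stand-in for a float16 source tensor."""
    pass

class MockParams4bit:
    """Toy stand-in for bitsandbytes Params4bit: holds a reference to the
    source data (like _make_subclass sharing storage) until .to() swaps
    the storage for quantized data."""
    def __init__(self, data):
        self.data = data                 # shared "storage" reference
    def to(self, device):
        self.data = ("4bit", device)     # _quantize(): replace the storage
        return self

def convert(input_dict, device):
    value = input_dict["weight"]
    tracker = weakref.ref(value)         # instrumentation only
    param = MockParams4bit(value)
    target_device = device               # capture before dropping references
    del value                            # fix: drop the local reference
    input_dict.clear()                   # fix: drop the caller's reference
    param = param.to(target_device)      # last reference dies inside .to()
    return param, tracker

param, tracker = convert({"weight": Source()}, "cuda:0")
print(tracker() is None)  # True: the float16 stand-in was freed
```

Without the `del value` / `input_dict.clear()` lines, `tracker()` would still return the object after `.to()`, mirroring the pinned float16 allocation described above.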
Verifies that float16 tensors are freed during Bnb4bitQuantize.convert() by checking that peak GPU memory stays below the float16 model size. Before the fix in the previous commit, float16 data accumulated on GPU (peak > fp16_size). After the fix, peak stays below fp16_size because each float16 tensor is freed when _quantize() replaces it with 4-bit.
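The real regression test needs a GPU and `torch.cuda.max_memory_allocated`; as a rough CPU-only analogue (purely illustrative, not the actual test), `tracemalloc` shows the same peak-memory signature: a loader that pins every "float16" buffer peaks above the full fp16 size, while one that frees each buffer after "quantization" stays well below it.

```python
import tracemalloc

def load_quantized(n_weights, weight_bytes, leak):
    """Toy loader: each 'float16 weight' is a byte buffer; quantization
    replaces it with a buffer one quarter the size."""
    retained = []   # simulates references pinned during loading
    model = []
    for _ in range(n_weights):
        fp16 = bytearray(weight_bytes)
        if leak:
            retained.append(fp16)               # buggy path: float16 stays alive
        model.append(bytes(weight_bytes // 4))  # 4-bit result
        del fp16                                # fixed path: float16 freed here
    return model, retained

def peak_bytes(leak):
    tracemalloc.start()
    load_quantized(n_weights=50, weight_bytes=100_000, leak=leak)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak

fp16_size = 50 * 100_000
print(peak_bytes(leak=True) > fp16_size)    # True: leaked peak exceeds fp16 size
print(peak_bytes(leak=False) < fp16_size)   # True: fixed peak stays below it
```

The GPU test follows the same shape, with `torch.cuda.reset_peak_memory_stats()` before loading and `torch.cuda.max_memory_allocated()` compared against the float16 model size.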
Member
Can you check this? #44576

Contributor
[For maintainers] Suggested jobs to run (before merge) run-slow: bnb

Author
@SunMarc yeah that seems to have addressed it, closing

Contributor
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=44728&sha=ac20e3
What does this PR do?
Fixes a GPU memory leak in `Bnb4bitQuantize.convert()` where float16 source tensors are never freed during 4-bit quantized model loading via `from_pretrained`, causing OOM on models whose float16 size exceeds GPU VRAM.

Root cause: `convert()` creates a `Params4bit` from a float16 tensor and calls `.to(device)` to trigger quantization. Although `_quantize()` replaces the parameter's storage with 4-bit data, the original float16 GPU allocation is never freed because:

- `Params4bit.__new__` shares storage with the float16 `data` via `torch.Tensor._make_subclass`
- `value` in `convert()` holds a second reference
- `input_dict` (from the caller) holds a third reference via the tensor list

These references keep the float16 storage alive until `convert()` returns, but by then the next weight has already been allocated on GPU. Over hundreds of weights, float16 data accumulates linearly: a 35B model (~67 GB float16) OOMs a 32 GB GPU at ~48% loaded, despite 4-bit reducing the final model to ~17 GB.

Confirmed via instrumentation: `_quantize()` runs 300+ times with a `torch.cuda.memory_allocated()` delta of zero, meaning the float16 storage is never released.

Fix: capture `target_device`, then `del value` and `input_dict.clear()` before calling `.to()`. This drops all external references so `_quantize()` can free the float16 storage when it replaces it with 4-bit data.

Testing
Regression test added in `tests/quantization/bnb/test_4bit.py::Bnb4BitPeakMemoryTest`.

The test loads `facebook/opt-350m` with `load_in_4bit=True` and asserts that peak GPU memory stays below the float16 model size. Before the fix, float16 tensors accumulated, so the peak exceeded the float16 size (~889 MB). After the fix, the peak is ~611 MB vs. a 662 MB float16 size, confirming float16 is freed per-weight during quantization.

Before submitting
Who can review?
@SunMarc — quantization
@Cyrilvallez — model loading (`from_pretrained`, `core_model_loading.py`)