Fix float16 memory leak during 4-bit quantized model loading#44728

Closed
ajmeese7 wants to merge 4 commits into huggingface:main from TargetPackage:fix/bnb4bit-float16-memory-leak

Conversation

@ajmeese7

What does this PR do?

Fixes a GPU memory leak in Bnb4bitQuantize.convert() where float16 source tensors are never freed during 4-bit quantized model loading via from_pretrained, causing OOM on models whose float16 size exceeds GPU VRAM.

Root cause: convert() creates a Params4bit from a float16 tensor and calls .to(device) to trigger quantization. Although _quantize() replaces the parameter's storage with 4-bit data, the original float16 GPU allocation is never freed because:

  1. Params4bit.__new__ shares storage with the float16 data via torch.Tensor._make_subclass
  2. The local variable value in convert() holds a second reference
  3. input_dict (from the caller) holds a third reference via the tensor list

These references keep the float16 storage alive until convert() returns, but by then the next weight has already been allocated on GPU. Over hundreds of weights, float16 data accumulates linearly — a 35B model (~67 GB float16) OOMs a 32 GB GPU at ~48% loaded, despite 4-bit reducing the final model to ~17 GB.

Confirmed via instrumentation: _quantize() runs 300+ times with torch.cuda.memory_allocated() delta of zero — the float16 storage is never released.
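The three-reference pattern can be illustrated with a plain-Python analog (no torch or bitsandbytes; `FakeStorage`, `convert_leaky`, and the dict layout are hypothetical stand-ins): as long as any of the references survives, CPython's reference counting cannot release the allocation, even after the "quantized" parameter has swapped its storage.

```python
import weakref

class FakeStorage:
    """Stand-in for a float16 tensor's GPU allocation (hypothetical)."""

def convert_leaky(input_dict):
    value = input_dict["weight"][0]  # reference 2: the local variable
    param = [value]                  # reference 1: Params4bit-style shared storage (stand-in)
    probe = weakref.ref(value)
    param[0] = b"4bit"               # "_quantize()" swaps in 4-bit data...
    # ...but `value` (local) and input_dict's tensor list still pin the
    # original allocation until convert() returns:
    assert probe() is not None
    return param[0], probe

inputs = {"weight": [FakeStorage()]}  # reference 3: the caller's tensor list
quantized, probe = convert_leaky(inputs)
print(probe() is not None)  # True: input_dict still holds the fp16 stand-in
inputs["weight"].clear()
print(probe() is None)      # True: only now is the allocation released
```

The `weakref` probe plays the role of `torch.cuda.memory_allocated()` here: it reports whether the underlying allocation is still alive without itself keeping it alive.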

Fix: Capture target_device, then del value and input_dict.clear() before calling .to(). This drops all external references so _quantize() can free the float16 storage when it replaces it with 4-bit data.
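The shape of the fix can be sketched with the same stand-ins (again hypothetical names, not the actual bitsandbytes code): capture what is still needed, then drop the local and caller-side references before the step that replaces the shared storage.

```python
import weakref

class FakeStorage:
    """Stand-in for a float16 tensor's GPU allocation (hypothetical)."""

def convert_fixed(input_dict):
    value = input_dict["weight"][0]
    target_device = "cuda:0"   # capture the device before dropping references
    probe = weakref.ref(value)
    param = [value]            # stand-in for Params4bit sharing the fp16 storage
    del value                  # drop the local reference
    input_dict.clear()         # drop the caller-side reference
    # Stand-in for `.to(target_device)` triggering `_quantize()`: swapping the
    # shared storage for 4-bit data now releases the fp16 allocation, because
    # `param` held the last reference.
    param[0] = b"4bit"
    return param[0], target_device, probe

inputs = {"weight": [FakeStorage()]}
quantized, device, probe = convert_fixed(inputs)
print(probe() is None)  # True: fp16 stand-in freed before the next weight loads
```

The key ordering constraint is that `target_device` must be read off before `del value`, since the device was originally derived from the tensor that is being released.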

Testing

Regression test added in tests/quantization/bnb/test_4bit.py::Bnb4BitPeakMemoryTest:

RUN_SLOW=1 python -m pytest tests/quantization/bnb/test_4bit.py::Bnb4BitPeakMemoryTest -xvs

The test loads facebook/opt-350m with load_in_4bit=True and asserts that peak GPU memory stays below the float16 model size. Before the fix, float16 tensors accumulated, so peak GPU memory (~889 MB) exceeded the ~662 MB float16 size. After the fix, peak is ~611 MB against the 662 MB float16 size, confirming float16 is freed per-weight during quantization.
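The assertion style of the regression test can be sketched with a CPU analog, using `tracemalloc` in place of `torch.cuda.max_memory_allocated()` (the `load_model`/`peak_bytes` helpers and the sizes are hypothetical, not the actual test code):

```python
import tracemalloc

def load_model(n_weights, weight_bytes, free_fp16):
    """Simulate sequential weight loading; each quantized output is 1/4 size."""
    quantized = []
    fp16_kept = []  # the leak path keeps every float16 source alive
    for _ in range(n_weights):
        fp16 = bytearray(weight_bytes)              # "float16 source tensor"
        quantized.append(bytes(weight_bytes // 4))  # "4-bit result"
        if free_fp16:
            del fp16              # the fix: free the source before the next weight
        else:
            fp16_kept.append(fp16)  # the leak: sources accumulate linearly
    return quantized

def peak_bytes(free_fp16):
    tracemalloc.start()
    load_model(n_weights=8, weight_bytes=1_000_000, free_fp16=free_fp16)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak

fp16_model_size = 8 * 1_000_000
print(peak_bytes(free_fp16=False) > fp16_model_size)  # True: leak peaks above fp16 size
print(peak_bytes(free_fp16=True) < fp16_model_size)   # True: fix stays well below it
```

The real test uses the same inequality, with `torch.cuda.reset_peak_memory_stats()` before loading and `torch.cuda.max_memory_allocated()` after, compared against the model's float16 footprint.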

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@SunMarc — quantization
@Cyrilvallez — model loading (from_pretrained, core_model_loading.py)

`Bnb4bitQuantize.convert()` holds a local reference (`value`) to
the float16 source tensor while calling `Params4bit(...).to(device)`.
Even though `_quantize()` replaces the parameter's storage with 4-bit
data, the original float16 allocation stays pinned on GPU because
`Params4bit.__new__` shares storage via `_make_subclass` and `value`
keeps a second reference alive until `convert()` returns.

Because `convert()` is called once per weight in a sequential loop,
float16 data accumulates linearly. A 35B model (~67 GB float16) OOMs
a 32 GB GPU at ~48% loaded, despite 4-bit reducing the final size to
~17 GB. Confirmed via instrumentation: `_quantize()` runs 300+ times
with `memory_allocated()` delta of zero — the float16 is never freed.

Fix: capture `target_device`, then `del value` and `input_dict.clear()`
before calling `.to()`. This drops all external references so
`_quantize()` can release the float16 storage when it replaces it
with 4-bit data.

Verifies that float16 tensors are freed during Bnb4bitQuantize.convert()
by checking that peak GPU memory stays below the float16 model size.

Before the fix in the previous commit, float16 data accumulated on GPU
(peak > fp16_size). After the fix, peak stays below fp16_size because
each float16 tensor is freed when _quantize() replaces it with 4-bit.
@SunMarc
Member

SunMarc commented Mar 16, 2026

Can you check this? #44576
This is where the issue might come from.

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: bnb

@ajmeese7
Author

@SunMarc yeah that seems to have addressed it, closing

@ajmeese7 ajmeese7 closed this Mar 16, 2026
@ajmeese7 ajmeese7 deleted the fix/bnb4bit-float16-memory-leak branch March 16, 2026 20:47
@github-actions
Contributor

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=44728&sha=ac20e3
