Fix float16 memory leak during 4-bit quantized model loading #44728
Closed
ajmeese7 wants to merge 4 commits into huggingface:main from
Conversation
`Bnb4bitQuantize.convert()` holds a local reference (`value`) to the float16 source tensor while calling `Params4bit(...).to(device)`. Even though `_quantize()` replaces the parameter's storage with 4-bit data, the original float16 allocation stays pinned on GPU because `Params4bit.__new__` shares storage via `_make_subclass` and `value` keeps a second reference alive until `convert()` returns. Because `convert()` is called once per weight in a sequential loop, float16 data accumulates linearly. A 35B model (~67 GB float16) OOMs a 32 GB GPU at ~48% loaded, despite 4-bit reducing the final size to ~17 GB. Confirmed via instrumentation: `_quantize()` runs 300+ times with `memory_allocated()` delta of zero — the float16 is never freed. Fix: capture `target_device`, then `del value` and `input_dict.clear()` before calling `.to()`. This drops all external references so `_quantize()` can release the float16 storage when it replaces it with 4-bit data.
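A minimal, torch-free sketch of the ordering the fix imposes (all names below are stand-ins, not the real transformers/bitsandbytes APIs): a mock `Params4bit` keeps a reference to the source data the way `_make_subclass` shares storage, and a `weakref` shows the source only becomes collectible because the local `value` and the caller's `input_dict` entry are dropped before `.to()` runs.

```python
import weakref

class Source:
    """Stand-in for a float16 source tensor."""
    pass

class MockParams4bit:
    """Toy stand-in for bitsandbytes Params4bit: holds a reference to the
    source data (like _make_subclass sharing storage) until .to() swaps
    the storage for quantized data."""
    def __init__(self, data):
        self.data = data                 # shared "storage" reference
    def to(self, device):
        self.data = ("4bit", device)     # _quantize(): replace the storage
        return self

def convert(input_dict, device):
    value = input_dict["weight"]
    tracker = weakref.ref(value)         # instrumentation only
    param = MockParams4bit(value)
    target_device = device               # capture before dropping references
    del value                            # fix: drop the local reference
    input_dict.clear()                   # fix: drop the caller's reference
    param = param.to(target_device)      # last reference dies inside .to()
    return param, tracker

param, tracker = convert({"weight": Source()}, "cuda:0")
print(tracker() is None)  # True: the float16 stand-in was freed
```

Without the `del value` / `input_dict.clear()` lines, `tracker()` would still return the object after `.to()`, mirroring the pinned float16 allocation described above.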
Verifies that float16 tensors are freed during Bnb4bitQuantize.convert() by checking that peak GPU memory stays below the float16 model size. Before the fix in the previous commit, float16 data accumulated on GPU (peak > fp16_size). After the fix, peak stays below fp16_size because each float16 tensor is freed when _quantize() replaces it with 4-bit.
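The real regression test needs a GPU and `torch.cuda.max_memory_allocated`; as a rough CPU-only analogue (purely illustrative, not the actual test), `tracemalloc` shows the same peak-memory signature: a loader that pins every "float16" buffer peaks above the full fp16 size, while one that frees each buffer after "quantization" stays well below it.

```python
import tracemalloc

def load_quantized(n_weights, weight_bytes, leak):
    """Toy loader: each 'float16 weight' is a byte buffer; quantization
    replaces it with a buffer one quarter the size."""
    retained = []   # simulates references pinned during loading
    model = []
    for _ in range(n_weights):
        fp16 = bytearray(weight_bytes)
        if leak:
            retained.append(fp16)               # buggy path: float16 stays alive
        model.append(bytes(weight_bytes // 4))  # 4-bit result
        del fp16                                # fixed path: float16 freed here
    return model, retained

def peak_bytes(leak):
    tracemalloc.start()
    load_quantized(n_weights=50, weight_bytes=100_000, leak=leak)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak

fp16_size = 50 * 100_000
print(peak_bytes(leak=True) > fp16_size)    # True: leaked peak exceeds fp16 size
print(peak_bytes(leak=False) < fp16_size)   # True: fixed peak stays below it
```

The GPU test follows the same shape, with `torch.cuda.reset_peak_memory_stats()` before loading and `torch.cuda.max_memory_allocated()` compared against the float16 model size.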
Member
Can you check this? #44576

Contributor
[For maintainers] Suggested jobs to run (before merge) run-slow: bnb

Author
@SunMarc yeah that seems to have addressed it, closing

Contributor
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=44728&sha=ac20e3
What does this PR do?
Fixes a GPU memory leak in `Bnb4bitQuantize.convert()` where float16 source tensors are never freed during 4-bit quantized model loading via `from_pretrained`, causing OOM on models whose float16 size exceeds GPU VRAM.

Root cause: `convert()` creates a `Params4bit` from a float16 tensor and calls `.to(device)` to trigger quantization. Although `_quantize()` replaces the parameter's storage with 4-bit data, the original float16 GPU allocation is never freed because:

- `Params4bit.__new__` shares storage with the float16 `data` via `torch.Tensor._make_subclass`
- `value` in `convert()` holds a second reference
- `input_dict` (from the caller) holds a third reference via the tensor list

These references keep the float16 storage alive until `convert()` returns, but by then the next weight has already been allocated on GPU. Over hundreds of weights, float16 data accumulates linearly: a 35B model (~67 GB float16) OOMs a 32 GB GPU at ~48% loaded, despite 4-bit reducing the final model to ~17 GB.

Confirmed via instrumentation: `_quantize()` runs 300+ times with a `torch.cuda.memory_allocated()` delta of zero, meaning the float16 storage is never released.

Fix: capture `target_device`, then `del value` and `input_dict.clear()` before calling `.to()`. This drops all external references so `_quantize()` can free the float16 storage when it replaces it with 4-bit data.

Testing
Regression test added in `tests/quantization/bnb/test_4bit.py::Bnb4BitPeakMemoryTest`.

The test loads `facebook/opt-350m` with `load_in_4bit=True` and asserts that peak GPU memory stays below the float16 model size. Before the fix, float16 tensors accumulated, so the peak exceeded the float16 size (~889 MB). After the fix, the peak is ~611 MB vs. a 662 MB float16 size, confirming float16 is freed per-weight during quantization.

Before submitting
Who can review?
@SunMarc — quantization
@Cyrilvallez — model loading (`from_pretrained`, `core_model_loading.py`)