🚀[FEA]: Clarify DDP dataset memory behavior and recommend lazy loading for large Zarr datasets #1550
Labels: ? - Needs Triage (need team to review and classify), enhancement (new feature or request)
Description
Is this a new feature, an improvement, or a change to existing functionality?
Improvement
How would you describe the priority of this feature request
Medium
Please provide a clear description of the problem you would like to solve.
I want to report a DDP memory behavior that we validated and solved in practice.
Root cause:
- In multi-GPU DDP, DistributedSampler correctly shards sample indices across ranks.
- However, each rank/process still instantiates its own dataset object.
- In the crash data path, eager loading materializes large Zarr arrays during dataset construction.
- With 8 GPUs (8 ranks), this effectively multiplies host RAM usage by rank count and can lead to OOM/SIGKILL.
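To make the root cause concrete, here is a minimal stdlib-only sketch of how DistributedSampler-style sharding splits indices across ranks while each rank still constructs its own full dataset object. The `shard_indices` helper and the numbers are illustrative, not PhysicsNeMo code:

```python
def shard_indices(num_samples, world_size, rank):
    # Mimics DistributedSampler's round-robin assignment: rank r gets
    # indices r, r + world_size, r + 2*world_size, ...
    return list(range(rank, num_samples, world_size))

world_size = 8   # e.g. 8 GPUs / 8 ranks
num_samples = 32
shards = [shard_indices(num_samples, world_size, r) for r in range(world_size)]

# Each rank iterates over only 1/8 of the sample indices...
assert all(len(s) == num_samples // world_size for s in shards)
# ...but every rank still runs the dataset constructor itself. If that
# constructor eagerly materializes the full Zarr arrays in host RAM,
# total host memory is ~8x one copy, not 1/8.
```

This is why lowering the sampler's per-rank workload does nothing for memory: the cost is paid at dataset construction, once per rank.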
Key point:
- The problem is not DistributedSampler itself.
- The expensive part is eager dataset/reader initialization in CPU RAM.
Validated solution:
- Switching to lazy loading (on-demand reads during iteration) reduced CPU RAM significantly.
- After this change, distributed training became stable and scalable.
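A sketch of the eager-vs-lazy pattern we are describing. The class names are hypothetical and the in-memory lists stand in for bulk Zarr reads (the real calls, shown only in comments, would be along the lines of `zarr.open(...)`):

```python
class EagerDataset:
    """Anti-pattern under DDP: bulk read at construction time."""
    def __init__(self, n):
        # Stand-in for e.g. self.data = zarr.open(store)["field"][:]
        # which materializes the whole array; every rank pays full RAM.
        self.data = [i * 2 for i in range(n)]

    def __getitem__(self, idx):
        return self.data[idx]

class LazyDataset:
    """Preferred under DDP: keep only a handle, read per sample."""
    def __init__(self, n):
        # Stand-in for e.g. self.arr = zarr.open(store)["field"]
        # which opens metadata only and reads no chunk data yet.
        self.n = n

    def __getitem__(self, idx):
        # On-demand read during iteration, e.g. return self.arr[idx]
        return idx * 2
```

Both classes return the same samples; only the construction-time memory footprint differs, which is exactly the quantity multiplied by the rank count under DDP.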
Request for PhysicsNeMo:
- Explicitly document in the crash/DDP example that DistributedSampler shards indices only, not dataset memory.
- Recommend lazy loading for large Zarr datasets in multi-GPU runs.
- Optionally add guidance for rank-aware early sharding of files/runs to reduce duplicate work across ranks.
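The optional rank-aware early sharding could be as simple as slicing the file/run list before anything is opened, so each rank never touches files it will not train on. A hypothetical sketch (`shard_files` and the file names are illustrative):

```python
def shard_files(files, rank, world_size):
    # Round-robin slice: rank r keeps files r, r + world_size, ...
    # Applied before any Zarr store is opened, so per-rank work and
    # metadata scanning shrink by a factor of world_size.
    return files[rank::world_size]

files = [f"run_{i:03d}.zarr" for i in range(10)]
assert shard_files(files, 0, 4) == ["run_000.zarr", "run_004.zarr", "run_008.zarr"]
```

Note this changes the sampling semantics (each rank sees a fixed file subset), so it complements rather than replaces DistributedSampler and would need care with shuffling across epochs.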
Relevant paths in repo:
- examples/structural_mechanics/crash/train.py
- examples/structural_mechanics/crash/datapipe.py
- examples/structural_mechanics/crash/zarr_reader.py
Describe any alternatives you have considered
- Lowering nproc_per_node: reduces memory pressure but also reduces throughput.
- Setting num_workers=0: reduces DataLoader worker overhead but does not remove the per-rank dataset duplication.
- Keeping eager loading: still causes high RAM in multi-GPU DDP.
- Lazy loading: this is the solution that worked in our tests.