🚀[FEA]: Clarify DDP dataset memory behavior and recommend lazy loading for large Zarr datasets #1550

@gncurtosi

Description

Is this a new feature, an improvement, or a change to existing functionality?

Improvement

How would you describe the priority of this feature request?

Medium

Please provide a clear description of the problem you would like to solve.

I would like to report a DDP memory issue that we diagnosed and resolved in practice.

Root cause:

  • In multi-GPU DDP, DistributedSampler correctly shards sample indices across ranks.
  • However, each rank/process still instantiates its own dataset object.
  • In the crash data path, eager loading materializes large Zarr arrays during dataset construction.
  • With 8 GPUs (8 ranks), this effectively multiplies host RAM usage by rank count and can lead to OOM/SIGKILL.
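The behavior above can be demonstrated with a minimal sketch (the `EagerDataset` class and its sizes are illustrative, not the actual crash datapipe): `DistributedSampler` shards only the sample indices, while every rank still constructs its own full dataset object, so eager loading is duplicated once per rank.

```python
import numpy as np
from torch.utils.data import Dataset, DistributedSampler

class EagerDataset(Dataset):
    """The problematic pattern: loads the whole array into RAM at construction."""
    def __init__(self, n_samples=16, sample_shape=(4,)):
        # In the crash example this would be a large Zarr store read eagerly;
        # a small NumPy array stands in for it here.
        self.data = np.zeros((n_samples, *sample_shape), dtype=np.float32)
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        return self.data[idx]

world_size = 8
n_samples = 16
per_rank_indices = []
for rank in range(world_size):
    # Each DDP rank runs this same code: 8 dataset copies, 8x the eager memory.
    ds = EagerDataset(n_samples)
    sampler = DistributedSampler(ds, num_replicas=world_size, rank=rank, shuffle=False)
    per_rank_indices.append(list(sampler))

# The sampler shards indices (disjoint, covering the dataset exactly once) ...
assert sorted(i for ix in per_rank_indices for i in ix) == list(range(n_samples))
# ... but it does nothing to shard the dataset's memory footprint.
```

Passing `num_replicas`/`rank` explicitly lets this run without initializing `torch.distributed`; in a real DDP job these come from the process group.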

Key point:

  • The problem is not DistributedSampler itself.
  • The expensive part is eager dataset/reader initialization in CPU RAM.

Validated solution:

  • Switching to lazy loading (on-demand reads during iteration) significantly reduced CPU RAM usage.
  • After this change, distributed training became stable and scalable.
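A hedged sketch of the lazy pattern that resolved the OOM: store only metadata at construction and defer array reads to `__getitem__`, opening the backing store on first access. `LazyDataset` and `path` are illustrative names; a NumPy memmap stands in for the Zarr store so the sketch is self-contained (`zarr.open` is similarly lazy, reading chunks only on slicing).

```python
import numpy as np
from torch.utils.data import Dataset

class LazyDataset(Dataset):
    def __init__(self, path, n_samples):
        self.path = path          # store only metadata at construction
        self.n_samples = n_samples
        self._arr = None          # backing store opened lazily, once per worker/rank

    def _open(self):
        if self._arr is None:
            # For Zarr this would be: self._arr = zarr.open(self.path, mode="r")
            self._arr = np.load(self.path, mmap_mode="r")
        return self._arr

    def __len__(self):
        return self.n_samples

    def __getitem__(self, idx):
        # Reads only the requested sample from disk.
        return np.asarray(self._open()[idx])

# Demo with a small on-disk array.
import os, tempfile
tmp = os.path.join(tempfile.mkdtemp(), "data.npy")
np.save(tmp, np.arange(32, dtype=np.float32).reshape(8, 4))

ds = LazyDataset(tmp, 8)
assert ds._arr is None                              # nothing loaded at construction
assert ds[3].tolist() == [12.0, 13.0, 14.0, 15.0]   # read happens on access
```

Because construction is nearly free, instantiating the dataset once per rank (and again per DataLoader worker) no longer multiplies host RAM.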

Request for PhysicsNeMo:

  1. Explicitly document in the crash/DDP example that DistributedSampler shards indices only, not dataset memory.
  2. Recommend lazy loading for large Zarr datasets in multi-GPU runs.
  3. Optionally add guidance for rank-aware early sharding of files/runs to reduce duplicate work across ranks.
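For point 3, the guidance could be as simple as the following sketch of rank-aware early sharding: split the file/run list by rank before any dataset is built, so each rank only ever opens its own shard. The names are illustrative; in a DDP job `rank` and `world_size` would come from `torch.distributed.get_rank()` / `get_world_size()`.

```python
def shard_files(files, rank, world_size):
    # Round-robin assignment keeps shards balanced when file sizes are similar.
    return files[rank::world_size]

files = [f"run_{i:03d}.zarr" for i in range(10)]
world_size = 4
shards = [shard_files(files, r, world_size) for r in range(world_size)]

# Every file lands on exactly one rank; no duplicate reads across ranks.
assert sorted(f for s in shards for f in s) == sorted(files)
```

With per-rank sharding done up front, a `DistributedSampler` is no longer needed for those files, since the data is already disjoint across ranks.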

Relevant paths in repo:

  • examples/structural_mechanics/crash/train.py
  • examples/structural_mechanics/crash/datapipe.py
  • examples/structural_mechanics/crash/zarr_reader.py

Describe any alternatives you have considered

  • Lowering nproc_per_node: reduces memory pressure but also reduces throughput.
  • Setting num_workers=0: reduces per-worker overhead but does not remove the per-rank dataset duplication.
  • Keeping eager loading: still causes high RAM in multi-GPU DDP.
  • Lazy loading: this is the solution that worked in our tests.

Labels

? - Needs Triage, enhancement
