🚀[FEA]: Clarify DDP dataset memory behavior and recommend lazy loading for large Zarr datasets #1550
Labels: ? - Needs Triage (need team to review and classify), enhancement (new feature or request)
Description
Is this a new feature, an improvement, or a change to existing functionality?
Improvement
How would you describe the priority of this feature request
Medium
Please provide a clear description of the problem you would like to solve.
I want to report a DDP memory behavior that we validated and solved in practice.
Root cause:
- In multi-GPU DDP, DistributedSampler correctly shards sample indices across ranks.
- However, each rank/process still instantiates its own dataset object.
- In the crash data path, eager loading materializes large Zarr arrays during dataset construction.
- With 8 GPUs (8 ranks), this effectively multiplies host RAM usage by rank count and can lead to OOM/SIGKILL.
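To make the root cause concrete, here is a minimal stdlib-only sketch of how DistributedSampler-style sharding splits indices across ranks while each rank still constructs its own full dataset object. The `shard_indices` helper and the numbers are illustrative, not PhysicsNeMo code:

```python
def shard_indices(num_samples, world_size, rank):
    # Mimics DistributedSampler's round-robin assignment: rank r gets
    # indices r, r + world_size, r + 2*world_size, ...
    return list(range(rank, num_samples, world_size))

world_size = 8   # e.g. 8 GPUs / 8 ranks
num_samples = 32
shards = [shard_indices(num_samples, world_size, r) for r in range(world_size)]

# Each rank iterates over only 1/8 of the sample indices...
assert all(len(s) == num_samples // world_size for s in shards)
# ...but every rank still runs the dataset constructor itself. If that
# constructor eagerly materializes the full Zarr arrays in host RAM,
# total host memory is ~8x one copy, not 1/8.
```

This is why lowering the sampler's per-rank workload does nothing for memory: the cost is paid at dataset construction, once per rank.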
Key point:
- The problem is not DistributedSampler itself.
- The expensive part is eager dataset/reader initialization in CPU RAM.
Validated solution:
- Switching to lazy loading (on-demand reads during iteration) reduced CPU RAM significantly.
- After this change, distributed training became stable and scalable.
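A sketch of the eager-vs-lazy pattern we are describing. The class names are hypothetical and the in-memory lists stand in for bulk Zarr reads (the real calls, shown only in comments, would be along the lines of `zarr.open(...)`):

```python
class EagerDataset:
    """Anti-pattern under DDP: bulk read at construction time."""
    def __init__(self, n):
        # Stand-in for e.g. self.data = zarr.open(store)["field"][:]
        # which materializes the whole array; every rank pays full RAM.
        self.data = [i * 2 for i in range(n)]

    def __getitem__(self, idx):
        return self.data[idx]

class LazyDataset:
    """Preferred under DDP: keep only a handle, read per sample."""
    def __init__(self, n):
        # Stand-in for e.g. self.arr = zarr.open(store)["field"]
        # which opens metadata only and reads no chunk data yet.
        self.n = n

    def __getitem__(self, idx):
        # On-demand read during iteration, e.g. return self.arr[idx]
        return idx * 2
```

Both classes return the same samples; only the construction-time memory footprint differs, which is exactly the quantity multiplied by the rank count under DDP.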
Request for PhysicsNeMo:
- Explicitly document in the crash/DDP example that DistributedSampler shards indices only, not dataset memory.
- Recommend lazy loading for large Zarr datasets in multi-GPU runs.
- Optionally add guidance for rank-aware early sharding of files/runs to reduce duplicate work across ranks.
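The optional rank-aware early sharding could be as simple as slicing the file/run list before anything is opened, so each rank never touches files it will not train on. A hypothetical sketch (`shard_files` and the file names are illustrative):

```python
def shard_files(files, rank, world_size):
    # Round-robin slice: rank r keeps files r, r + world_size, ...
    # Applied before any Zarr store is opened, so per-rank work and
    # metadata scanning shrink by a factor of world_size.
    return files[rank::world_size]

files = [f"run_{i:03d}.zarr" for i in range(10)]
assert shard_files(files, 0, 4) == ["run_000.zarr", "run_004.zarr", "run_008.zarr"]
```

Note this changes the sampling semantics (each rank sees a fixed file subset), so it complements rather than replaces DistributedSampler and would need care with shuffling across epochs.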
Relevant paths in repo:
- examples/structural_mechanics/crash/train.py
- examples/structural_mechanics/crash/datapipe.py
- examples/structural_mechanics/crash/zarr_reader.py
Describe any alternatives you have considered
- Lowering nproc_per_node: reduces memory pressure but also reduces throughput.
- Setting num_workers=0: reduces DataLoader worker overhead but does not remove the per-rank dataset duplication.
- Keeping eager loading: still causes high RAM in multi-GPU DDP.
- Lazy loading: this is the solution that worked in our tests.