Skip to content

fix: handle PreservingDataset in dataset concatenation#2213

Open
RudimentaryChef wants to merge 1 commit intoNVIDIA-NeMo:mainfrom
RudimentaryChef:fix/preserving-dataset-concatenation
Open

fix: handle PreservingDataset in dataset concatenation#2213
RudimentaryChef wants to merge 1 commit intoNVIDIA-NeMo:mainfrom
RudimentaryChef:fix/preserving-dataset-concatenation

Conversation

@RudimentaryChef
Copy link
Copy Markdown

@RudimentaryChef RudimentaryChef commented Apr 4, 2026

Summary

  • Fixes ValueError when using use_preserving_dataset=True with SFT training (Training dataload fails when using preserving dataset #2116)
  • concatenate_datasets from HuggingFace only accepts HF Dataset objects, but PreservingDataset is a custom class — causing a crash during data setup
  • Added a merge_datasets() helper in nemo_rl/data/utils.py that detects PreservingDataset instances and merges their .data lists directly, falling back to concatenate_datasets for standard HF datasets
  • Replaced call sites in examples/run_sft.py (train + val) and nemo_rl/data/utils.py (train + val)

Closes #2116

Test plan

  • Unit tests added for merge_datasets covering: multiple PreservingDatasets, multiple HF Datasets, single dataset, and heterogeneous structure preservation
  • Verify SFT training with use_preserving_dataset=True and a tool-call dataset completes without error

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot bot commented Apr 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@RudimentaryChef RudimentaryChef requested review from a team as code owners April 4, 2026 17:11
@RudimentaryChef RudimentaryChef force-pushed the fix/preserving-dataset-concatenation branch from 806fe33 to 8a5e5b2 Compare April 4, 2026 19:15
)

`concatenate_datasets` from HuggingFace only accepts HF Dataset objects,
causing a ValueError when `use_preserving_dataset=True`. Added a
`merge_datasets` helper that detects PreservingDataset instances and
merges their data lists directly instead.

Closes NVIDIA-NeMo#2116

Signed-off-by: Adi Krish <143638558+RudimentaryChef@users.noreply.github.com>
@RudimentaryChef RudimentaryChef force-pushed the fix/preserving-dataset-concatenation branch from 8a5e5b2 to cca50d1 Compare April 4, 2026 19:43
@chtruong814 chtruong814 added the needs-follow-up Issue needs follow-up label Apr 6, 2026
Copy link
Copy Markdown
Contributor

@yuki-97 yuki-97 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @RudimentaryChef , thanks for helping fix this!

We have two PRs addressing the same issue right now. After reviewing both, we've decided to move forward with #2218 as it also supports merging PreservingDataset with other datasets.

I'm going to close this one to avoid redundancy. We really appreciate your effort and hope to see more from you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Training dataload fails when using preserving dataset

4 participants