[OMNIML-4021]: align local JSONL loading with HF datasets path + keep original behaviour#1345

Open
shengliangxu wants to merge 12 commits into main from shengliangx/local-jsonl-dataset

Conversation

@shengliangxu
Collaborator

@shengliangxu shengliangxu commented Apr 24, 2026

Summary

Local .jsonl paths fed to get_dataset_samples / get_dataset_dataloader previously went through a text-key-only reader, while HF dataset names flowed through an auto-preprocess pipeline that recognizes messages / conversations / prompt / text / input columns. This split meant a calibration dataset behaved differently depending on whether it lived on HF Hub or on disk.

This PR routes local .jsonl through HF's json builder and the same auto-preprocess pipeline so the format is detected from columns, not the source. The legacy text-field reader is preserved as a fallback for files where the HF builder fails (e.g. PyArrow schema unification across heterogeneous rows). Existing callers passing a plain {"text": ...} JSONL file keep working unchanged; chat-shaped JSONL now works without a separate code path.
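The column-driven detection can be sketched in plain Python. This is a hypothetical stand-in for the auto-preprocess step, with chat rows flattened naively instead of going through a real chat template; the helper name and exact precedence are assumptions, not the PR's implementation:

```python
def flatten_sample(sample: dict) -> str:
    """Hypothetical sketch of column-based format detection (not the real code)."""
    # Chat-shaped rows: the real pipeline hands these to the tokenizer's chat
    # template; here we just join message contents for illustration.
    chat_key = next(
        (k for k in ("messages", "conversations") if sample.get(k) is not None), None
    )
    if chat_key is not None:
        return "\n".join(m["content"] for m in sample[chat_key])
    if sample.get("prompt") is not None:
        parts = [sample["prompt"]]
        parts.extend(
            sample[k]
            for k in ("completion", "response", "output")
            if sample.get(k) is not None
        )
        return "\n".join(parts)
    if sample.get("text") is not None:
        return sample["text"]
    if sample.get("input") is not None:
        parts = [sample["input"]]
        if sample.get("output") is not None:
            parts.append(sample["output"])
        return "\n".join(parts)
    raise ValueError(f"Unrecognized columns: {sorted(sample)}")
```

Because detection keys off columns rather than the source, the same dict-shaped row behaves identically whether it came from HF Hub or a local file.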

Changes

  • modelopt/torch/utils/dataset_utils.py
    • Route .jsonl paths through load_dataset(path="json", data_files=...) + _auto_preprocess_sample.
    • Wrap the HF load/preprocess block in try/except; on failure for .jsonl, fall back to get_jsonl_text_samples. If the fallback also fails, re-raise the original HF error so the diagnostic is preserved.
    • Update docstrings; clarify the splits = ["train"] invariant for HF's file-based builders.
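The routing-plus-fallback control flow described in these bullets can be sketched as follows; `hf_loader` and `read_text_jsonl` are hypothetical stand-ins for the HF 'json' builder path and the legacy `get_jsonl_text_samples` reader:

```python
import json
from warnings import warn


def read_text_jsonl(path, num_samples, key="text"):
    # Legacy reader: one JSON object per line, pull a single text key.
    samples = []
    with open(path) as f:
        for line in f:
            samples.append(json.loads(line)[key])
            if len(samples) >= num_samples:
                break
    return samples


def load_jsonl_samples(path, num_samples, hf_loader):
    """hf_loader stands in for the HF 'json' builder + auto-preprocess path."""
    try:
        return hf_loader(path, num_samples)
    except Exception as hf_error:
        try:
            samples = read_text_jsonl(path, num_samples)
        except Exception:
            # If the fallback also fails, surface the original HF diagnostic.
            raise hf_error
        warn(
            f"HF 'json' builder failed ({type(hf_error).__name__}); "
            "fell back to legacy text-field reader."
        )
        return samples
```

A plain `{"text": ...}` file succeeds on either path, which is what keeps existing callers unchanged.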

Tests

tests/unit/torch/utils/test_dataset_utils.py — three new test classes (18 cases, all passing):

  • TestLocalJsonlLoading — text / messages / conversations / prompt+completion / input+output columns; num_samples honored; tools kwarg forwarded; ValueError on unrecognized columns; legacy text-key fallback on schema-unification failure.
  • TestGetDatasetDataloaderBlending — single JSONL, list of JSONL files concatenated, mixed-format JSONL files blended, length-mismatch assertion.
  • TestHfTinyDataset — uses hf-internal-testing/dataset_with_data_files (10 rows x {train, test}) for end-to-end coverage: single split, multiple splits, default split, HF -> JSONL -> reload round-trip, two-HF-dataset blending, HF + local-JSONL mixing.

Test plan

  • pytest tests/unit/torch/utils/test_dataset_utils.py — 28 passed
  • mypy modelopt/torch/utils/dataset_utils.py — no new errors
  • ruff check — clean

Summary by CodeRabbit

  • New Features

    • Local JSONL files now load through the HuggingFace JSON pipeline so local and HF datasets share the same auto-detection and preprocessing; lists can mix dataset identifiers and JSONL paths (local JSONL restricted to train).
  • Bug Fixes

    • Fields explicitly set to null/None are no longer treated as present text, avoiding incorrect concatenation or chat-template application.
    • Narrower legacy fallback with clearer warnings when schema/parsing failures occur.
  • Documentation

    • Updated to describe mixed-source handling and concatenation-before-tokenization.
  • Tests

    • Expanded unit and integration tests covering schemas, sparse/null handling, concatenation, sample limits, tool forwarding, and fallback paths.

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented Apr 24, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai Bot commented Apr 24, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Local .jsonl inputs are routed through HuggingFace load_dataset("json", data_files=..., streaming=True) (when available) so they undergo the same auto-detection and preprocessing as HF datasets; detection now treats present-but-falsy fields (e.g. empty strings) as present and only concatenates optional fields when they are non-None. On HF JSON/Arrow/schema failures a warning is emitted and a legacy line-by-line text fallback is used. Tests and docs updated; the get_dataset_dataloader docs now note that HF ids and .jsonl paths can be mixed.

Changes

Dataset utils implementation

| Layer / File(s) | Summary |
| --- | --- |
| Detection / data-shape logic (modelopt/torch/utils/dataset_utils.py) | _auto_preprocess_sample now checks presence via sample.get(key) is not None, so present-but-falsy values (e.g. empty strings) are treated as present; tools and optional completion/response/output parts are concatenated only when non-None. |
| Loading core (modelopt/torch/utils/dataset_utils.py) | get_dataset_samples lazily imports datasets and calls load_dataset("json", data_files=..., streaming=True) for local .jsonl files and unregistered dataset ids, so local JSONL flows through the same auto-preprocess path as HF datasets. |
| Error handling / fallback (modelopt/torch/utils/dataset_utils.py) | Restricts the legacy fallback to JSON/Arrow schema failures by catching DatasetGenerationError/ArrowInvalid, emits a warning when falling back, and re-raises other exceptions. Validates that local .jsonl requests cannot ask for non-train splits and rejects an empty split=[]. |
| Docs / API note (modelopt/torch/utils/dataset_utils.py) | The get_dataset_dataloader docstring now states that list inputs may mix HF ids and .jsonl paths; samples are concatenated before tokenization. |
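The `sample.get(key) is not None` presence check differs from plain truthiness only for present-but-empty values, which a short illustration makes concrete:

```python
sample = {"prompt": "", "completion": "A", "tools": None}

# Plain truthiness treats the present-but-empty prompt as absent...
assert not sample.get("prompt")
# ...while the None-aware check treats it as present:
assert sample.get("prompt") is not None
# Explicit nulls are still skipped either way:
assert sample.get("tools") is None
```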

Tests and configuration

| Layer / File(s) | Summary |
| --- | --- |
| Test utilities / fake tokenizer (tests/unit/torch/utils/test_dataset_utils.py) | Adds _FakeTokenizer and helpers for local JSONL test fixtures. |
| Unit tests: local JSONL detection (tests/unit/torch/utils/test_dataset_utils.py) | Covers auto-detection/formatting for text, chat-style messages/conversations (including per-row tools forwarded to tokenizer.apply_chat_template), prompt+completion, input+output, handling of None vs empty-string fields, and num_samples truncation. |
| Unit tests: fallback / edge cases (tests/unit/torch/utils/test_dataset_utils.py) | Covers unrecognized-column failure, empty split=[], and the legacy text fallback when the HF JSON builder raises JSON/Arrow errors. |
| Dataloader blending tests (tests/unit/torch/utils/test_dataset_utils.py) | Validates get_dataset_dataloader concatenation across a single JSONL, multiple JSONLs, and mixed JSONL formats, and asserts that a dataset_name/num_samples list-length mismatch raises. |
| Integration / network tests (tests/unit/torch/utils/test_dataset_utils.py) | Adds live HF tiny-dataset integration: split semantics (default train), equivalence between direct HF loading and dumped JSONL, and dataloader blending across HF datasets and HF + local JSONL. |
| Pytest markers (pyproject.toml) | Adds an integration pytest marker and reorders the nearby manual and network markers. |
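A fake tokenizer of the kind listed above can be sketched like this (a hypothetical shape; the real `_FakeTokenizer` may differ):

```python
class FakeTokenizer:
    """Records chat-template calls so tests can assert tools forwarding."""

    def __init__(self):
        self.seen_tools = []

    def apply_chat_template(self, messages, tools=None, tokenize=False, **kwargs):
        self.seen_tools.append(tools)
        # Flatten messages deterministically instead of real templating.
        return "\n".join(f"{m['role']}: {m['content']}" for m in messages)
```

Tests can then assert on `seen_tools` to verify that per-row tools reach the template call.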

Sequence Diagram

sequenceDiagram
    participant Client
    participant Detector as get_dataset_samples
    participant HF as HuggingFace LoadDataset
    participant Preprocess as AutoPreprocess
    participant Fallback as LegacyReader
    participant Return

    Client->>Detector: call with HF ids and/or `.jsonl` paths
    Detector->>Detector: classify inputs (HF id vs local `.jsonl`)
    alt includes `.jsonl`
        Detector->>HF: load_dataset("json", data_files=..., streaming=true)
        alt HF load + schema unification succeeds
            HF->>Preprocess: stream rows for auto-detection
            Preprocess->>Return: emit normalized samples
        else HF raises JSON/Arrow schema error
            HF--xDetector: raises DatasetGenerationError/ArrowInvalid
            Detector->>Fallback: read file line-by-line, extract `text`
            Fallback->>Return: emit fallback samples (with warning)
        end
    end
    alt includes HF ids
        Detector->>HF: load_dataset(id, split=...)
        HF->>Preprocess: return samples
        Preprocess->>Return: emit normalized samples
    end
    Return->>Client: concatenated sample stream

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 57.58%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title directly reflects the main change: aligning local JSONL loading with HuggingFace datasets while preserving original behaviour, as confirmed by the raw_summary and PR objectives. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Security Anti-Patterns | ✅ Passed | No security anti-patterns detected: no unsafe torch.load, numpy.load, trust_remote_code, eval/exec, or # nosec comments; no new unsafe dependencies added. |



Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Contributor

github-actions Bot commented Apr 24, 2026

PR Preview Action v1.8.1


🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1345/

Built to branch gh-pages at 2026-05-06 00:11 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@shengliangxu shengliangxu changed the title feat: align local JSONL loading with HF datasets path + keep original behaviour [OMNIML-4021]: align local JSONL loading with HF datasets path + keep original behaviour Apr 24, 2026
@shengliangxu shengliangxu marked this pull request as ready for review April 24, 2026 23:40
@shengliangxu shengliangxu requested a review from a team as a code owner April 24, 2026 23:40
@codecov

codecov Bot commented Apr 24, 2026

Codecov Report

❌ Patch coverage is 91.22807% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.88%. Comparing base (f34f488) to head (77fe4cc).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| modelopt/torch/utils/dataset_utils.py | 91.22% | 5 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1345       +/-   ##
===========================================
+ Coverage   60.49%   76.88%   +16.39%     
===========================================
  Files         475      476        +1     
  Lines       51238    51336       +98     
===========================================
+ Hits        30997    39471     +8474     
+ Misses      20241    11865     -8376     
| Flag | Coverage Δ | |
| --- | --- | --- |
| examples | 41.80% <50.87%> (+9.53%) | ⬆️ |
| gpu | 59.82% <50.87%> (+43.81%) | ⬆️ |
| regression | 15.19% <0.00%> (+0.06%) | ⬆️ |
| unit | 52.65% <89.47%> (+0.11%) | ⬆️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.

Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
tests/unit/torch/utils/test_dataset_utils.py (1)

445-506: Consider adding a network-skip marker for CI reliability.

These tests download from hf-internal-testing/* datasets, which are stable but still involve network I/O. If CI environments have unreliable network access, these tests may flake.

Consider marking them with a custom pytest marker (e.g., @pytest.mark.network) so they can be selectively skipped in constrained environments, while still running in standard CI.

Example marker usage:

```python
@pytest.mark.network
class TestHfTinyDataset:
    """End-to-end coverage with a real (tiny) HF dataset."""
    ...
```

Then configure pytest.ini or pyproject.toml:

```ini
markers =
    network: tests that require network access to HuggingFace Hub
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/torch/utils/test_dataset_utils.py` around lines 445 - 506, Add a
network-skippable marker to the TestHfTinyDataset test class so CI can skip HF
Hub network tests when desired: annotate the TestHfTinyDataset class with a
pytest marker such as `@pytest.mark.network` (referencing the TestHfTinyDataset
class and tests like test_load_single_split_directly,
test_dataloader_blending_two_hf_datasets, etc.), and add the corresponding
marker declaration ("network: tests that require network access to HuggingFace
Hub") to pytest.ini or pyproject.toml so pytest recognizes it.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e2e1cb6a-ec09-41ed-ab40-d1b508f61467

📥 Commits

Reviewing files that changed from the base of the PR and between 7c80d85 and 8f2f0ae.

📒 Files selected for processing (2)
  • modelopt/torch/utils/dataset_utils.py
  • tests/unit/torch/utils/test_dataset_utils.py

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
tests/unit/torch/utils/test_dataset_utils.py (1)

511-589: Consider gating live Hub tests as integration to reduce unit-test flakiness

These cases are valuable, but remote dataset availability/network instability can make tests/unit non-deterministic. Consider marking them (e.g., @pytest.mark.integration) or guarding behind an env flag in CI.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/torch/utils/test_dataset_utils.py` around lines 511 - 589, The
live HF dataset tests in TestHfTinyDataset (e.g.,
test_load_single_split_directly, test_load_multiple_splits_directly,
test_default_split_is_train, test_download_to_jsonl_then_load,
test_dataloader_blending_two_hf_datasets,
test_dataloader_mixing_hf_and_local_jsonl) are network-dependent and should be
gated as integration tests: either add a pytest marker (e.g., annotate the
TestHfTinyDataset class with `@pytest.mark.integration` or decorate each test) and
register that marker in pytest.ini, or guard execution with an environment flag
check (e.g., skip unless os.environ.get("RUN_INTEGRATION_TESTS") is set) so unit
test runs remain deterministic in CI.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt/torch/utils/dataset_utils.py`:
- Around line 351-354: The current warn call logs the full local_dataset_path
and raw exception text (e), which can leak sensitive filesystem or error payload
details; update the warning in the dataset load fallback so it does not include
raw paths or exception messages—use a sanitized identifier (e.g.,
Path(local_dataset_path).name or other non-sensitive dataset id) and only
include the exception type (type(e).__name__) or a generic phrase like "loader
error" instead of including {e} or the full path; modify the warn(...)
invocation accordingly (referencing warn, local_dataset_path, and e) so logs
avoid exposing sensitive content.
- Around line 337-356: The current except block around HF JSON parsing (the
except Exception as e handling is_jsonl) falls back to get_jsonl_text_samples
too broadly and can hide caller errors like invalid splits; update the except
block in dataset_utils.py to only attempt the legacy text-field fallback when
the original HF error is clearly a schema/column inference or parsing error, and
re-raise immediately for errors that indicate caller/config issues (e.g., the
exception message or type references a missing/unsupported split, "split", "not
found", or similar); implement this by inspecting e (str(e) and/or its type)
before calling get_jsonl_text_samples(local_dataset_path, num_samples,
key="text") so invalid split/config errors are not silently recovered from.

In `@tests/unit/torch/utils/test_dataset_utils.py`:
- Around line 395-408: The test currently only checks output values so it can
pass without exercising the fallback branch; modify
test_legacy_text_fallback_on_hf_builder_failure to force the HF json builder to
raise and/or assert the fallback path was executed: patch or monkeypatch the HF
loading function used inside get_dataset_samples (e.g., the
datasets.load_dataset / DatasetBuilder call) to raise an exception, then call
get_dataset_samples(path, num_samples=3) and verify both the returned samples
and that the legacy reader function inside get_dataset_samples (the code path
handling text-field fallback) was invoked (for example by spying/mocking that
fallback helper or checking a logged/returned indicator). Ensure references to
get_dataset_samples, test_legacy_text_fallback_on_hf_builder_failure, and
_write_jsonl are used to locate the code to change.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 24c5dbc5-fa7a-4e97-9451-2c88a59bb346

📥 Commits

Reviewing files that changed from the base of the PR and between 8f2f0ae and b874912.

📒 Files selected for processing (2)
  • modelopt/torch/utils/dataset_utils.py
  • tests/unit/torch/utils/test_dataset_utils.py

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Contributor

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (1)
modelopt/torch/utils/dataset_utils.py (1)

391-394: ⚠️ Potential issue | 🟠 Major

Sanitize fallback warning to avoid leaking local path/error payloads.

This warning still includes local_dataset_path and raw exception text e, which can leak sensitive filesystem details or data-derived parser payloads for untrusted JSONL inputs.

Proposed fix
-        warn(
-            f"Failed to load {local_dataset_path} via the HF 'json' builder "
-            f"({type(e).__name__}: {e}); fell back to legacy text-field reader."
-        )
+        safe_name = Path(local_dataset_path).name
+        warn(
+            f"Failed to load JSONL file '{safe_name}' via HF 'json' builder "
+            f"({type(e).__name__}); fell back to legacy text-field reader."
+        )

As per coding guidelines, SECURITY.md requires avoiding sensitive paths/content in warnings/errors.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/utils/dataset_utils.py` around lines 391 - 394, The warning
leaks sensitive filesystem paths and raw exception text; update the warn call in
the dataset loader (the block referencing local_dataset_path and exception e) to
remove local_dataset_path and the full exception payload and instead emit a
sanitized message (e.g., "Failed to load dataset via HF 'json' builder; falling
back to legacy reader") and, if needed, include only the exception class name
(type(e).__name__) or a short safe error code; if you must preserve full details
for debugging, send them to a debug-level logger instead of the user-facing
warn.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 130f9e2e-967d-4ec7-93fc-d93a676bc488

📥 Commits

Reviewing files that changed from the base of the PR and between b874912 and 9a94525.

📒 Files selected for processing (1)
  • modelopt/torch/utils/dataset_utils.py

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
modelopt/torch/utils/dataset_utils.py (1)

181-212: ⚠️ Potential issue | 🟠 Major

Don't conflate None with other falsy column values.

These sample.get(...) checks also skip valid empty-string fields, so rows like {"prompt": "", "completion": "A"} or {"input": "", "output": "A"} now miss auto-detection and fail even though only None needs special handling here.

Proposed fix
-    chat_key = next((k for k in ("messages", "conversations") if sample.get(k)), None)
+    def _has_non_null_value(key: str) -> bool:
+        return key in sample and sample[key] is not None
+
+    chat_key = next((k for k in ("messages", "conversations") if _has_non_null_value(k)), None)
@@
-    if sample.get("prompt"):
+    if _has_non_null_value("prompt"):
         parts = [sample["prompt"]]
-        parts.extend(sample[k] for k in ("completion", "response", "output") if sample.get(k))
+        parts.extend(
+            sample[k]
+            for k in ("completion", "response", "output")
+            if _has_non_null_value(k)
+        )
         return "\n".join(parts)
@@
-    if sample.get("text"):
+    if _has_non_null_value("text"):
         return sample["text"]
@@
-    if sample.get("input"):
+    if _has_non_null_value("input"):
         parts = [sample["input"]]
-        if sample.get("output"):
+        if _has_non_null_value("output"):
             parts.append(sample["output"])
         return "\n".join(parts)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/utils/dataset_utils.py` around lines 181 - 212, The checks
using truthy sample.get(...) incorrectly treat valid empty-string fields as
missing; change them to explicit None checks so empty strings are accepted:
replace the chat_key detection with next((k for k in
("messages","conversations") if sample.get(k) is not None), None), change tools
handling to if tools is not None, and update the subsequent conditionals (if
sample.get("prompt"), if sample.get("text"), if sample.get("input"), and any
sample.get("completion"/"response"/"output" checks) to use "is not None" so
tokenizer.apply_chat_template, the prompt/completion join logic, and
input/output handling treat "" as valid while still skipping None.
♻️ Duplicate comments (1)
modelopt/torch/utils/dataset_utils.py (1)

397-400: ⚠️ Potential issue | 🟠 Major

Sanitize the JSONL fallback warning.

This still logs the full local path and raw exception text for untrusted dataset inputs. Please keep the warning to a safe identifier plus the exception type.

Proposed fix
-        warn(
-            f"Failed to load {local_dataset_path} via the HF 'json' builder "
-            f"({type(e).__name__}: {e}); fell back to legacy text-field reader."
-        )
+        warn(
+            f"Failed to load JSONL file '{Path(local_dataset_path).name}' via the HF "
+            f"'json' builder ({type(e).__name__}); fell back to legacy text-field reader."
+        )

As per coding guidelines, SECURITY.md says: “Be careful not to log sensitive data (paths, tokens, proprietary content).”
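The sanitization rests on `pathlib` keeping only the final path component:

```python
from pathlib import Path

# Only the basename would reach the warning; parent directories stay private.
safe_name = Path("/home/user/private/calib.jsonl").name
assert safe_name == "calib.jsonl"
```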

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/utils/dataset_utils.py` around lines 397 - 400, The warning
currently logs the full local path and raw exception text; change the warn call
in dataset_utils.py to avoid sensitive data by removing the full path and
exception message and instead log a safe identifier plus the exception type. Use
a safe identifier such as os.path.basename(local_dataset_path) or an existing
dataset identifier variable (e.g., dataset_name) in place of local_dataset_path,
and keep only type(e).__name__ (remove {e}) in the message for the warn(...)
call to produce: "Failed to load dataset '<safe_identifier>' via the HF 'json'
builder (<ExceptionType>); fell back to legacy text-field reader."
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt/torch/utils/dataset_utils.py`:
- Around line 343-373: After normalizing splits via _normalize_splits (or
defaulting to ["train"]), validate that the resulting splits list is not empty
and raise a ValueError if it is; place this check before calling load_dataset
and before computing num_per_split to avoid a ZeroDivisionError. Update the
logic around the splits variable (used when calling load_dataset(..., split=s)
and when computing num_per_split) to fail fast with a clear message like "empty
splits list after normalization" rather than allowing a division by zero.

In `@tests/unit/torch/utils/test_dataset_utils.py`:
- Around line 531-609: The TestHfTinyDataset tests use real HF data (_HF_TINY,
_hf_dump_to_jsonl) and call get_dataset_samples and get_dataset_dataloader which
invoke datasets.load_dataset; move these tests out of unit tests or mark them as
integration tests (e.g., add `@pytest.mark.integration` to TestHfTinyDataset or
each test) OR replace live calls by monkeypatching datasets.load_dataset (or
patch get_dataset_samples/get_dataset_dataloader to use local fixtures) so the
unit suite is hermetic and no network download occurs.
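The fail-fast split validation suggested in these comments can be sketched as follows (the function name and flag are assumptions, not the PR's actual code):

```python
def validate_splits(splits, is_local_jsonl):
    """Normalize splits, failing fast before any division by len(splits)."""
    splits = list(splits) if splits is not None else ["train"]
    if not splits:
        raise ValueError("empty splits list after normalization")
    if is_local_jsonl and splits != ["train"]:
        raise ValueError("local .jsonl files only expose a 'train' split")
    return splits
```

Running this before computing a per-split sample budget avoids the ZeroDivisionError path entirely.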


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2976e442-d11c-40e3-ac14-90f57ac4afab

📥 Commits

Reviewing files that changed from the base of the PR and between 9a94525 and 409eb38.

📒 Files selected for processing (2)
  • modelopt/torch/utils/dataset_utils.py
  • tests/unit/torch/utils/test_dataset_utils.py

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
@shengliangxu shengliangxu requested a review from a team as a code owner May 5, 2026 00:46
@shengliangxu
Collaborator Author

@coderabbitai review

@coderabbitai
Contributor

coderabbitai Bot commented May 5, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
modelopt/torch/utils/dataset_utils.py (1)

324-330: 💤 Low value

Minor inconsistency: tools check differs between registered and auto-preprocess paths.

Line 326 uses truthiness (if tools:) while line 194 in _auto_preprocess_sample uses if tools is not None:. This means an empty list tools=[] would be passed to apply_chat_template in the unregistered/JSONL path but not in the registered dataset path.

If the behavior should be consistent (skip empty tools in both paths), consider aligning to:

-                if tools:
+                if tools is not None:

Or if empty tools should be excluded in both paths, update line 194 instead.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/utils/dataset_utils.py` around lines 324 - 330, The truthiness
check for tools in the unregistered/JSONL path (inside the return that calls
tokenizer.apply_chat_template) differs from the registered path's check in
_auto_preprocess_sample; make the checks consistent by changing the conditional
around the local tools variable to use "if tools is not None:" (so both paths
treat an empty list the same), ensuring tokenizer.apply_chat_template receives
tools only when tools is not None; update the check near the call to
tokenizer.apply_chat_template in dataset_utils.py to match
_auto_preprocess_sample.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a36e2492-01e7-4e78-995c-85d463fd37f5

📥 Commits

Reviewing files that changed from the base of the PR and between 409eb38 and 6bcf408.

📒 Files selected for processing (3)
  • modelopt/torch/utils/dataset_utils.py
  • pyproject.toml
  • tests/unit/torch/utils/test_dataset_utils.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/unit/torch/utils/test_dataset_utils.py

