[OMNIML-4021]: align local JSONL loading with HF datasets path + keep original behaviour #1345
shengliangxu wants to merge 12 commits into main from
Conversation
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough
Changes
Dataset utils implementation
Tests and configuration
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant Detector as get_dataset_samples
    participant HF as HuggingFace LoadDataset
    participant Preprocess as AutoPreprocess
    participant Fallback as LegacyReader
    participant Return
    Client->>Detector: call with HF ids and/or `.jsonl` paths
    Detector->>Detector: classify inputs (HF id vs local `.jsonl`)
    alt includes `.jsonl`
        Detector->>HF: load_dataset(builder="json", data_files=..., streaming=true)
        alt HF load + schema unification succeeds
            HF->>Preprocess: stream rows for auto-detection
            Preprocess->>Return: emit normalized samples
        else HF raises JSON/Arrow schema error
            HF--xDetector: raises DatasetGenerationError/ArrowInvalid
            Detector->>Fallback: read file line-by-line, extract `text`
            Fallback->>Return: emit fallback samples (with warning)
        end
    end
    alt includes HF ids
        Detector->>HF: load_dataset(id, split=...)
        HF->>Preprocess: return samples
        Preprocess->>Return: emit normalized samples
    end
    Return->>Client: concatenated sample stream
```
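The flow above can be sketched end-to-end in miniature. Everything here is illustrative: `hf_json_load` stands in for `datasets.load_dataset("json", ...)` (including its schema-unification failure mode), and the function names are hypothetical, not the module's API:

```python
import json
from warnings import warn

def hf_json_load(lines):
    # Stand-in for the HF "json" builder: PyArrow fails to unify schemas
    # when rows carry different key sets, which we mimic here.
    rows = [json.loads(line) for line in lines if line.strip()]
    if len({frozenset(row) for row in rows}) > 1:
        raise ValueError("schema unification failed")
    return rows

def legacy_text_reader(lines):
    # Original behaviour: keep only the "text" field, row by row.
    out = []
    for line in lines:
        if line.strip():
            row = json.loads(line)
            if "text" in row:
                out.append(row["text"])
    return out

def load_jsonl_samples(lines):
    # Try the HF-style path first; on failure, warn and fall back.
    try:
        return hf_json_load(lines)
    except Exception as e:
        warn(f"HF-style load failed ({type(e).__name__}); using legacy reader")
        return legacy_text_reader(lines)
```

With homogeneous rows the HF-style path wins; a file mixing `{"text": ...}` and `{"prompt": ...}` rows trips the stand-in's schema check and drops to the legacy reader, mirroring the else branch in the diagram.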
Estimated Code Review Effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ❌ 1 failed (1 warning) | ✅ 5 passed
Codecov Report: ❌ Patch coverage is
Additional details and impacted files

@@           Coverage Diff            @@
##             main    #1345      +/-   ##
===========================================
+ Coverage   60.49%   76.88%   +16.39%
===========================================
  Files         475      476        +1
  Lines       51238    51336       +98
===========================================
+ Hits        30997    39471     +8474
+ Misses      20241    11865     -8376
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
🧹 Nitpick comments (1)
tests/unit/torch/utils/test_dataset_utils.py (1)
445-506: Consider adding a network-skip marker for CI reliability. These tests download from hf-internal-testing/* datasets, which are stable but still involve network I/O. If CI environments have unreliable network access, these tests may flake. Consider marking them with a custom pytest marker (e.g., `@pytest.mark.network`) so they can be selectively skipped in constrained environments, while still running in standard CI.
Example marker usage:

```python
@pytest.mark.network
class TestHfTinyDataset:
    """End-to-end coverage with a real (tiny) HF dataset."""
    ...
```

Then configure pytest.ini or pyproject.toml:

```ini
markers =
    network: tests that require network access to HuggingFace Hub
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/torch/utils/test_dataset_utils.py` around lines 445 - 506, Add a network-skippable marker to the TestHfTinyDataset test class so CI can skip HF Hub network tests when desired: annotate the TestHfTinyDataset class with a pytest marker such as `@pytest.mark.network` (referencing the TestHfTinyDataset class and tests like test_load_single_split_directly, test_dataloader_blending_two_hf_datasets, etc.), and add the corresponding marker declaration ("network: tests that require network access to HuggingFace Hub") to pytest.ini or pyproject.toml so pytest recognizes it.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: e2e1cb6a-ec09-41ed-ab40-d1b508f61467
📒 Files selected for processing (2)
- modelopt/torch/utils/dataset_utils.py
- tests/unit/torch/utils/test_dataset_utils.py
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Actionable comments posted: 3
🧹 Nitpick comments (1)
tests/unit/torch/utils/test_dataset_utils.py (1)
511-589: Consider gating live Hub tests as integration to reduce unit-test flakiness. These cases are valuable, but remote dataset availability/network instability can make tests/unit non-deterministic. Consider marking them (e.g., `@pytest.mark.integration`) or guarding behind an env flag in CI.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/torch/utils/test_dataset_utils.py` around lines 511 - 589, The live HF dataset tests in TestHfTinyDataset (e.g., test_load_single_split_directly, test_load_multiple_splits_directly, test_default_split_is_train, test_download_to_jsonl_then_load, test_dataloader_blending_two_hf_datasets, test_dataloader_mixing_hf_and_local_jsonl) are network-dependent and should be gated as integration tests: either add a pytest marker (e.g., annotate the TestHfTinyDataset class with `@pytest.mark.integration` or decorate each test) and register that marker in pytest.ini, or guard execution with an environment flag check (e.g., skip unless os.environ.get("RUN_INTEGRATION_TESTS") is set) so unit test runs remain deterministic in CI.
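The env-flag variant the comment mentions can be as small as this. The flag name RUN_INTEGRATION_TESTS is taken from the comment; the pytest wiring shown in the trailing comment is the usual pattern, sketched rather than copied from this repo:

```python
import os

def run_network_tests() -> bool:
    # Opt-in gate: live HuggingFace Hub tests only run when the caller
    # explicitly sets RUN_INTEGRATION_TESTS to a non-empty value.
    return bool(os.environ.get("RUN_INTEGRATION_TESTS"))

# In the test module this would back a skip marker, e.g.:
# pytestmark = pytest.mark.skipif(not run_network_tests(), reason="needs network")
```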
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/utils/dataset_utils.py`:
- Around line 351-354: The current warn call logs the full local_dataset_path
and raw exception text (e), which can leak sensitive filesystem or error payload
details; update the warning in the dataset load fallback so it does not include
raw paths or exception messages—use a sanitized identifier (e.g.,
Path(local_dataset_path).name or other non-sensitive dataset id) and only
include the exception type (type(e).__name__) or a generic phrase like "loader
error" instead of including {e} or the full path; modify the warn(...)
invocation accordingly (referencing warn, local_dataset_path, and e) so logs
avoid exposing sensitive content.
- Around line 337-356: The current except block around HF JSON parsing (the
except Exception as e handling is_jsonl) falls back to get_jsonl_text_samples
too broadly and can hide caller errors like invalid splits; update the except
block in dataset_utils.py to only attempt the legacy text-field fallback when
the original HF error is clearly a schema/column inference or parsing error, and
re-raise immediately for errors that indicate caller/config issues (e.g., the
exception message or type references a missing/unsupported split, "split", "not
found", or similar); implement this by inspecting e (str(e) and/or its type)
before calling get_jsonl_text_samples(local_dataset_path, num_samples,
key="text") so invalid split/config errors are not silently recovered from.
In `@tests/unit/torch/utils/test_dataset_utils.py`:
- Around line 395-408: The test currently only checks output values so it can
pass without exercising the fallback branch; modify
test_legacy_text_fallback_on_hf_builder_failure to force the HF json builder to
raise and/or assert the fallback path was executed: patch or monkeypatch the HF
loading function used inside get_dataset_samples (e.g., the
datasets.load_dataset / DatasetBuilder call) to raise an exception, then call
get_dataset_samples(path, num_samples=3) and verify both the returned samples
and that the legacy reader function inside get_dataset_samples (the code path
handling text-field fallback) was invoked (for example by spying/mocking that
fallback helper or checking a logged/returned indicator). Ensure references to
get_dataset_samples, test_legacy_text_fallback_on_hf_builder_failure, and
_write_jsonl are used to locate the code to change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 24c5dbc5-fa7a-4e97-9451-2c88a59bb346
📒 Files selected for processing (2)
- modelopt/torch/utils/dataset_utils.py
- tests/unit/torch/utils/test_dataset_utils.py
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
♻️ Duplicate comments (1)
modelopt/torch/utils/dataset_utils.py (1)
391-394: ⚠️ Potential issue | 🟠 Major: Sanitize fallback warning to avoid leaking local path/error payloads.
This warning still includes local_dataset_path and raw exception text e, which can leak sensitive filesystem details or data-derived parser payloads for untrusted JSONL inputs.
Proposed fix

```diff
-        warn(
-            f"Failed to load {local_dataset_path} via the HF 'json' builder "
-            f"({type(e).__name__}: {e}); fell back to legacy text-field reader."
-        )
+        safe_name = Path(local_dataset_path).name
+        warn(
+            f"Failed to load JSONL file '{safe_name}' via HF 'json' builder "
+            f"({type(e).__name__}); fell back to legacy text-field reader."
+        )
```

As per coding guidelines, SECURITY.md requires avoiding sensitive paths/content in warnings/errors.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/utils/dataset_utils.py` around lines 391 - 394, The warning leaks sensitive filesystem paths and raw exception text; update the warn call in the dataset loader (the block referencing local_dataset_path and exception e) to remove local_dataset_path and the full exception payload and instead emit a sanitized message (e.g., "Failed to load dataset via HF 'json' builder; falling back to legacy reader") and, if needed, include only the exception class name (type(e).__name__) or a short safe error code; if you must preserve full details for debugging, send them to a debug-level logger instead of the user-facing warn.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 130f9e2e-967d-4ec7-93fc-d93a676bc488
📒 Files selected for processing (1)
modelopt/torch/utils/dataset_utils.py
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/torch/utils/dataset_utils.py (1)
181-212: ⚠️ Potential issue | 🟠 Major: Don't conflate None with other falsy column values.
These sample.get(...) checks also skip valid empty-string fields, so rows like {"prompt": "", "completion": "A"} or {"input": "", "output": "A"} now miss auto-detection and fail even though only None needs special handling here.
Proposed fix

```diff
-    chat_key = next((k for k in ("messages", "conversations") if sample.get(k)), None)
+    def _has_non_null_value(key: str) -> bool:
+        return key in sample and sample[key] is not None
+
+    chat_key = next((k for k in ("messages", "conversations") if _has_non_null_value(k)), None)
@@
-    if sample.get("prompt"):
+    if _has_non_null_value("prompt"):
         parts = [sample["prompt"]]
-        parts.extend(sample[k] for k in ("completion", "response", "output") if sample.get(k))
+        parts.extend(
+            sample[k]
+            for k in ("completion", "response", "output")
+            if _has_non_null_value(k)
+        )
         return "\n".join(parts)
@@
-    if sample.get("text"):
+    if _has_non_null_value("text"):
         return sample["text"]
@@
-    if sample.get("input"):
+    if _has_non_null_value("input"):
         parts = [sample["input"]]
-        if sample.get("output"):
+        if _has_non_null_value("output"):
             parts.append(sample["output"])
         return "\n".join(parts)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/utils/dataset_utils.py` around lines 181 - 212, The checks using truthy sample.get(...) incorrectly treat valid empty-string fields as missing; change them to explicit None checks so empty strings are accepted: replace the chat_key detection with next((k for k in ("messages","conversations") if sample.get(k) is not None), None), change tools handling to if tools is not None, and update the subsequent conditionals (if sample.get("prompt"), if sample.get("text"), if sample.get("input"), and any sample.get("completion"/"response"/"output" checks) to use "is not None" so tokenizer.apply_chat_template, the prompt/completion join logic, and input/output handling treat "" as valid while still skipping None.
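The None-vs-falsy fix can be exercised standalone. This is a simplified stand-in for the sample-flattening logic (chat handling reduced to joining message contents instead of tokenizer.apply_chat_template), not the module's actual code:

```python
def flatten_sample(sample):
    # Explicit None checks so empty-string fields still count as present.
    def present(key):
        return sample.get(key) is not None

    for chat_key in ("messages", "conversations"):
        if present(chat_key):
            # Real code hands these to tokenizer.apply_chat_template.
            return " ".join(m.get("content", "") for m in sample[chat_key])
    if present("prompt"):
        parts = [sample["prompt"]]
        parts += [sample[k] for k in ("completion", "response", "output") if present(k)]
        return "\n".join(parts)
    if present("text"):
        return sample["text"]
    if present("input"):
        parts = [sample["input"]]
        if present("output"):
            parts.append(sample["output"])
        return "\n".join(parts)
    raise ValueError("unrecognized columns")
```

With truthiness checks, {"prompt": "", "completion": "A"} would fall through every branch; with None checks it flattens to "\nA".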
♻️ Duplicate comments (1)
modelopt/torch/utils/dataset_utils.py (1)
397-400: ⚠️ Potential issue | 🟠 Major: Sanitize the JSONL fallback warning.
This still logs the full local path and raw exception text for untrusted dataset inputs. Please keep the warning to a safe identifier plus the exception type.
Proposed fix
```diff
-        warn(
-            f"Failed to load {local_dataset_path} via the HF 'json' builder "
-            f"({type(e).__name__}: {e}); fell back to legacy text-field reader."
-        )
+        warn(
+            f"Failed to load JSONL file '{Path(local_dataset_path).name}' via the HF "
+            f"'json' builder ({type(e).__name__}); fell back to legacy text-field reader."
+        )
```

As per coding guidelines, SECURITY.md says: "Be careful not to log sensitive data (paths, tokens, proprietary content)."
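The sanitized message can be verified in isolation (the path and error text here are made up for the demo):

```python
import warnings
from pathlib import Path
from warnings import warn

local_dataset_path = "/home/user/private/calib_data.jsonl"  # illustrative
e = ValueError("could not parse row 3")                     # illustrative

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Sanitized: file name only, exception type only.
    warn(
        f"Failed to load JSONL file '{Path(local_dataset_path).name}' via the HF "
        f"'json' builder ({type(e).__name__}); fell back to legacy text-field reader."
    )

msg = str(caught[0].message)
```

The directory prefix and the raw parser text never reach the log, while the file name and exception class keep the warning actionable.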
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/utils/dataset_utils.py` around lines 397 - 400, The warning currently logs the full local path and raw exception text; change the warn call in dataset_utils.py to avoid sensitive data by removing the full path and exception message and instead log a safe identifier plus the exception type. Use a safe identifier such as os.path.basename(local_dataset_path) or an existing dataset identifier variable (e.g., dataset_name) in place of local_dataset_path, and keep only type(e).__name__ (remove {e}) in the message for the warn(...) call to produce: "Failed to load dataset '<safe_identifier>' via the HF 'json' builder (<ExceptionType>); fell back to legacy text-field reader."
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/utils/dataset_utils.py`:
- Around line 343-373: After normalizing splits via _normalize_splits (or
defaulting to ["train"]), validate that the resulting splits list is not empty
and raise a ValueError if it is; place this check before calling load_dataset
and before computing num_per_split to avoid a ZeroDivisionError. Update the
logic around the splits variable (used when calling load_dataset(..., split=s)
and when computing num_per_split) to fail fast with a clear message like "empty
splits list after normalization" rather than allowing a division by zero.
In `@tests/unit/torch/utils/test_dataset_utils.py`:
- Around line 531-609: The TestHfTinyDataset tests use real HF data (_HF_TINY,
_hf_dump_to_jsonl) and call get_dataset_samples and get_dataset_dataloader which
invoke datasets.load_dataset; move these tests out of unit tests or mark them as
integration tests (e.g., add `@pytest.mark.integration` to TestHfTinyDataset or
each test) OR replace live calls by monkeypatching datasets.load_dataset (or
patch get_dataset_samples/get_dataset_dataloader to use local fixtures) so the
unit suite is hermetic and no network download occurs.
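The dataset_utils comment above (validating splits before computing num_per_split) amounts to a guard like this. per_split_budget and the ceiling-division layout are hypothetical; only the fail-fast idea comes from the comment:

```python
def per_split_budget(splits, num_samples):
    # Fail fast on an empty list instead of letting len(splits) == 0
    # surface later as a ZeroDivisionError in num_per_split.
    if not splits:
        raise ValueError("empty splits list after normalization")
    num_per_split = -(-num_samples // len(splits))  # ceiling division
    return dict.fromkeys(splits, num_per_split)
```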
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 2976e442-d11c-40e3-ac14-90f57ac4afab
📒 Files selected for processing (2)
- modelopt/torch/utils/dataset_utils.py
- tests/unit/torch/utils/test_dataset_utils.py
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
@coderabbitai review
✅ Actions performed: review triggered.
🧹 Nitpick comments (1)
modelopt/torch/utils/dataset_utils.py (1)
324-330: 💤 Low value. Minor inconsistency: the tools check differs between registered and auto-preprocess paths.
Line 326 uses truthiness (if tools:) while line 194 in _auto_preprocess_sample uses if tools is not None:. This means an empty list tools=[] would be passed to apply_chat_template in the unregistered/JSONL path but not in the registered dataset path. If the behavior should be consistent (treat tools=[] the same in both paths), consider aligning to:

```diff
-    if tools:
+    if tools is not None:
```

Or if empty tools should be excluded in both paths, update line 194 instead.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modelopt/torch/utils/dataset_utils.py` around lines 324 - 330, The truthiness check for tools in the unregistered/JSONL path (inside the return that calls tokenizer.apply_chat_template) differs from the registered path's check in _auto_preprocess_sample; make the checks consistent by changing the conditional around the local tools variable to use "if tools is not None:" (so both paths treat an empty list the same), ensuring tokenizer.apply_chat_template receives tools only when tools is not None; update the check near the call to tokenizer.apply_chat_template in dataset_utils.py to match _auto_preprocess_sample.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: a36e2492-01e7-4e78-995c-85d463fd37f5
📒 Files selected for processing (3)
- modelopt/torch/utils/dataset_utils.py
- pyproject.toml
- tests/unit/torch/utils/test_dataset_utils.py
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/unit/torch/utils/test_dataset_utils.py
Summary
Local .jsonl paths fed to get_dataset_samples/get_dataset_dataloader previously went through a text-key-only reader, while HF dataset names flowed through an auto-preprocess pipeline that recognizes messages/conversations/prompt/text/input columns. This split meant a calibration dataset behaved differently depending on whether it lived on HF Hub or on disk.
This PR routes local .jsonl through HF's json builder and the same auto-preprocess pipeline, so the format is detected from columns, not the source. The legacy text-field reader is preserved as a fallback for files where the HF builder fails (e.g. PyArrow schema unification across heterogeneous rows). Existing callers passing a plain {"text": ...} JSONL file keep working unchanged; chat-shaped JSONL now works without a separate code path.
Changes
- modelopt/torch/utils/dataset_utils.py: route .jsonl paths through load_dataset(path="json", data_files=...) + _auto_preprocess_sample.
- Wrap the load in try/except; on failure for .jsonl, fall back to get_jsonl_text_samples. If the fallback also fails, re-raise the original HF error so the diagnostic is preserved.
- Keep the splits = ["train"] invariant for HF's file-based builders.
Tests
tests/unit/torch/utils/test_dataset_utils.py: three new test classes (18 cases, all passing):
- TestLocalJsonlLoading: text / messages / conversations / prompt+completion / input+output columns; num_samples honored; tools kwarg forwarded; ValueError on unrecognized columns; legacy text-key fallback on schema-unification failure.
- TestGetDatasetDataloaderBlending: single JSONL, list of JSONL files concatenated, mixed-format JSONL files blended, length-mismatch assertion.
- TestHfTinyDataset: uses hf-internal-testing/dataset_with_data_files (10 rows x {train, test}) for end-to-end coverage: single split, multiple splits, default split, HF -> JSONL -> reload round-trip, two-HF-dataset blending, HF + local-JSONL mixing.
Test plan
- pytest tests/unit/torch/utils/test_dataset_utils.py: 28 passed
- mypy modelopt/torch/utils/dataset_utils.py: no new errors
- ruff check: clean
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Tests