perf(pt_expt): use inductor+dynamic for torch.compile training #5393
wanghan-iapcm wants to merge 6 commits into deepmodeling:master
Conversation
Add max_autotune, epilogue_fusion, triton.cudagraphs, max_fusion_size options to match the reference implementation.
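As a rough sketch (not this PR's actual code), the option wiring described above could look like the following. The helper name and merge behavior are assumptions; the option names and values are taken from this PR's description.

```python
# Hypothetical helper illustrating the option set named above; the
# function name and merge strategy are assumptions, the option values
# come from this PR's description.
INDUCTOR_OPTIONS = {
    "shape_padding": True,
    "max_autotune": False,
    "epilogue_fusion": False,
    "triton.cudagraphs": False,
    "max_fusion_size": 8,
}


def build_compile_kwargs(user_opts: dict) -> dict:
    """Return torch.compile kwargs, forcing backend=inductor and dynamic=True."""
    opts = dict(user_opts)        # shallow copy: never mutate caller config
    opts.pop("backend", None)     # backend is always inductor on this path
    opts.pop("dynamic", None)     # dynamic shapes are always enabled
    # user-supplied options override the defaults above
    options = {**INDUCTOR_OPTIONS, **opts.pop("options", {})}
    return {"backend": "inductor", "dynamic": True, "options": options, **opts}
```

Any remaining user keys (e.g. `fullgraph`) pass through unchanged, so the caller's config dict is never modified in place.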
Note: Reviews paused. This branch is under active development, so to avoid overwhelming the author with review comments from an influx of new commits, CodeRabbit has automatically paused this review. This behavior is configurable in the repository settings.
📝 Walkthrough

Training compilation now uses shape-polymorphic symbolic FX tracing and fully dynamic Inductor compilation.

Changes
Sequence Diagram(s)sequenceDiagram
participant Loader as DataLoader
participant Trainer as TrainingLoop
participant Tracer as make_fx (symbolic)
participant Compiler as torch.compile (Inductor, dynamic)
participant Runtime as CompiledModel
Loader->>Trainer: provide sample batch (varying nall)
Trainer->>Tracer: trace forward_lower (symbolic)
Tracer-->>Trainer: symbolic FX graph
Trainer->>Compiler: torch.compile(graph, backend=inductor, dynamic=True)
Compiler-->>Trainer: compiled callable
Trainer->>Runtime: wrap compiled callable
Loader->>Trainer: subsequent batches (varying nall)
Trainer->>Runtime: execute compiled forward per batch
Runtime-->>Trainer: outputs (loss, gradients handled by dynamic runtime)
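The trace-then-compile flow in the diagram can be sketched as follows. This is a minimal illustration, not the trainer's real code: the `forward_lower` signature and the wrapping around it are assumptions.

```python
import torch
from torch.fx.experimental.proxy_tensor import make_fx


def trace_and_compile(forward_lower, example_inputs):
    """Symbolically trace a callable, then compile it with dynamic Inductor."""
    # Shape-polymorphic symbolic trace: sizes become SymInts, so the
    # resulting graph is not specialized to the example batch's shapes.
    graph = make_fx(forward_lower, tracing_mode="symbolic")(*example_inputs)
    # Fully dynamic Inductor compilation handles varying nall per batch
    # without recompiling for each new shape.
    return torch.compile(graph, backend="inductor", dynamic=True)
```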
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
source/tests/pt_expt/test_training.py (1)
167-177: ⚠️ Potential issue | 🟡 Minor: Add the required 60s timeout to this training test.
This new training-path test does not set the repository’s required timeout, so a compile regression can hang CI instead of failing fast.
As per coding guidelines:

`**/tests/**/*training*.py`: Set training test timeouts to 60 seconds maximum for validation purposes, as real training takes hours or days.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@source/tests/pt_expt/test_training.py` around lines 167 - 177, This training test lacks the repository-required 60s timeout; add a 60-second timeout to test_compiled_handles_varying_nall (in class TestCompiledDynamicShapes) by decorating the test method with a timeout decorator (e.g., `@pytest.mark.timeout`(60)) and import pytest at the top if not present, so the test will fail fast instead of hanging CI.
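The fix the prompt describes can be sketched like this, assuming the pytest-timeout plugin is installed; the class and test names mirror the review, and the test body is elided.

```python
import pytest


class TestCompiledDynamicShapes:
    # Repository guideline: training tests must fail fast within 60 s
    # rather than hang CI if compilation regresses.
    @pytest.mark.timeout(60)
    def test_compiled_handles_varying_nall(self):
        ...  # existing test body unchanged
```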
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@source/tests/pt_expt/test_training.py`:
- Around line 199-211: The loop needs to record and assert that the
dynamic-shape path ran by capturing distinct nall values returned by the model
and use the trainer's stepping helper instead of calling optimizer.step()
directly: inside the loop keep a set (e.g. observed_nall) and extract nall from
the wrapper's extra output (the _more_loss/aux dict returned by
trainer.wrapper(**inp, cur_lr=lr, label=lab)), add it to the set, call the
trainer's stepping helper (use Trainer._optimizer_step or the public
trainer.step helper instead of trainer.optimizer.step()), and after the loop
assert that len(observed_nall) >= 2 to prove at least two different nall values
were seen; also retain the existing finite-loss assertions.
---
Outside diff comments:
In `@source/tests/pt_expt/test_training.py`:
- Around line 167-177: This training test lacks the repository-required 60s
timeout; add a 60-second timeout to test_compiled_handles_varying_nall (in class
TestCompiledDynamicShapes) by decorating the test method with a timeout
decorator (e.g., `@pytest.mark.timeout`(60)) and import pytest at the top if not
present, so the test will fail fast instead of hanging CI.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 2f7348ce-2f4d-4c7f-b068-a1d65cf52d0b
📒 Files selected for processing (2)
deepmd/pt_expt/train/training.py
source/tests/pt_expt/test_training.py
```diff
         # Run several training steps — each may have different nall
         trainer.wrapper.train()
-        trainer.optimizer.zero_grad(set_to_none=True)
-        inp, lab = trainer.get_data(is_train=True)
-        lr = trainer.scheduler.get_last_lr()[0]
-        _, loss, more_loss = trainer.wrapper(**inp, cur_lr=lr, label=lab)
-        loss.backward()
-        trainer.optimizer.step()
-
-        # max_nall should have grown beyond 1
-        new_max_nall = compiled_model._max_nall
-        self.assertGreater(new_max_nall, 1)
-
-        # compiled_forward_lower should be a new object
-        self.assertIsNot(
-            compiled_model.compiled_forward_lower,
-            old_compiled_lower,
-        )
+        for _ in range(3):
+            trainer.optimizer.zero_grad(set_to_none=True)
+            inp, lab = trainer.get_data(is_train=True)
+            lr = trainer.scheduler.get_last_lr()[0]
+            _, loss, _more_loss = trainer.wrapper(**inp, cur_lr=lr, label=lab)
+            loss.backward()
+            trainer.optimizer.step()

         # Loss should be a finite scalar
         self.assertFalse(torch.isnan(loss))
         self.assertFalse(torch.isinf(loss))
```
Make this loop prove the dynamic-shape path was exercised.
Right now the test only checks that three compiled steps produce finite loss. It never asserts that any step actually hit a different nall, so it can pass without covering the regression this PR is meant to prevent. It also calls trainer.optimizer.step() directly, which skips Trainer._optimizer_step()’s scheduler update and @torch.compiler.disable guard, so the test no longer matches the real training path. Please record/assert at least two distinct observed nall values and use the trainer helper for stepping.
Minimal fix for the stepping path:

```diff
-            trainer.optimizer.step()
+            trainer._optimizer_step()
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
        # Run several training steps — each may have different nall
        trainer.wrapper.train()
        for _ in range(3):
            trainer.optimizer.zero_grad(set_to_none=True)
            inp, lab = trainer.get_data(is_train=True)
            lr = trainer.scheduler.get_last_lr()[0]
            _, loss, _more_loss = trainer.wrapper(**inp, cur_lr=lr, label=lab)
            loss.backward()
            trainer._optimizer_step()

        # Loss should be a finite scalar
        self.assertFalse(torch.isnan(loss))
        self.assertFalse(torch.isinf(loss))
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@source/tests/pt_expt/test_training.py` around lines 199 - 211, The loop needs
to record and assert that the dynamic-shape path ran by capturing distinct nall
values returned by the model and use the trainer's stepping helper instead of
calling optimizer.step() directly: inside the loop keep a set (e.g.
observed_nall) and extract nall from the wrapper's extra output (the
_more_loss/aux dict returned by trainer.wrapper(**inp, cur_lr=lr, label=lab)),
add it to the set, call the trainer's stepping helper (use
Trainer._optimizer_step or the public trainer.step helper instead of
trainer.optimizer.step()), and after the loop assert that len(observed_nall) >=
2 to prove at least two different nall values were seen; also retain the
existing finite-loss assertions.
♻️ Duplicate comments (1)
deepmd/pt_expt/train/training.py (1)
312: ⚠️ Potential issue | 🟠 Major: Remove or use the unused local at line 312.

`actual_nall` is assigned but never read. This has already been flagged by prior scanning and may fail lint/static-analysis gates.

Suggested change:

```diff
-    actual_nall = ext_coord.shape[1]
     out: dict[str, torch.Tensor] = {}
```

As per coding guidelines:

`**/*.py`: Install the linter and run `ruff check .` before committing changes or the CI will fail.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deepmd/pt_expt/train/training.py` at line 312, Remove the unused local variable assignment actual_nall = ext_coord.shape[1] (it is assigned but never read); either delete this line or replace its usage where intended—search for actual_nall or ext_coord in the surrounding function in training.py and, if a column count is needed, use ext_coord.shape[1] directly or assign to a used variable name so the value is consumed.
🧹 Nitpick comments (1)
deepmd/pt_expt/train/training.py (1)
221-225: Avoid mutating caller-owned `compile_opts` in place.

`pop()`/`setdefault()` currently mutate the dict from the training config. A local copy is safer and avoids side effects if the same config is reused.

Suggested change:

```diff
     def _trace_and_compile(
@@
-        # Override backend and dynamic — the inductor backend with
+        # Work on a local copy to avoid mutating caller-owned config.
+        compile_opts = deepcopy(compile_opts)
+
+        # Override backend and dynamic — the inductor backend with
         # dynamic=True handles varying shapes automatically.
         compile_opts.pop("dynamic", None)
         compile_opts.pop("backend", None)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deepmd/pt_expt/train/training.py` around lines 221 - 225, The code mutates the caller-owned dict compile_opts using pop() and by adding "options"; instead create a shallow copy (e.g., local_compile_opts = compile_opts.copy()) and operate on that copy, then use local_compile_opts.pop("dynamic", None), local_compile_opts.pop("backend", None) and ensure "options" exists on local_compile_opts before setting opts = local_compile_opts["options"]; leave the original compile_opts untouched so callers can reuse it.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@deepmd/pt_expt/train/training.py`:
- Line 312: Remove the unused local variable assignment actual_nall =
ext_coord.shape[1] (it is assigned but never read); either delete this line or
replace its usage where intended—search for actual_nall or ext_coord in the
surrounding function in training.py and, if a column count is needed, use
ext_coord.shape[1] directly or assign to a used variable name so the value is
consumed.
---
Nitpick comments:
In `@deepmd/pt_expt/train/training.py`:
- Around line 221-225: The code mutates the caller-owned dict compile_opts using
pop() and by adding "options"; instead create a shallow copy (e.g.,
local_compile_opts = compile_opts.copy()) and operate on that copy, then use
local_compile_opts.pop("dynamic", None), local_compile_opts.pop("backend", None)
and ensure "options" exists on local_compile_opts before setting opts =
local_compile_opts["options"]; leave the original compile_opts untouched so
callers can reuse it.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 547d2b4a-1e06-47cc-9b70-259db8391305
📒 Files selected for processing (1)
deepmd/pt_expt/train/training.py
Codecov Report

❌ Patch coverage is
Additional details and impacted files

```
@@            Coverage Diff             @@
##           master    #5393      +/-   ##
==========================================
- Coverage   80.33%   80.32%   -0.02%
==========================================
  Files         819      819
  Lines       85356    85316      -40
  Branches     4139     4139
==========================================
- Hits        68571    68529      -42
- Misses      15509    15510       +1
- Partials     1276     1277       +1
```

☔ View full report in Codecov by Sentry.
Force-pushed e332640 to d5fcd61
The env_mat_stat code calls descriptor.get_rcut_smth() and descriptor.get_env_protection() during compute_input_stats, but DescrptBlockRepformers only had the attributes without accessor methods, causing AttributeError when training DPA2 models.
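The fix this commit describes can be sketched as adding plain accessor methods. This is a hedged, stripped-down illustration; the real `DescrptBlockRepformers` has many more parameters, and the default values here are placeholders.

```python
class DescrptBlockRepformers:
    """Stripped-down stand-in showing only the new accessor methods."""

    def __init__(self, rcut_smth: float = 0.5, env_protection: float = 0.0):
        # the attributes already existed; only the accessors were missing
        self.rcut_smth = rcut_smth
        self.env_protection = env_protection

    def get_rcut_smth(self) -> float:
        # called by env_mat_stat during compute_input_stats
        return self.rcut_smth

    def get_env_protection(self) -> float:
        # called by env_mat_stat during compute_input_stats
        return self.env_protection
```

Without these methods, `compute_input_stats` raised `AttributeError` when training DPA2 models, since it calls the accessors rather than reading the attributes directly.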
Force-pushed d5fcd61 to e48497c
Test get_rcut_smth() and get_env_protection() (default and custom values) on DescrptBlockRepformers.
Add silut/custom_silu activation to _torch_activation using native torch ops (torch.sigmoid, torch.tanh, torch.where) for make_fx compatibility. Add pt_expt unit tests and cross-backend consistency tests covering multiple thresholds with inputs spanning both silu and tanh branches.
Summary
- Replace the `aot_eager` backend + manual nall padding with the `inductor` backend + `dynamic=True` for training compilation
- Use `make_fx(tracing_mode="symbolic")` instead of `tracing_mode="real"` to capture shape-polymorphic ops
- Inductor options: `shape_padding=True`, `max_autotune=False`, `epilogue_fusion=False`, `triton.cudagraphs=False`, `max_fusion_size=8`
- Remove the manual padding machinery (`_CompiledModel._recompile`, max_nall estimation from 20 sampled batches, etc.)
- Add `silut`/`custom_silu` activation support to pt_expt `_torch_activation` (needed for DPA3)

Speed Benchmark (V100 GPU)
DPA1 (se_atten_compressible: rcut=6, sel=120, fitting=[240,240,240], float64)
DPA2 (input_torch_small, float32, bs=1)
DPA3 (silut:10.0, static sel, float32, bs=1)
Convergence Benchmark (1000 steps, V100)
DPA1 (se_atten_compressible) — rmse_f_val
DPA2 — val loss / e_rmse / f_rmse
DPA3 (silut:10.0) — val loss / e_rmse / f_rmse
All models converge comparably. Variation is within normal run-to-run noise from random seeds/batch ordering.
Note: DPA3 with `use_dynamic_sel: true` cannot be compiled (data-dependent `int()` in `get_graph_index`). Only static sel configs are supported.

Test plan