
Handle invalid environment rewards explicitly across environments, rollouts, and GRPO#2207

Draft
taivu1998 wants to merge 1 commit into NVIDIA-NeMo:main from taivu1998:tdv/issue-431-invalid-reward-mask

Conversation

@taivu1998

Summary

This draft PR adds explicit invalid-reward handling for verifier-backed environments instead of overloading 0.0 to mean both "valid zero reward" and "reward system failure".

It introduces a batched reward_valid_mask signal on EnvironmentReturn, threads that mask through reward collation and rollout state, and uses it in GRPO advantage/statistics paths so invalid rewards do not contaminate training metrics or policy loss.

Addresses #431.

Root Cause

Today, several verifier-backed environments catch internal verification failures and convert them to numeric 0.0 rewards. Once that happens, the training stack cannot distinguish between:

  • a legitimate zero reward from the environment, and
  • a failed verification/scoring step that was collapsed to zero.

Because the reward path is tensor-based end-to-end, those invalid rewards are then treated as ordinary samples by rollout aggregation, baseline/std computation, logging, and loss masking.
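The contamination is easy to see with hypothetical numbers: a rollout group where one verifier call failed and was collapsed to 0.0 drags the group baseline down, while an explicit validity mask keeps the failure out of the statistics entirely.

```python
# Hypothetical rollout group: the last entry is a verifier *failure*
# that was collapsed to 0.0, not a real zero reward.
rewards = [1.0, 1.0, 0.0, 0.0]
valid = [True, True, True, False]  # the explicit validity mask this PR adds

# Old behavior: the failure is averaged in as an ordinary zero reward.
baseline_old = sum(rewards) / len(rewards)  # 0.5

# New behavior: only valid samples contribute to the baseline.
valid_rewards = [r for r, v in zip(rewards, valid) if v]
baseline_new = sum(valid_rewards) / len(valid_rewards)  # 2/3
```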

What Changed

Environment contract and reward producers

  • Added optional reward_valid_mask to EnvironmentReturn.
  • Updated verifier-backed environments to emit explicit validity alongside numeric rewards:
    • math_environment.py
    • code_jaccard_environment.py
    • vlm_environment.py
  • Kept rewards numeric for compatibility and marked verifier failures as invalid instead of relying on sentinel values like None or NaN.
  • Fixed a bug in HFVerifyWorker where the default verifier implementation was computed but not actually used when the kwarg was omitted.
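The shape of the contract change can be sketched as follows. Only `reward_valid_mask` comes from this PR's description; the other field names and the toy verifier are illustrative placeholders, not the actual NeMo-RL definitions.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass
class EnvironmentReturn:
    rewards: list[float]                         # rewards stay numeric for compatibility
    reward_valid_mask: list[bool] | None = None  # None => downstream treats all as valid


def toy_verify(answer: str) -> float:
    """Stand-in for a real verifier; raises when the backend fails."""
    if answer == "<verifier-error>":
        raise RuntimeError("verifier backend failed")
    return float(answer == "correct")


def score_batch(answers: list[str]) -> EnvironmentReturn:
    rewards, mask = [], []
    for a in answers:
        try:
            rewards.append(toy_verify(a))
            mask.append(True)
        except RuntimeError:
            rewards.append(0.0)  # keep the reward tensor numeric...
            mask.append(False)   # ...but flag the sample as invalid
    return EnvironmentReturn(rewards=rewards, reward_valid_mask=mask)
```

Keeping the reward numeric (rather than using `None`/`NaN` sentinels) means every existing consumer of the reward tensor keeps working; only mask-aware code changes behavior.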

Reward collation and rollout plumbing

  • Normalized missing reward masks to all-valid in calculate_rewards() for backward compatibility.
  • Preserved reward validity ordering when grouping/sorting environment outputs by task.
  • Propagated sample-level reward validity through both sync and async multi-turn rollout paths.
  • Added rollout metrics for invalid reward count/rate.
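The defaulting and metrics steps above can be sketched as below. The real code operates on tensors inside `calculate_rewards()`; plain lists and these helper names are used here only to keep the sketch dependency-free.

```python
def normalize_reward_valid_mask(rewards, reward_valid_mask=None):
    """Default a missing mask to all-valid so legacy environments need no changes."""
    if reward_valid_mask is None:
        return [True] * len(rewards)
    if len(reward_valid_mask) != len(rewards):
        raise ValueError("reward_valid_mask must match rewards in length")
    return [bool(v) for v in reward_valid_mask]


def invalid_reward_metrics(reward_valid_mask):
    """Rollout metrics: how many samples carried an unusable reward."""
    invalid = sum(1 for v in reward_valid_mask if not v)
    return {
        "invalid_reward_count": invalid,
        "invalid_reward_rate": invalid / max(len(reward_valid_mask), 1),
    }
```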

GRPO and advantage computation

  • Added helper normalization/validation for batch loss multipliers and reward-valid masks in grpo.py.
  • Excluded invalid rewards from GRPO baseline/std calculations and reward statistics.
  • Zeroed the effective sample loss multiplier for invalid-reward samples.
  • Logged reward-validity metadata in train/validation JSONL outputs and surfaced invalid reward metrics in console/logger metrics.
  • Updated GRPO, GDPO, and Reinforce++ advantage estimators to respect sample validity.
  • Hardened Reinforce++ so invalid samples remain masked even when KL-in-reward is enabled.
  • Avoided edge-case normalization issues for tiny valid sample sets in GDPO and validation logging.
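A minimal sketch of the masked advantage computation, assuming a simple group-relative normalization; the function name and exact normalization details are illustrative, not the NeMo-RL implementation:

```python
import math


def grpo_advantages(rewards, valid_mask, eps=1e-6):
    """Group baseline/std over *valid* samples only; invalid samples get
    advantage 0 and a zero loss multiplier, so they cannot move the policy."""
    valid = [r for r, v in zip(rewards, valid_mask) if v]
    if not valid:  # edge case: no usable samples in the group
        return [0.0] * len(rewards), [0.0] * len(rewards)
    mean = sum(valid) / len(valid)
    var = sum((r - mean) ** 2 for r in valid) / len(valid)
    std = math.sqrt(var)
    advantages = [
        ((r - mean) / (std + eps)) if v else 0.0
        for r, v in zip(rewards, valid_mask)
    ]
    loss_mult = [1.0 if v else 0.0 for v in valid_mask]
    return advantages, loss_mult
```

Zeroing the loss multiplier (rather than dropping samples) keeps batch shapes fixed, which is why the change composes with the existing masking model.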

Why This Design

This keeps the existing reward tensors intact while introducing a single explicit validity signal that composes cleanly with the existing masking model:

  • valid rewards still behave exactly as before
  • invalid rewards no longer affect reward statistics or policy updates
  • unaffected environments can stay unchanged and inherit the all-valid default path

That makes the implementation targeted and low-complexity while solving the ambiguity described in #431.

Tests and Validation

Added and updated focused unit coverage for:

  • verifier failure -> invalid reward mask in math/code jaccard environments
  • reward-valid-mask ordering/defaulting in reward collation
  • reward-valid-mask propagation in sync and async rollouts
  • GRPO/GDPO/Reinforce++ estimator masking behavior
  • the HFVerifyWorker default verifier path
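The estimator-masking tests above follow the usual pytest pattern; a self-contained sketch (the `masked_mean` helper is illustrative, not an actual NeMo-RL API):

```python
def masked_mean(values, mask):
    """Mean over entries where mask is True; 0.0 if nothing is valid."""
    kept = [x for x, m in zip(values, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0


def test_invalid_rewards_excluded_from_baseline():
    # The invalid trailing 0.0 must not drag the baseline below 0.5.
    assert masked_mean([1.0, 0.0, 0.0], [True, True, False]) == 0.5


def test_all_invalid_group_degrades_to_zero_baseline():
    # A fully invalid group should produce a neutral baseline, not a crash.
    assert masked_mean([0.0, 0.0], [False, False]) == 0.0
```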

Checks run locally in the isolated worktree:

  • python3 -m py_compile on touched source and test files
  • git diff --check
  • focused direct verification script covering rollout propagation, estimator masking, and verifier success/failure cases

Full Ray-backed pytest was not runnable in this sandbox because local Ray startup timed out, and the GRPO test module also depends on optional Megatron imports not present in the environment used here.

@copy-pr-bot

copy-pr-bot bot commented Apr 3, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.
