
Handle invalid environment rewards explicitly across environments, rollouts, and GRPO#2207

Draft
taivu1998 wants to merge 1 commit into NVIDIA-NeMo:main from taivu1998:tdv/issue-431-invalid-reward-mask

Conversation

@taivu1998

Summary

This draft PR adds explicit invalid-reward handling for verifier-backed environments instead of overloading 0.0 to mean both "valid zero reward" and "reward system failure".

It introduces a batched reward_valid_mask signal on EnvironmentReturn, threads that mask through reward collation and rollout state, and uses it in GRPO advantage/statistics paths so invalid rewards do not contaminate training metrics or policy loss.

Addresses #431.

Root Cause

Today, several verifier-backed environments catch internal verification failures and convert them to numeric 0.0 rewards. Once that happens, the training stack cannot distinguish between:

  • a legitimate zero reward from the environment, and
  • a failed verification/scoring step that was collapsed to zero.

Because the reward path is tensor-based end-to-end, those invalid rewards are then treated as ordinary samples by rollout aggregation, baseline/std computation, logging, and loss masking.
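The contamination is easy to see with hypothetical numbers: a rollout group where one verifier call failed and was collapsed to 0.0 drags the group baseline down, while an explicit validity mask keeps the failure out of the statistics entirely.

```python
# Hypothetical rollout group: the last entry is a verifier *failure*
# that was collapsed to 0.0, not a real zero reward.
rewards = [1.0, 1.0, 0.0, 0.0]
valid = [True, True, True, False]  # the explicit validity mask this PR adds

# Old behavior: the failure is averaged in as an ordinary zero reward.
baseline_old = sum(rewards) / len(rewards)  # 0.5

# New behavior: only valid samples contribute to the baseline.
valid_rewards = [r for r, v in zip(rewards, valid) if v]
baseline_new = sum(valid_rewards) / len(valid_rewards)  # 2/3
```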

What Changed

Environment contract and reward producers

  • Added optional reward_valid_mask to EnvironmentReturn.
  • Updated verifier-backed environments to emit explicit validity alongside numeric rewards:
    • math_environment.py
    • code_jaccard_environment.py
    • vlm_environment.py
  • Kept rewards numeric for compatibility and marked verifier failures as invalid instead of relying on sentinel values like None or NaN.
  • Fixed a bug in HFVerifyWorker where the default verifier implementation was computed but not actually used when the kwarg was omitted.
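The shape of the contract change can be sketched as follows. Only `reward_valid_mask` comes from this PR's description; the other field names and the toy verifier are illustrative placeholders, not the actual NeMo-RL definitions.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass
class EnvironmentReturn:
    rewards: list[float]                         # rewards stay numeric for compatibility
    reward_valid_mask: list[bool] | None = None  # None => downstream treats all as valid


def toy_verify(answer: str) -> float:
    """Stand-in for a real verifier; raises when the backend fails."""
    if answer == "<verifier-error>":
        raise RuntimeError("verifier backend failed")
    return float(answer == "correct")


def score_batch(answers: list[str]) -> EnvironmentReturn:
    rewards, mask = [], []
    for a in answers:
        try:
            rewards.append(toy_verify(a))
            mask.append(True)
        except RuntimeError:
            rewards.append(0.0)  # keep the reward tensor numeric...
            mask.append(False)   # ...but flag the sample as invalid
    return EnvironmentReturn(rewards=rewards, reward_valid_mask=mask)
```

Keeping the reward numeric (rather than using `None`/`NaN` sentinels) means every existing consumer of the reward tensor keeps working; only mask-aware code changes behavior.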

Reward collation and rollout plumbing

  • Normalized missing reward masks to all-valid in calculate_rewards() for backward compatibility.
  • Preserved reward validity ordering when grouping/sorting environment outputs by task.
  • Propagated sample-level reward validity through both sync and async multi-turn rollout paths.
  • Added rollout metrics for invalid reward count/rate.
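The defaulting and metrics steps above can be sketched as below. The real code operates on tensors inside `calculate_rewards()`; plain lists and these helper names are used here only to keep the sketch dependency-free.

```python
def normalize_reward_valid_mask(rewards, reward_valid_mask=None):
    """Default a missing mask to all-valid so legacy environments need no changes."""
    if reward_valid_mask is None:
        return [True] * len(rewards)
    if len(reward_valid_mask) != len(rewards):
        raise ValueError("reward_valid_mask must match rewards in length")
    return [bool(v) for v in reward_valid_mask]


def invalid_reward_metrics(reward_valid_mask):
    """Rollout metrics: how many samples carried an unusable reward."""
    invalid = sum(1 for v in reward_valid_mask if not v)
    return {
        "invalid_reward_count": invalid,
        "invalid_reward_rate": invalid / max(len(reward_valid_mask), 1),
    }
```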

GRPO and advantage computation

  • Added helper normalization/validation for batch loss multipliers and reward-valid masks in grpo.py.
  • Excluded invalid rewards from GRPO baseline/std calculations and reward statistics.
  • Zeroed the effective sample loss multiplier for invalid-reward samples.
  • Logged reward-validity metadata in train/validation JSONL outputs and surfaced invalid reward metrics in console/logger metrics.
  • Updated GRPO, GDPO, and Reinforce++ advantage estimators to respect sample validity.
  • Hardened Reinforce++ so invalid samples remain masked even when KL-in-reward is enabled.
  • Avoided edge-case normalization issues for tiny valid sample sets in GDPO and validation logging.
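A minimal sketch of the masked advantage computation, assuming a simple group-relative normalization; the function name and exact normalization details are illustrative, not the NeMo-RL implementation:

```python
import math


def grpo_advantages(rewards, valid_mask, eps=1e-6):
    """Group baseline/std over *valid* samples only; invalid samples get
    advantage 0 and a zero loss multiplier, so they cannot move the policy."""
    valid = [r for r, v in zip(rewards, valid_mask) if v]
    if not valid:  # edge case: no usable samples in the group
        return [0.0] * len(rewards), [0.0] * len(rewards)
    mean = sum(valid) / len(valid)
    var = sum((r - mean) ** 2 for r in valid) / len(valid)
    std = math.sqrt(var)
    advantages = [
        ((r - mean) / (std + eps)) if v else 0.0
        for r, v in zip(rewards, valid_mask)
    ]
    loss_mult = [1.0 if v else 0.0 for v in valid_mask]
    return advantages, loss_mult
```

Zeroing the loss multiplier (rather than dropping samples) keeps batch shapes fixed, which is why the change composes with the existing masking model.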

Why This Design

This keeps the existing reward tensors intact while introducing a single explicit validity signal that composes cleanly with the existing masking model:

  • valid rewards still behave exactly as before
  • invalid rewards no longer affect reward statistics or policy updates
  • unaffected environments can stay unchanged and inherit the all-valid default path

That makes the implementation targeted and low-complexity while solving the ambiguity described in #431.

Tests and Validation

Added and updated focused unit coverage for:

  • verifier failure -> invalid reward mask in math/code jaccard environments
  • reward-valid-mask ordering/defaulting in reward collation
  • reward-valid-mask propagation in sync and async rollouts
  • GRPO/GDPO/Reinforce++ estimator masking behavior
  • the HFVerifyWorker default verifier path
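The estimator-masking tests above follow the usual pytest pattern; a self-contained sketch (the `masked_mean` helper is illustrative, not an actual NeMo-RL API):

```python
def masked_mean(values, mask):
    """Mean over entries where mask is True; 0.0 if nothing is valid."""
    kept = [x for x, m in zip(values, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0


def test_invalid_rewards_excluded_from_baseline():
    # The invalid trailing 0.0 must not drag the baseline below 0.5.
    assert masked_mean([1.0, 0.0, 0.0], [True, True, False]) == 0.5


def test_all_invalid_group_degrades_to_zero_baseline():
    # A fully invalid group should produce a neutral baseline, not a crash.
    assert masked_mean([0.0, 0.0], [False, False]) == 0.0
```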

Checks run locally in the isolated worktree:

  • python3 -m py_compile on touched source and test files
  • git diff --check
  • focused direct verification script covering rollout propagation, estimator masking, and verifier success/failure cases

Full Ray-backed pytest was not runnable in this sandbox because local Ray startup timed out, and the GRPO test module also depends on optional Megatron imports not present in the environment used here.

@copy-pr-bot

copy-pr-bot bot commented Apr 3, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.
