
fix: resolve distributed stability issues, LoRA weight sync, and recovery crash#438

Open
Wangxiaoxiaoa wants to merge 1 commit into alibaba:main from Wangxiaoxiaoa:fix/dist_stability_lora

Conversation

@Wangxiaoxiaoa

Description

This PR addresses several critical stability issues encountered during large-scale RLVR pipeline execution on an H200 cluster, improving robustness for distributed setups and LoRA scenarios.

Linked Issue:

Fixes #435

Changes

  • LoRA Synchronization: Enhanced DeepSpeedWeightUpdater to natively handle PEFT/LoRA state dicts. Added _add_lora_to_infer_workers to ensure peft_config is synchronized across remote rollout workers, enabling reliable weight hot-swapping.
  • State Recovery: Resolved an AttributeError in DynamicSamplingScheduler by re-ordering the initialization sequence to ensure dataset_iter is ready before any data access during checkpoint resumption.
  • DeepSpeed Robustness: Modified deepspeed_utils.py to automatically filter out empty parameter groups, preventing initialization failures when base layers are frozen.
  • Scalable Resource Management:
    • Introduced the ROLL_RAY_TEMP_DIR environment variable to allow redirection of massive Ray metadata/logs to high-capacity disk partitions.
    • Optimized rlvr_vlm_pipeline.py for single-domain tasks by avoiding redundant full-dataset rewrites of large Multimodal Arrow files, mitigating pyarrow offset overflows.
  • Automation: Added automatic derivation of max_steps based on dataset size and rollout batch configurations.
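The LoRA-synchronization change can be illustrated with a minimal sketch. Under PEFT's naming convention, adapter weights live under keys containing `lora_`, so weight hot-swapping only needs those tensors (plus the `peft_config` pushed to each rollout worker) rather than the full state dict. The helper below is an illustrative stand-in, not the actual `DeepSpeedWeightUpdater` code:

```python
def extract_lora_tensors(state_dict):
    """Keep only LoRA adapter tensors from a PEFT model state dict.

    PEFT stores adapter weights under keys such as
    "base_model.model.layers.0.self_attn.q_proj.lora_A.weight".
    Broadcasting just these (with peft_config already set up on the
    workers) keeps the sync payload far smaller than a full state dict.
    """
    return {k: v for k, v in state_dict.items() if "lora_" in k}
```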
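The state-recovery fix is an initialization-ordering issue, which can be sketched as follows. The class and attribute names mirror the description above, but this is a simplified illustration, not the scheduler's real implementation:

```python
class SchedulerSketch:
    """Illustrative only: dataset_iter must exist before any
    checkpoint-resumption code touches it, otherwise resuming
    raises AttributeError."""

    def __init__(self, dataset, resume_state=None):
        self.dataset = dataset
        self.dataset_iter = iter(dataset)   # create the iterator first
        if resume_state is not None:
            self.load_state(resume_state)   # may advance the iterator

    def load_state(self, state):
        # Fast-forward past samples consumed before the checkpoint.
        for _ in range(state.get("consumed", 0)):
            next(self.dataset_iter)

    def next_sample(self):
        return next(self.dataset_iter)
```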
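The DeepSpeed robustness fix amounts to filtering optimizer parameter groups before handing them to `deepspeed.initialize()`. A minimal sketch (the function name is hypothetical, not the actual code in `deepspeed_utils.py`):

```python
def filter_empty_param_groups(param_groups):
    """Drop optimizer parameter groups that contain no parameters.

    When base-model layers are frozen (e.g. LoRA fine-tuning), groups
    built from those layers can come out empty, and passing an empty
    group to deepspeed.initialize() fails during optimizer setup.
    """
    return [g for g in param_groups if len(g.get("params", [])) > 0]
```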
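The `ROLL_RAY_TEMP_DIR` mechanism can be sketched as a small lookup whose result would be passed to `ray.init(_temp_dir=...)`; the helper name and default are assumptions for illustration:

```python
import os

def resolve_ray_temp_dir(default=None):
    """Hypothetical helper: Ray writes session metadata and worker
    logs under its temp dir, which can fill a small root partition
    on large clusters. ROLL_RAY_TEMP_DIR lets operators redirect it
    to a high-capacity disk; otherwise Ray's default location is used."""
    return os.environ.get("ROLL_RAY_TEMP_DIR", default)
```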
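The automatic `max_steps` derivation is not spelled out in the PR text; one plausible formula, assuming one training step per rollout batch per epoch, would be:

```python
import math

def derive_max_steps(dataset_size, rollout_batch_size, num_epochs=1):
    # Assumed formula (not confirmed by the PR): one optimizer step
    # per rollout batch, repeated for each pass over the dataset.
    steps_per_epoch = math.ceil(dataset_size / rollout_batch_size)
    return steps_per_epoch * num_epochs
```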

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • Performance improvement (optimization)

@CLAassistant

CLAassistant commented May 8, 2026

CLA assistant check
All committers have signed the CLA.

@Wangxiaoxiaoa force-pushed the fix/dist_stability_lora branch from d067c89 to c2b3a0d on May 8, 2026 at 05:52


Development

Successfully merging this pull request may close these issues.

[Bug] Training/Inference crash with Qwen2/3-VL due to missing mm_token_type_ids in Collator
