
fix: resolve distributed stability issues, LoRA weight sync, and recovery crash#438

Open
Wangxiaoxiaoa wants to merge 1 commit into alibaba:main from Wangxiaoxiaoa:fix/dist_stability_lora

Conversation

@Wangxiaoxiaoa

Description

This PR addresses several critical stability issues encountered during large-scale RLVR pipeline execution on an H200 cluster, improving robustness for distributed setups and LoRA scenarios.

Linked Issue:

Fixes #435

Changes

  • LoRA Synchronization: Enhanced DeepSpeedWeightUpdater to natively handle PEFT/LoRA state dicts. Added _add_lora_to_infer_workers to ensure peft_config is synchronized across remote rollout workers, enabling reliable weight hot-swapping.
  • State Recovery: Resolved an AttributeError in DynamicSamplingScheduler by re-ordering the initialization sequence to ensure dataset_iter is ready before any data access during checkpoint resumption.
  • DeepSpeed Robustness: Modified deepspeed_utils.py to automatically filter out empty parameter groups, preventing initialization failures when base layers are frozen.
  • Scalable Resource Management:
    • Introduced the ROLL_RAY_TEMP_DIR environment variable to allow redirection of massive Ray metadata/logs to high-capacity disk partitions.
    • Optimized rlvr_vlm_pipeline.py for single-domain tasks by avoiding redundant full-dataset rewrites of large Multimodal Arrow files, mitigating pyarrow offset overflows.
  • Automation: Added automatic derivation of max_steps based on dataset size and rollout batch configurations.
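The LoRA-synchronization change can be illustrated with a minimal sketch. Under PEFT's naming convention, adapter weights live under keys containing `lora_`, so weight hot-swapping only needs those tensors (plus the `peft_config` pushed to each rollout worker) rather than the full state dict. The helper below is an illustrative stand-in, not the actual `DeepSpeedWeightUpdater` code:

```python
def extract_lora_tensors(state_dict):
    """Keep only LoRA adapter tensors from a PEFT model state dict.

    PEFT stores adapter weights under keys such as
    "base_model.model.layers.0.self_attn.q_proj.lora_A.weight".
    Broadcasting just these (with peft_config already set up on the
    workers) keeps the sync payload far smaller than a full state dict.
    """
    return {k: v for k, v in state_dict.items() if "lora_" in k}
```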
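The state-recovery fix is an initialization-ordering issue, which can be sketched as follows. The class and attribute names mirror the description above, but this is a simplified illustration, not the scheduler's real implementation:

```python
class SchedulerSketch:
    """Illustrative only: dataset_iter must exist before any
    checkpoint-resumption code touches it, otherwise resuming
    raises AttributeError."""

    def __init__(self, dataset, resume_state=None):
        self.dataset = dataset
        self.dataset_iter = iter(dataset)   # create the iterator first
        if resume_state is not None:
            self.load_state(resume_state)   # may advance the iterator

    def load_state(self, state):
        # Fast-forward past samples consumed before the checkpoint.
        for _ in range(state.get("consumed", 0)):
            next(self.dataset_iter)

    def next_sample(self):
        return next(self.dataset_iter)
```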
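The DeepSpeed robustness fix amounts to filtering optimizer parameter groups before handing them to `deepspeed.initialize()`. A minimal sketch (the function name is hypothetical, not the actual code in `deepspeed_utils.py`):

```python
def filter_empty_param_groups(param_groups):
    """Drop optimizer parameter groups that contain no parameters.

    When base-model layers are frozen (e.g. LoRA fine-tuning), groups
    built from those layers can come out empty, and passing an empty
    group to deepspeed.initialize() fails during optimizer setup.
    """
    return [g for g in param_groups if len(g.get("params", [])) > 0]
```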
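The `ROLL_RAY_TEMP_DIR` mechanism can be sketched as a small lookup whose result would be passed to `ray.init(_temp_dir=...)`; the helper name and default are assumptions for illustration:

```python
import os

def resolve_ray_temp_dir(default=None):
    """Hypothetical helper: Ray writes session metadata and worker
    logs under its temp dir, which can fill a small root partition
    on large clusters. ROLL_RAY_TEMP_DIR lets operators redirect it
    to a high-capacity disk; otherwise Ray's default location is used."""
    return os.environ.get("ROLL_RAY_TEMP_DIR", default)
```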
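The automatic `max_steps` derivation is not spelled out in the PR text; one plausible formula, assuming one training step per rollout batch per epoch, would be:

```python
import math

def derive_max_steps(dataset_size, rollout_batch_size, num_epochs=1):
    # Assumed formula (not confirmed by the PR): one optimizer step
    # per rollout batch, repeated for each pass over the dataset.
    steps_per_epoch = math.ceil(dataset_size / rollout_batch_size)
    return steps_per_epoch * num_epochs
```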

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • Performance improvement (optimization)

@CLAassistant

CLAassistant commented May 8, 2026

CLA assistant check
All committers have signed the CLA.

@Wangxiaoxiaoa force-pushed the fix/dist_stability_lora branch from d067c89 to c2b3a0d on May 8, 2026 at 05:52


Development

Successfully merging this pull request may close these issues.

[Bug] Training/Inference crash with Qwen2/3-VL due to missing mm_token_type_ids in Collator
