feat: Implement Dynamic Environment Orchestrator (DEO) for Elastic Scaling and Resilience#438

Open
RUFFY-369 wants to merge 15 commits into NousResearch:main from RUFFY-369:feat/deo-elastic-orchestrator

@RUFFY-369

PR Type

  • RL Environment PR - Complete Environment Snapshot & Zero-Training sections
  • Non-Environment PR - Complete Description, Related Issues & Type of Change sections

📝 General Information

Description

This PR implements the Dynamic Environment Orchestrator (DEO), a resilient control plane for managing elastic scaling of environment workers. The current static allocation model in atroposlib often leads to VRAM fragmentation or rollout starvation during high-variance RL workloads.

Technical Implementations:

  • PID-Dampened Scaling: Implements a scaling controller with hysteresis to prevent "flapping" during transient throughput spikes.
  • Resilient Execution: Uses os.setpgrp for process group isolation and adopts orphaned child processes on startup, preventing resource leakage on GPU nodes.
  • Graceful Draining: Adds SIGUSR1 signal handling across the environment stack, allowing workers to complete in-flight rollouts before exiting.
  • Hardware Telemetry: Integration with nvidia-smi to cordon thermally throttled or memory-saturated GPUs during scale-up.
  • API Expansion: Adds a /global-status endpoint to the Atropos server to expose real-time "Rollout Pressure" metrics.
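
For reviewers unfamiliar with the flapping problem, the hysteresis idea behind the first bullet can be sketched roughly as follows. This is a minimal illustration and not the controller shipped in this PR: the class name `HysteresisScaler`, the gains, the dead band, the cooldown, and the unit-less "pressure" input are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class HysteresisScaler:
    """Hypothetical PID-style scaler with a dead band and a cooldown.

    All names and default values are illustrative assumptions,
    not taken from the DEO implementation.
    """

    target_pressure: float = 1.0  # desired rollout pressure (assumed unit-less)
    kp: float = 2.0               # proportional gain
    ki: float = 0.0               # integral gain (disabled by default)
    kd: float = 0.0               # derivative gain (disabled by default)
    dead_band: float = 0.2        # errors smaller than this trigger no action
    cooldown_ticks: int = 3       # minimum ticks between two scaling actions
    min_workers: int = 1
    max_workers: int = 8

    _integral: float = field(default=0.0, init=False)
    _prev_error: float = field(default=0.0, init=False)
    _ticks_since_change: int = field(default=10**9, init=False)

    def step(self, pressure: float, workers: int) -> int:
        """Return the desired worker count for the current tick."""
        error = pressure - self.target_pressure
        self._integral += error
        derivative = error - self._prev_error
        self._prev_error = error
        self._ticks_since_change += 1

        # Hysteresis: ignore small deviations and recent changes so that
        # a transient spike cannot cause scale-up followed by scale-down.
        in_dead_band = abs(error) < self.dead_band
        cooling_down = self._ticks_since_change < self.cooldown_ticks
        if in_dead_band or cooling_down:
            return workers

        signal = self.kp * error + self.ki * self._integral + self.kd * derivative
        desired = workers + round(signal)
        new_workers = max(self.min_workers, min(self.max_workers, desired))
        if new_workers != workers:
            self._ticks_since_change = 0
        return new_workers
```

With the default values, a pressure reading of 1.1 falls inside the dead band and leaves the worker count unchanged, while a drop in pressure right after a scale-up is ignored until the cooldown has elapsed. A real controller would also need to bound the integral term when `ki` is non-zero.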

Related Issues

Resolves #437

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update
  • Code refactor (no functional changes)
  • Build/CI/CD related changes
  • Other (please describe):


✅ Developer & Reviewer Checklist

  • Code follows project style (black, isort, flake8 pass with pre-commit)
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • New and existing unit tests pass locally with my changes
  • Docstrings added for all new public classes / functions
  • If .env vars required, did you add it to the .env.example in repo root?

RUFFY-369 and others added 15 commits April 3, 2026 01:58
…chestration

Implemented a production-grade orchestrator (DEO) with:
- Non-blocking scaling with SIGUSR1 drainage for zero-data-loss.
- Hardware isolation via CUDA_VISIBLE_DEVICES injection.
- Self-healing CrashLoopBackOff detection and heartbeat grace periods.
- Proactive GPU cordoning based on real-time health checks.
- Robust workspace sanitization and orphaned process adoption.

Verified on RTX 3090 with real-world model deployment.
