Skip to content

feat(server_handling): Implement Native Stateful SGLang Infrastructure with Delta-Sync & Session Pinning#443

Open
RUFFY-369 wants to merge 10 commits intoNousResearch:mainfrom
RUFFY-369:feature/smg-native-stateful-routing
Open

feat(server_handling): Implement Native Stateful SGLang Infrastructure with Delta-Sync & Session Pinning#443
RUFFY-369 wants to merge 10 commits intoNousResearch:mainfrom
RUFFY-369:feature/smg-native-stateful-routing

Conversation

@RUFFY-369
Copy link
Copy Markdown

PR Type

  • Non-Environment PR - Complete Description, Related Issues & Type of Change sections

📝 General Information

Description

This PR introduces a production-grade Stateful SGLang Infrastructure to the Atropos repository, specifically designed to meet the high-performance reasoning requirements of the Hermes 4 era.

Historically, Atropos was deliberately stateless for universal compatibility. This PR evolves that architecture to support Stateful Reasoning for SGLang backends, enabling massive performance gains in multi-turn reasoning chains.

Key Technical Enhancements:

  1. Delta-Sync Protocol: Implemented StatefulSGLangServer which transmits only the delta_input_ids to the worker. This achieves O(1) bandwidth scaling and a verified >80% reduction in inbound network serialization.
  2. Deterministic Session Pinning: Integrated Consistent Hashing into the ServerManager (get_consistent_worker_index). This guarantees that multi-turn reasoning sessions are pinned to the same GPU worker, enabling near 100% KV-cache residency using SGLang's RadixAttention.
  3. Auto-Rebuild Resilience: Implemented an automated state-recovery mechanism that transparently re-primes the server-side state in the event of a cache eviction or worker restart, ensuring zero trajectory loss.
  4. Lightweight Monitoring: Migrated health checks to a lightweight GET /health protocol to eliminate inference-heavy pings and ensure cluster stability under high load.

Performance Impact:

  • TTFT Optimization: Theoretically eliminates 10s–20s of "cold" prefill latency for 70B+ models by ensuring cache hits across turns.
  • Network Efficiency: Scaled multi-turn traffic from O(N²) to O(N) by eliminating redundant history transmission.

Related Issues

Solves #442

Type of Change

  • New feature (non-breaking change which adds functionality)
  • Code refactor (sanitization and modernization)
  • This change requires a documentation update (Stateful server usage)

✅ Developer & Reviewer Checklist

  • Code follows project style (Sanitized of non-technical verbosity)
  • I have performed a self-review of my own code
  • I have commented my code, focusing on technical implementation details
  • New and existing unit tests pass locally with my changes (test_server_pinning.py)
  • Docstrings added for all new public classes / functions
  • Verified E2E on 2x RTX 3090 hardware cluster

RUFFY-369 and others added 10 commits April 8, 2026 16:14
…frastructure

- Implemented StatefulSGLangServer with Delta-Sync protocol and Auto-Rebuild resilience.
- Integrated deterministic session-to-worker pinning via consistent hashing in ServerManager.
- Hardened pinning logic with 3-retry health check resiliency to handle high load jitter.
- Optimized status monitoring to use lightweight /health protocol.
- Significant reduction (>80%) in network payload and speedup in TTFT (Time To First Token) via cache hits.
- Verified E2E on 2x RTX 3090 hardware.
- Condense verbose comments and docstrings for technical clarity.
- Professionalize terminal reporting and utility logs.
- Simplify routing and pinning logic documentation.
- Verified zero regressions in logic via regression test suite.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant