
Tk/infra #2238

Draft

terrykong wants to merge 6 commits into main from tk/infra

Conversation

@terrykong
Collaborator

What does this PR do?

Add a one line overview of what this PR aims to accomplish.

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests.
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Local K8s GPU dev environment using nvkind (NVIDIA's kind wrapper):
- nvkind cluster setup scripts (install-nvkind.sh, create-cluster.sh)
- Custom config template with extraMounts for dev code mounting
- Helmfile with kind/prod environments (device plugin vs GPU operator)
- KAI scheduler for gang scheduling, KubeRay for RayCluster management
- Example manifests: gang-scheduled pods, RayClusters, SFT RayJobs
- SETUP.md with prerequisites, quick start, and architecture docs

Tested: SFT RayJob (train/loss 4.06 < 5.9), KAI all-or-nothing
gang scheduling, two simultaneous 1-GPU SFT jobs.
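The custom config template with extraMounts mentioned above could look roughly like the sketch below. This is illustrative only, not the actual template from the PR: the hostPath, containerPath, and node layout are assumptions, and a real nvkind template would additionally be Go-templated over the detected GPUs.

```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
    extraMounts:
      # Mount the local checkout into the worker node so pods can
      # reach the dev code via a hostPath volume (paths hypothetical).
      - hostPath: /home/user/nemo-rl
        containerPath: /dev-code
```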

Add optional remote_gym_url to NemoGymConfig. When set, the NemoGym
Ray actor connects to an external Gym HTTP service instead of spawning
local subprocesses. Colocated mode (default) is unchanged.

- nemo_gym.py: split __init__ into remote/colocated paths
- run_grpo_nemo_gym.py: support env.remote_gym_url and env.disagg_job_id
- Gym submodule: standalone_server.py entry point with K8s endpoint
  registry integration, use_absolute_ip for cross-pod communication
- gym_standalone_config.yaml: example config for standalone server

Tested: disaggregated GRPO completed 3 training steps with RL on one
RayCluster (2 GPU) and Gym on a separate RayCluster (CPU only).
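The remote/colocated split in `__init__` might be structured along these lines. This is a hypothetical sketch: the class shape, field names, and the local-spawn helper are illustrative, not the actual `nemo_gym.py` API.

```python
# Hypothetical sketch of the remote/colocated __init__ split described
# above; names are illustrative, not the real nemo_gym.py interface.
from dataclasses import dataclass
from typing import Optional


@dataclass
class NemoGymConfig:
    # When set, connect to an external Gym HTTP service instead of
    # spawning local subprocesses.
    remote_gym_url: Optional[str] = None


class NemoGym:
    def __init__(self, config: NemoGymConfig):
        if config.remote_gym_url:
            # Disaggregated mode: use the external Gym service.
            self.base_url = config.remote_gym_url
            self.local_procs = None
        else:
            # Colocated mode (default): spawn local subprocesses as before.
            self.base_url = "http://127.0.0.1:8000"
            self.local_procs = self._spawn_local_servers()

    def _spawn_local_servers(self):
        # Placeholder for the pre-existing subprocess-spawning path.
        return []
```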

…overy

Each (RL, Gym) job pair shares a ConfigMap for dynamic address exchange.
Both sides register their IP:port and poll for the peer's address.
The ConfigMap has an ownerReference to the RL RayCluster for automatic
garbage collection on teardown.

- k8s_endpoint_registry.py: create/set/get/get_nowait methods with race
  condition handling (409 retry) and proper error propagation
- endpoint-registry-rbac.yaml: ServiceAccount + Role + RoleBinding
- disagg_rl_raycluster.yaml: RL cluster with serviceAccountName
- disagg_gym_raycluster.yaml: Gym cluster with serviceAccountName

Tested: ConfigMap CRUD verified in-cluster, bidirectional URL exchange
between RL and Gym clusters confirmed working.
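The 409-retry race handling can be sketched as a create-then-patch fallback. A real implementation would use the `kubernetes` client; here a minimal in-memory fake stands in so the control flow is self-contained, and all names are illustrative rather than the actual `k8s_endpoint_registry.py` code.

```python
# Sketch of the 409-retry pattern: two pods racing to create the shared
# ConfigMap; the loser gets 409 Conflict and patches instead. The fake
# client below is for illustration only.

class ApiException(Exception):
    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


class FakeClient:
    """Stand-in for a K8s API client keyed by ConfigMap name."""
    def __init__(self):
        self.store = {}

    def create_config_map(self, name, data):
        if name in self.store:
            raise ApiException(409)  # already exists
        self.store[name] = dict(data)

    def patch_config_map(self, name, data):
        self.store[name].update(data)


def create_or_update(client, name, data):
    """Create the ConfigMap; on 409 fall back to a patch."""
    try:
        client.create_config_map(name, data)
    except ApiException as e:
        if e.status != 409:
            raise  # propagate real errors instead of swallowing them
        client.patch_config_map(name, data)
```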

When RL and Gym run on separate RayClusters, either cluster failing or
being deleted triggers teardown of both clusters to release resources.

- peer-watcher.py: pure Python sidecar (no deps beyond stdlib), deployed
  as a ConfigMap volume mount on each head pod
- Monitors peer RayCluster status via K8s API (polls every 10s)
- Tears down after MAX_PEER_FAILURES (default 3) consecutive failures
- Also monitors ConfigMap "error" key for application-level error signaling
- Handles transient K8s API errors as failures (not false-healthy)
- Added signal_error() to K8sEndpointRegistry
- Updated disagg manifests with peer-watcher sidecar containers
- Updated RBAC with "delete" verb for rayclusters

Tested: deleting either cluster triggers teardown of both within ~10s.
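The consecutive-failure counting described above (including treating transient K8s API errors as failures, never as false-healthy) reduces to a small loop. This is a hypothetical sketch, not the actual peer-watcher.py; polling and teardown are abstracted out so only the decision logic is shown.

```python
# Sketch of the peer-watcher failure counting: a healthy poll resets the
# counter, an unhealthy poll or a raised API error increments it, and
# MAX_PEER_FAILURES consecutive failures trigger teardown.
MAX_PEER_FAILURES = 3


def should_tear_down(polls, max_failures=MAX_PEER_FAILURES):
    """Each item of `polls` is a callable returning True if the peer is
    healthy. Exceptions (transient K8s API errors) count as failures."""
    failures = 0
    for poll in polls:
        try:
            healthy = poll()
        except Exception:
            healthy = False  # never treat an API error as healthy
        failures = 0 if healthy else failures + 1
        if failures >= max_failures:
            return True  # tear down both clusters
    return False
```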

…share configs

- Kyverno policy: RayCluster/RayJob must have kai.scheduler/queue label.
  Validates at the CRD level (not the pod level), since the KubeRay operator creates the pods.
  Optional Policy 2 for user→queue access control via ConfigMap.
- kube-prometheus-stack: Prometheus + Grafana for fairshare monitoring.
  Pre-built Grafana dashboard showing GPU allocation vs fair share,
  preemption events, and scheduling latency per queue.
- ServiceMonitors for KAI scheduler, binder, and queue-controller.
- Example queue configs:
  - kai-queue.yaml: 2-GPU kind cluster (2 teams, equal quotas)
  - kai-queue-prod.yaml: 256-GPU prod (3 departments, 6 teams)
  - preemptMinRuntime: 4h (protect long training runs from priority preemption)
  - reclaimMinRuntime: 15m (fast fairness reclaim of over-quota resources)
- SETUP.md: fairshare docs, preempt vs reclaim explanation, Grafana access.

Tested: Kyverno rejects a RayCluster without the queue label and accepts one with it.
Team A 2-GPU job reclaimed when Team B submitted to its guaranteed quota.
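A Kyverno policy enforcing the queue label at the CRD level could take roughly this shape. The policy name, message, and exact match block are illustrative guesses, not the PR's actual policy file:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-kai-queue-label   # hypothetical name
spec:
  validationFailureAction: Enforce
  rules:
    - name: raycluster-rayjob-queue-label
      match:
        any:
          - resources:
              kinds: ["RayCluster", "RayJob"]
      validate:
        message: "RayCluster/RayJob must carry a kai.scheduler/queue label."
        pattern:
          metadata:
            labels:
              kai.scheduler/queue: "?*"   # any non-empty value
```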

- Upgrade KAI scheduler v0.13.4 → v0.14.0 (adds Ray topology-aware
  scheduling, segment-size annotation support for PyTorchJob)
- Update chart URL from NVIDIA/KAI-Scheduler to kai-scheduler/KAI-Scheduler
- Fix Grafana dashboard metric names (add kai_ prefix to match actual
  Prometheus metric names). Verified: Grafana queries return live data.
- New: extensions/k8s_cli/ — standalone Python CLI (pip installable):
  - nrl-k8s fairshare — show queue config (quota, limit, weight, priority)
  - nrl-k8s occupancy — show GPU allocation per node and per queue
  - nrl-k8s submit — submit gang-scheduled RayJob with optional
    --segment-size for topology-aware scheduling
  - 6 unit tests (mocked K8s API), all passing
- Add TODO for NVL72 topology testing with links to relevant PRs/issues

Tested: KAI v0.14.0 gang scheduling works, CLI commands verified against
live cluster, Grafana dashboard loads and queries return data.
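The per-node aggregation behind a command like `nrl-k8s occupancy` can be sketched as summing `nvidia.com/gpu` requests from scheduled pods. This is a hypothetical illustration with a simplified pod structure, not the real CLI code.

```python
# Hypothetical sketch: aggregate GPU requests per node from simplified
# pod dicts (a real CLI would read these from the K8s API).
from collections import defaultdict


def gpu_occupancy(pods):
    """Sum nvidia.com/gpu requests per node; unscheduled pods are skipped."""
    per_node = defaultdict(int)
    for pod in pods:
        node = pod.get("node")
        if node is None:
            continue  # not scheduled yet, occupies nothing
        for container in pod.get("containers", []):
            gpus = container.get("requests", {}).get("nvidia.com/gpu", 0)
            per_node[node] += int(gpus)
    return dict(per_node)
```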

@copy-pr-bot

copy-pr-bot bot commented Apr 9, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

