feat(BA-1214): Derive PyTorch/TensorFlow distributed training env vars at container startup #10726
Conversation
…ner startup Instead of computing PyTorch/TensorFlow environment variables in the manager registry (as proposed in PR #4244), derive them from existing BACKENDAI_* cluster variables at container startup via a sourced shell script. This keeps the manager clean, avoids fragile image-name matching, and lets any image benefit from the setup automatically. Resolves #4243 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
This PR adds container-startup logic in the runner to derive PyTorch/TensorFlow distributed training environment variables from existing BACKENDAI_* cluster session variables, avoiding manager-side registry changes and making the behavior image-agnostic and user-overridable.
Changes:
- Add `setup_dist_environ.sh` to derive `WORLD_SIZE`, `RANK`, `LOCAL_RANK`, `MASTER_ADDR`, `MASTER_PORT`, and `TF_CONFIG` for multi-container cluster sessions.
- Source the new script from the runner `entrypoint.sh` in both root and non-root execution paths.
- Document the derived distributed-training variables and override behavior in the networking concepts doc.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `src/ai/backend/runner/setup_dist_environ.sh` | New sourced script that derives framework env vars from `BACKENDAI_*` cluster variables. |
| `src/ai/backend/runner/entrypoint.sh` | Sources the new distributed-env setup script during container initialization. |
| `docs/concepts/networking.rst` | Documents the new derived env vars and precedence/activation rules. |
```sh
    export RANK="$BACKENDAI_CLUSTER_LOCAL_RANK"
fi
if [ -z "$LOCAL_RANK" ]; then
    export LOCAL_RANK="$BACKENDAI_CLUSTER_LOCAL_RANK"
```
LOCAL_RANK is being derived from BACKENDAI_CLUSTER_LOCAL_RANK, but this Backend.AI variable is documented as a global rank (and is assigned sequentially across all kernels). In PyTorch/TorchElastic conventions, LOCAL_RANK is typically node/container-local and is commonly used for device selection; setting it to the global rank can exceed the number of visible devices in a container (e.g., rank 3 with a single GPU) and break training. Consider not setting LOCAL_RANK by default (leave it to launchers like torchrun), or default it to 0 when running one process per container.
Suggested change:

```sh
# BACKENDAI_CLUSTER_LOCAL_RANK is a global rank across all kernels.
# LOCAL_RANK should be node/container-local; default to 0 for one process
# per container and let launchers like torchrun override it if needed.
export LOCAL_RANK="0"
```
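For illustration, here is the conventional mapping a launcher such as torchrun applies: the node-local rank is the global rank modulo the number of processes per node. This is a sketch, not code from this PR; `GLOBAL_RANK` and `PROCS_PER_NODE` are hypothetical names.

```shell
# With 2 nodes x 2 processes, global ranks 0..3 map to local ranks 0,1,0,1.
GLOBAL_RANK=3
PROCS_PER_NODE=2

# POSIX arithmetic expansion: node-local position of this process.
LOCAL_RANK=$((GLOBAL_RANK % PROCS_PER_NODE))
echo "LOCAL_RANK=$LOCAL_RANK"
```

This is why using the global `BACKENDAI_CLUSTER_LOCAL_RANK` directly as `LOCAL_RANK` can index a device that does not exist in the container.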
```sh
IFS=','
for host in $BACKENDAI_CLUSTER_HOSTS; do
    if [ -n "$TF_WORKER_LIST" ]; then
        TF_WORKER_LIST="${TF_WORKER_LIST},"
    fi
    TF_WORKER_LIST="${TF_WORKER_LIST}\"${host}:${DIST_MASTER_PORT}\""
done
unset IFS
```
This script is sourced from the entrypoint, but it mutates IFS and then does unset IFS. If the caller had a non-default IFS, unsetting will not restore it, potentially impacting later shell logic in the entrypoint. Save the original IFS before changing it and restore it afterwards.
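A sketch of this suggestion: save the caller's `IFS` before the split and restore it afterwards, instead of `unset IFS`. Variable names mirror the script in this PR; the host list and port are illustrative values.

```shell
IFS=':'                          # pretend the caller relies on a custom IFS
BACKENDAI_CLUSTER_HOSTS="node0,node1"
DIST_MASTER_PORT="2222"

_saved_IFS="$IFS"                # save the caller's IFS
IFS=','
TF_WORKER_LIST=""
for host in $BACKENDAI_CLUSTER_HOSTS; do
  if [ -n "$TF_WORKER_LIST" ]; then
    TF_WORKER_LIST="${TF_WORKER_LIST},"
  fi
  TF_WORKER_LIST="${TF_WORKER_LIST}\"${host}:${DIST_MASTER_PORT}\""
done
IFS="$_saved_IFS"                # the caller's ':' IFS survives
unset _saved_IFS

echo "$TF_WORKER_LIST"
```

With `unset IFS` the caller would be left with the default field splitting instead of its own `':'`.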
```rst
     - Zero-based global index of the current container within the cluster. Derived from ``BACKENDAI_CLUSTER_LOCAL_RANK``.
     - ``0``
   * - ``LOCAL_RANK``
     - Zero-based local index of the current container. Derived from ``BACKENDAI_CLUSTER_LOCAL_RANK``.
```
The new docs describe BACKENDAI_CLUSTER_LOCAL_RANK as a global index (line 59), but this table says LOCAL_RANK is a local index while still being derived from BACKENDAI_CLUSTER_LOCAL_RANK. Please reconcile the terminology: either describe LOCAL_RANK as global in this context, or change the derivation/variable names so LOCAL_RANK matches framework expectations.
Suggested change:

```rst
     - Zero-based global index of the current container in this Backend.AI cluster (alias of ``RANK`` for compatibility). Derived from ``BACKENDAI_CLUSTER_LOCAL_RANK``.
```
```sh
fi

# Set up distributed training environment variables from BACKENDAI_* cluster vars.
. /opt/kernel/setup_dist_environ.sh
```
The entrypoint unconditionally sources /opt/kernel/setup_dist_environ.sh. To avoid noisy startup errors (or failures if set -e is introduced later), consider guarding this with a file-existence check and emitting a clear warning only when expected.
Suggested change:

```sh
if [ -f "/opt/kernel/setup_dist_environ.sh" ]; then
    . /opt/kernel/setup_dist_environ.sh
else
    echo "WARNING: /opt/kernel/setup_dist_environ.sh not found; distributed training environment variables were not initialized."
fi
```
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…hers WORLD_SIZE, RANK, and LOCAL_RANK are intentionally not pre-set at the container level — launchers like torchrun set these per-process based on GPUs per node, and pre-setting them would conflict in multi-GPU setups. Also assign unique ports per TF worker (base port + rank) to avoid conflicts when multiple workers share a host. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
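The per-worker port assignment described above (base port + rank) can be sketched as follows; the assumption that the base comes from `BACKENDAI_DIST_MASTER_PORT` with a `29500` default matches this PR, while `WORKER_PORT` is an illustrative name.

```shell
# Fall back to the conventional default when the override is unset.
BACKENDAI_DIST_MASTER_PORT="${BACKENDAI_DIST_MASTER_PORT:-29500}"
BACKENDAI_CLUSTER_LOCAL_RANK=3

# Each TF worker gets a unique port: base + its rank, so workers
# sharing a host do not collide.
WORKER_PORT=$((BACKENDAI_DIST_MASTER_PORT + BACKENDAI_CLUSTER_LOCAL_RANK))
echo "$WORKER_PORT"
```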
Coverage for both frameworks
Both frameworks are fully covered with the current setup.
rapsealk
left a comment
Missing mount breaks all session startups
`entrypoint.sh` unconditionally sources the new script:

```sh
. /opt/kernel/setup_dist_environ.sh
```

But `setup_dist_environ.sh` is not mounted into the container. The mount provisioner (`src/ai/backend/agent/stage/kernel_lifecycle/docker/mount/krunner.py`) explicitly lists every file that gets bind-mounted to `/opt/kernel/`, and this file is missing from `_prepare_default_mounts()`.
In POSIX sh, sourcing a non-existent file with . is a fatal error — the shell exits immediately. This means every session startup crashes, not just single-kernel ones.
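A minimal reproduction of that failure mode (illustrative; the exact exit code varies by shell, so only a nonzero status is assumed):

```shell
# Sourcing a missing file with `.` aborts a non-interactive POSIX shell,
# so the `echo` after it never runs and the captured output is empty.
out=$(sh -c '. /nonexistent-setup-script.sh; echo after-source' 2>/dev/null)
status=$?
echo "exit=$status output='$out'"
```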
Fix
- Add the mount in `_prepare_default_mounts()`:

```python
self._parse_mount("runner/setup_dist_environ.sh", "/opt/kernel/setup_dist_environ.sh"),
```

- Guard the source in `entrypoint.sh` (both root and non-root paths) for robustness against older agent versions:

```sh
if [ -f /opt/kernel/setup_dist_environ.sh ]; then
    . /opt/kernel/setup_dist_environ.sh
fi
```

Minor: stale description & changelog
The PR description table and changes/10726.feature.md still list WORLD_SIZE, RANK, LOCAL_RANK as derived variables, but commit 14320d7 intentionally removed those. These should be updated to reflect the current behavior (only MASTER_ADDR, MASTER_PORT, and TF_CONFIG).
The new distributed training script was sourced unconditionally in entrypoint.sh but never mounted into the container, causing all session startups to fail. Add the missing bind-mount in the krunner mount provisioner and guard both source calls with a file-existence check for robustness against older agent versions. Also update the changelog to reflect that WORLD_SIZE, RANK, and LOCAL_RANK are no longer set (removed in 14320d7). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
```python
return [
    self._parse_mount("runner/extract_dotfiles.py", "/opt/kernel/extract_dotfiles.py"),
    self._parse_mount("runner/entrypoint.sh", "/opt/kernel/entrypoint.sh"),
    self._parse_mount("runner/setup_dist_environ.sh", "/opt/kernel/setup_dist_environ.sh"),
```
Please update src/ai/backend/agent/agent.py mount_krunner() method
- Add `setup_dist_environ.sh` mount to `agent.py` `mount_krunner()` (fregataa)
- Add `setup_dist_environ.sh` to Kubernetes `copy_runner_files()`
- Add warning message when `setup_dist_environ.sh` is not found
- Eliminate leaked internal variables by using underscore-prefixed names and unsetting them after use
- Save/restore `IFS` instead of unsetting it to preserve caller state
- Use inline expressions instead of intermediate variables for `MASTER_ADDR` and `MASTER_PORT` to avoid polluting the environment
- Add fallback for `BACKENDAI_CLUSTER_LOCAL_RANK` in the `TF_CONFIG` JSON

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Changes since last review

Addressed all feedback from @fregataa, Copilot, and my own production-readiness review.

Fixes applied
Mount provisioners updated (all three)

The script is now registered in all container provisioning paths.

All lint, type checks, and pre-commit hooks pass.
Summary
Resolves #4243 (BA-1214) — alternative approach to PR #4244 (which was closed with changes requested).
Instead of computing PyTorch/TensorFlow distributed training environment variables in the manager registry, this PR derives them from the existing `BACKENDAI_*` cluster variables at container startup via a sourced shell script in the runner entrypoint.

- Uses the conventional default master port `29500` instead of hardcoded `12345`; overridable via `BACKENDAI_DIST_MASTER_PORT`
- Skipped for single-container sessions (`BACKENDAI_CLUSTER_SIZE <= 1`)

Variables derived
| Variable | Derived from |
|---|---|
| `MASTER_ADDR` | `BACKENDAI_CLUSTER_HOSTS` |
| `MASTER_PORT` | `BACKENDAI_DIST_MASTER_PORT` (default: `29500`) |
| `TF_CONFIG` | `BACKENDAI_CLUSTER_HOSTS` + `BACKENDAI_CLUSTER_LOCAL_RANK` |

Files changed
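To show how these pieces fit together, here is a hedged sketch of assembling `TF_CONFIG` from the cluster variables above. The worker-list loop mirrors the script in this PR; the JSON shape follows TensorFlow's documented `TF_CONFIG` format (`{"cluster": {"worker": [...]}, "task": {...}}`); host names and rank are illustrative.

```shell
BACKENDAI_CLUSTER_HOSTS="main1,sub1"
BACKENDAI_DIST_MASTER_PORT="29500"
BACKENDAI_CLUSTER_LOCAL_RANK="1"

# Split the comma-separated host list into a quoted worker list,
# preserving the caller's IFS.
_saved_IFS="$IFS"
IFS=','
TF_WORKER_LIST=""
for host in $BACKENDAI_CLUSTER_HOSTS; do
  if [ -n "$TF_WORKER_LIST" ]; then
    TF_WORKER_LIST="${TF_WORKER_LIST},"
  fi
  TF_WORKER_LIST="${TF_WORKER_LIST}\"${host}:${BACKENDAI_DIST_MASTER_PORT}\""
done
IFS="$_saved_IFS"
unset _saved_IFS

# Fall back to rank 0 if the rank variable is somehow unset.
TF_CONFIG="{\"cluster\": {\"worker\": [${TF_WORKER_LIST}]}, \"task\": {\"type\": \"worker\", \"index\": ${BACKENDAI_CLUSTER_LOCAL_RANK:-0}}}"
echo "$TF_CONFIG"
```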
- `src/ai/backend/runner/setup_dist_environ.sh`: new script sourced during container init
- `src/ai/backend/runner/entrypoint.sh`: sources the new script (both root and non-root paths) with a file-existence guard
- `src/ai/backend/agent/agent.py`: mounts the script in `mount_krunner()` (legacy Docker path)
- `src/ai/backend/agent/stage/kernel_lifecycle/docker/mount/krunner.py`: mounts the script in `_prepare_default_mounts()` (stage-based Docker path)
- `src/ai/backend/agent/kubernetes/kernel.py`: includes the script in `copy_runner_files()`
- `docs/concepts/networking.rst`: documents the new variables

Test plan
- `MASTER_ADDR`, `MASTER_PORT` are correctly set
- `TF_CONFIG` is correctly set
- Set `MASTER_PORT` manually in session environ → the manual value is preserved (not overridden)
- Set `BACKENDAI_DIST_MASTER_PORT=12345` → `MASTER_PORT` uses `12345` instead of `29500`

🤖 Generated with Claude Code
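The override precedence exercised by the last two test-plan items can be sketched as follows (a minimal model, assuming the script only assigns `MASTER_PORT` when it is empty):

```shell
MASTER_PORT="12000"                 # pretend the user set this in session environ
BACKENDAI_DIST_MASTER_PORT="12345"  # agent-level override of the base port

# Only derive MASTER_PORT when the user has not already set it;
# otherwise prefer BACKENDAI_DIST_MASTER_PORT, then the 29500 default.
if [ -z "$MASTER_PORT" ]; then
  export MASTER_PORT="${BACKENDAI_DIST_MASTER_PORT:-29500}"
fi
echo "$MASTER_PORT"
```

With `MASTER_PORT` pre-set, the manual value wins; unset it and the `BACKENDAI_DIST_MASTER_PORT` value (or the `29500` default) is used instead.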
📚 Documentation preview 📚: https://sorna--10726.org.readthedocs.build/en/10726/
📚 Documentation preview 📚: https://sorna-ko--10726.org.readthedocs.build/ko/10726/