feat(BA-1214): Derive PyTorch/TensorFlow distributed training env vars at container startup#10726

Open
rapsealk wants to merge 8 commits into main from feature/distributed-training-env-vars-runner

Conversation

rapsealk (Member) commented Apr 1, 2026

Summary

Resolves #4243 (BA-1214) — alternative approach to PR #4244 (which was closed with changes requested).

Instead of computing PyTorch/TensorFlow distributed training environment variables in the manager registry, this PR derives them from the existing BACKENDAI_* cluster variables at container startup via a sourced shell script in the runner entrypoint.

  • No manager changes needed — addresses the reviewer feedback that the registry was getting too large
  • No fragile image-name matching — works for any cluster session regardless of image name
  • User-overridable — if a variable is already set (e.g., via session environ or image bootstrap), the user's value takes precedence
  • Configurable port — uses PyTorch's default 29500 instead of hardcoded 12345; overridable via BACKENDAI_DIST_MASTER_PORT
  • Only activates for cluster sessions — single-container sessions are unaffected (BACKENDAI_CLUSTER_SIZE <= 1)
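The precedence rule above can be sketched in POSIX sh; the pre-set value 10000 is a made-up user setting for illustration, not taken from the PR.

```shell
# Sketch of the "user-overridable" rule: only fill in a variable when the
# user (session environ / image bootstrap) has not already provided one.
export MASTER_PORT=10000    # pretend the user set this in the session environ
if [ -z "${MASTER_PORT:-}" ]; then
  export MASTER_PORT="${BACKENDAI_DIST_MASTER_PORT:-29500}"
fi
echo "$MASTER_PORT"         # the user's value is preserved
```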

Variables derived

  • MASTER_ADDR → first entry in BACKENDAI_CLUSTER_HOSTS
  • MASTER_PORT → BACKENDAI_DIST_MASTER_PORT (default: 29500)
  • TF_CONFIG → constructed from BACKENDAI_CLUSTER_HOSTS + BACKENDAI_CLUSTER_LOCAL_RANK

Note: WORLD_SIZE, RANK, and LOCAL_RANK are intentionally not pre-set. Launchers like torchrun set these per-process based on the number of GPUs per node. Pre-setting them at the container level would conflict with multi-GPU-per-node setups.
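As a rough illustration (not the PR's exact script), the derivation could look like the following POSIX-sh sketch; the BACKENDAI_* values at the top are sample assumptions for a two-node cluster session.

```shell
# Sample cluster values (assumptions for demonstration only).
unset MASTER_ADDR MASTER_PORT TF_CONFIG
BACKENDAI_CLUSTER_SIZE=2
BACKENDAI_CLUSTER_HOSTS="main1,sub1"
BACKENDAI_CLUSTER_LOCAL_RANK=1

if [ "${BACKENDAI_CLUSTER_SIZE:-1}" -gt 1 ]; then
  # User-set values take precedence; only fill in what is missing.
  if [ -z "${MASTER_ADDR:-}" ]; then
    export MASTER_ADDR="${BACKENDAI_CLUSTER_HOSTS%%,*}"   # first host in the list
  fi
  if [ -z "${MASTER_PORT:-}" ]; then
    export MASTER_PORT="${BACKENDAI_DIST_MASTER_PORT:-29500}"
  fi

  if [ -z "${TF_CONFIG:-}" ]; then
    _workers=""
    _OLD_IFS=$IFS
    IFS=','
    for _host in $BACKENDAI_CLUSTER_HOSTS; do
      if [ -n "$_workers" ]; then _workers="${_workers},"; fi
      _workers="${_workers}\"${_host}:${MASTER_PORT}\""
    done
    IFS=$_OLD_IFS
    export TF_CONFIG="{\"cluster\": {\"worker\": [${_workers}]}, \"task\": {\"type\": \"worker\", \"index\": ${BACKENDAI_CLUSTER_LOCAL_RANK:-0}}}"
    unset _workers _host _OLD_IFS   # avoid leaking temps into the environment
  fi
fi
echo "$MASTER_ADDR:$MASTER_PORT"
```

With the sample values this yields MASTER_ADDR=main1, MASTER_PORT=29500, and a TF_CONFIG whose worker list enumerates both hosts.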

Files changed

  • src/ai/backend/runner/setup_dist_environ.sh — new script sourced during container init
  • src/ai/backend/runner/entrypoint.sh — sources the new script (both root and non-root paths) with file-existence guard
  • src/ai/backend/agent/agent.py — mount the script in mount_krunner() (legacy Docker path)
  • src/ai/backend/agent/stage/kernel_lifecycle/docker/mount/krunner.py — mount the script in _prepare_default_mounts() (stage-based Docker path)
  • src/ai/backend/agent/kubernetes/kernel.py — include the script in copy_runner_files()
  • docs/concepts/networking.rst — documents the new variables

Test plan

  • Launch a single-container session → no distributed training vars are set
  • Launch a multi-container cluster session with a PyTorch image → MASTER_ADDR, MASTER_PORT are correctly set
  • Launch a multi-container cluster session with a TensorFlow image → TF_CONFIG is correctly set
  • Set MASTER_PORT manually in session environ → the manual value is preserved (not overridden)
  • Set BACKENDAI_DIST_MASTER_PORT=12345 → MASTER_PORT uses 12345 instead of 29500

🤖 Generated with Claude Code


📚 Documentation preview 📚: https://sorna--10726.org.readthedocs.build/en/10726/


📚 Documentation preview 📚: https://sorna-ko--10726.org.readthedocs.build/ko/10726/

…ner startup

Instead of computing PyTorch/TensorFlow environment variables in the manager
registry (as proposed in PR #4244), derive them from existing BACKENDAI_*
cluster variables at container startup via a sourced shell script. This keeps
the manager clean, avoids fragile image-name matching, and lets any image
benefit from the setup automatically.

Resolves #4243

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 1, 2026 11:12
@github-actions github-actions bot added size:M 30~100 LoC area:docs Documentations labels Apr 1, 2026
Copilot AI (Contributor) left a comment


Pull request overview

This PR adds container-startup logic in the runner to derive PyTorch/TensorFlow distributed training environment variables from existing BACKENDAI_* cluster session variables, avoiding manager-side registry changes and making the behavior image-agnostic and user-overridable.

Changes:

  • Add setup_dist_environ.sh to derive WORLD_SIZE, RANK, LOCAL_RANK, MASTER_ADDR, MASTER_PORT, and TF_CONFIG for multi-container cluster sessions.
  • Source the new script from the runner entrypoint.sh in both root and non-root execution paths.
  • Document the derived distributed-training variables and override behavior in the networking concepts doc.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

  • src/ai/backend/runner/setup_dist_environ.sh — new sourced script that derives framework env vars from BACKENDAI_* cluster variables
  • src/ai/backend/runner/entrypoint.sh — sources the new distributed-env setup script during container initialization
  • docs/concepts/networking.rst — documents the new derived env vars and precedence/activation rules


export RANK="$BACKENDAI_CLUSTER_LOCAL_RANK"
fi
if [ -z "$LOCAL_RANK" ]; then
export LOCAL_RANK="$BACKENDAI_CLUSTER_LOCAL_RANK"

Copilot AI Apr 1, 2026


LOCAL_RANK is being derived from BACKENDAI_CLUSTER_LOCAL_RANK, but this Backend.AI variable is documented as a global rank (and is assigned sequentially across all kernels). In PyTorch/TorchElastic conventions, LOCAL_RANK is typically node/container-local and is commonly used for device selection; setting it to the global rank can exceed the number of visible devices in a container (e.g., rank 3 with a single GPU) and break training. Consider not setting LOCAL_RANK by default (leave it to launchers like torchrun), or default it to 0 when running one process per container.

Suggested change
export LOCAL_RANK="$BACKENDAI_CLUSTER_LOCAL_RANK"
# BACKENDAI_CLUSTER_LOCAL_RANK is a global rank across all kernels.
# LOCAL_RANK should be node/container-local; default to 0 for one process
# per container and let launchers like torchrun override it if needed.
export LOCAL_RANK="0"

Comment on lines +45 to +52
IFS=','
for host in $BACKENDAI_CLUSTER_HOSTS; do
if [ -n "$TF_WORKER_LIST" ]; then
TF_WORKER_LIST="${TF_WORKER_LIST},"
fi
TF_WORKER_LIST="${TF_WORKER_LIST}\"${host}:${DIST_MASTER_PORT}\""
done
unset IFS

Copilot AI Apr 1, 2026


This script is sourced from the entrypoint, but it mutates IFS and then does unset IFS. If the caller had a non-default IFS, unsetting will not restore it, potentially impacting later shell logic in the entrypoint. Save the original IFS before changing it and restore it afterwards.
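The save/restore pattern the comment asks for might look like the sketch below; the colon IFS is a stand-in for whatever non-default value the sourcing entrypoint could be using.

```shell
# `unset IFS` would restore the *default* splitting behavior, not the
# caller's custom IFS. Save the original and restore it instead.
IFS=':'              # pretend the sourcing entrypoint relies on this
_OLD_IFS=$IFS        # save before mutating
IFS=','
hosts="main1,sub1"
joined=""
for h in $hosts; do  # comma-splitting of the expanded variable
  joined="${joined}${joined:+ }${h}"
done
IFS=$_OLD_IFS        # restore the caller's IFS (here ':')
unset _OLD_IFS
echo "$joined"
```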

Comment thread docs/concepts/networking.rst Outdated
- Zero-based global index of the current container within the cluster. Derived from ``BACKENDAI_CLUSTER_LOCAL_RANK``.
- ``0``
* - ``LOCAL_RANK``
- Zero-based local index of the current container. Derived from ``BACKENDAI_CLUSTER_LOCAL_RANK``.

Copilot AI Apr 1, 2026


The new docs describe BACKENDAI_CLUSTER_LOCAL_RANK as a global index (line 59), but this table says LOCAL_RANK is a local index while still being derived from BACKENDAI_CLUSTER_LOCAL_RANK. Please reconcile the terminology: either describe LOCAL_RANK as global in this context, or change the derivation/variable names so LOCAL_RANK matches framework expectations.

Suggested change
- Zero-based local index of the current container. Derived from ``BACKENDAI_CLUSTER_LOCAL_RANK``.
- Zero-based global index of the current container in this Backend.AI cluster (alias of ``RANK`` for compatibility). Derived from ``BACKENDAI_CLUSTER_LOCAL_RANK``.

Comment thread src/ai/backend/runner/entrypoint.sh Outdated
fi

# Set up distributed training environment variables from BACKENDAI_* cluster vars.
. /opt/kernel/setup_dist_environ.sh

Copilot AI Apr 1, 2026


The entrypoint unconditionally sources /opt/kernel/setup_dist_environ.sh. To avoid noisy startup errors (or failures if set -e is introduced later), consider guarding this with a file-existence check and emitting a clear warning only when expected.

Suggested change
. /opt/kernel/setup_dist_environ.sh
if [ -f "/opt/kernel/setup_dist_environ.sh" ]; then
. /opt/kernel/setup_dist_environ.sh
else
echo "WARNING: /opt/kernel/setup_dist_environ.sh not found; distributed training environment variables were not initialized."
fi

@rapsealk rapsealk changed the title feat: derive distributed training env vars from BACKENDAI_* at container startup feat: Derive distributed training env vars from BACKENDAI_* at container startup Apr 1, 2026
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added size:L 100~500 LoC and removed size:M 30~100 LoC labels Apr 1, 2026
@rapsealk rapsealk requested a review from a team April 1, 2026 11:22
@rapsealk rapsealk added this to the 26.4 milestone Apr 1, 2026
@rapsealk rapsealk changed the title feat: Derive distributed training env vars from BACKENDAI_* at container startup feat(BA-1214): Derive distributed training env vars from BACKENDAI_* at container startup Apr 1, 2026
@rapsealk rapsealk changed the title feat(BA-1214): Derive distributed training env vars from BACKENDAI_* at container startup feat: Derive PyTorch/TensorFlow distributed training env vars at container startup Apr 1, 2026
…hers

WORLD_SIZE, RANK, and LOCAL_RANK are intentionally not pre-set at the
container level — launchers like torchrun set these per-process based on
GPUs per node, and pre-setting them would conflict in multi-GPU setups.

Also assign unique ports per TF worker (base port + rank) to avoid
conflicts when multiple workers share a host.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added size:M 30~100 LoC and removed size:L 100~500 LoC labels Apr 1, 2026
rapsealk (Member, Author) commented Apr 1, 2026

Coverage for both frameworks

  • PyTorch: MASTER_ADDR + MASTER_PORT are the two variables that torchrun needs from the environment but cannot auto-discover. WORLD_SIZE, RANK, and LOCAL_RANK are set per-process by torchrun based on GPUs per node.
  • TensorFlow: TF_CONFIG is the standard mechanism for MultiWorkerMirroredStrategy — it's self-contained (includes full cluster topology and task index). TF does not use MASTER_ADDR/MASTER_PORT.

Both frameworks are fully covered with the current setup.
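For context, a hypothetical torchrun launch in a two-node session would then only need the derived address and port. The values below are sample assumptions, and the command is echoed rather than executed since torchrun may not be installed where this sketch runs.

```shell
# Sample derived environment (assumptions for demonstration only).
BACKENDAI_CLUSTER_SIZE=2
BACKENDAI_CLUSTER_LOCAL_RANK=0
MASTER_ADDR=main1
MASTER_PORT=29500
# Build the launch command; torchrun itself fills in WORLD_SIZE, RANK,
# and LOCAL_RANK per process based on --nproc-per-node.
cmd="torchrun --nnodes=${BACKENDAI_CLUSTER_SIZE} --node-rank=${BACKENDAI_CLUSTER_LOCAL_RANK} --master-addr=${MASTER_ADDR} --master-port=${MASTER_PORT} train.py"
echo "$cmd"
```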

@fregataa fregataa changed the title feat: Derive PyTorch/TensorFlow distributed training env vars at container startup feat(BA-1214): Derive PyTorch/TensorFlow distributed training env vars at container startup Apr 2, 2026
fregataa (Member) left a comment


Cannot create a single-kernel session; could you check once more? @rapsealk

rapsealk (Member, Author) left a comment


Missing mount breaks all session startups

entrypoint.sh unconditionally sources the new script:

. /opt/kernel/setup_dist_environ.sh

But setup_dist_environ.sh is not mounted into the container. The mount provisioner (src/ai/backend/agent/stage/kernel_lifecycle/docker/mount/krunner.py) explicitly lists every file that gets bind-mounted to /opt/kernel/, and this file is missing from _prepare_default_mounts().

In POSIX sh, sourcing a non-existent file with . is a fatal error — the shell exits immediately. This means every session startup crashes, not just single-kernel ones.

Fix

  1. Add the mount in _prepare_default_mounts():
self._parse_mount("runner/setup_dist_environ.sh", "/opt/kernel/setup_dist_environ.sh"),
  2. Guard the source in entrypoint.sh (both root and non-root paths) for robustness against older agent versions:
if [ -f /opt/kernel/setup_dist_environ.sh ]; then
  . /opt/kernel/setup_dist_environ.sh
fi

Minor: stale description & changelog

The PR description table and changes/10726.feature.md still list WORLD_SIZE, RANK, LOCAL_RANK as derived variables, but commit 14320d7 intentionally removed those. These should be updated to reflect the current behavior (only MASTER_ADDR, MASTER_PORT, and TF_CONFIG).

The new distributed training script was sourced unconditionally in
entrypoint.sh but never mounted into the container, causing all
session startups to fail. Add the missing bind-mount in the krunner
mount provisioner and guard both source calls with a file-existence
check for robustness against older agent versions.

Also update the changelog to reflect that WORLD_SIZE, RANK, and
LOCAL_RANK are no longer set (removed in 14320d7).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added size:L 100~500 LoC comp:agent Related to Agent component and removed size:M 30~100 LoC labels Apr 2, 2026
return [
self._parse_mount("runner/extract_dotfiles.py", "/opt/kernel/extract_dotfiles.py"),
self._parse_mount("runner/entrypoint.sh", "/opt/kernel/entrypoint.sh"),
self._parse_mount("runner/setup_dist_environ.sh", "/opt/kernel/setup_dist_environ.sh"),
fregataa (Member) commented Apr 2, 2026


Please update src/ai/backend/agent/agent.py mount_krunner() method

- Add setup_dist_environ.sh mount to agent.py mount_krunner() (fregataa)
- Add setup_dist_environ.sh to Kubernetes copy_runner_files()
- Add warning message when setup_dist_environ.sh is not found
- Eliminate leaked internal variables by using underscore-prefixed names
  and unsetting them after use
- Save/restore IFS instead of unsetting it to preserve caller state
- Use inline expressions instead of intermediate variables for
  MASTER_ADDR and MASTER_PORT to avoid polluting the environment
- Add fallback for BACKENDAI_CLUSTER_LOCAL_RANK in TF_CONFIG JSON

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
rapsealk (Member, Author) commented Apr 2, 2026

Changes since last review

Addressed all feedback from @fregataa, Copilot, and my own production-readiness review:

Fixes applied

  • @fregataa: missing mount in agent.py mount_krunner() → added setup_dist_environ.sh bind-mount (agent.py:672-675)
  • Review: Kubernetes copy_runner_files missing the script → added to target_files list (kubernetes/kernel.py:534)
  • Review: internal variables (BACKENDAI_DIST_MASTER_ADDR, TF_WORKER_*, etc.) leaked into container environment → eliminated intermediate vars for MASTER_ADDR/MASTER_PORT; underscore-prefixed + unset for TF_CONFIG temps
  • Review: BACKENDAI_CLUSTER_LOCAL_RANK fallback in TF_CONFIG → ${BACKENDAI_CLUSTER_LOCAL_RANK:-0} prevents malformed JSON
  • Copilot: IFS not restored after mutation → save/restore with _OLD_IFS instead of unset IFS
  • Copilot: no warning when script file not found → added else branch with warning message in both entrypoint paths
  • Review: PR description listed stale vars (WORLD_SIZE, RANK, LOCAL_RANK) → updated description and changelog to match actual behavior
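The ${BACKENDAI_CLUSTER_LOCAL_RANK:-0} fallback mentioned above can be illustrated with a short sketch; the variable name comes from the PR, but the surrounding snippet is illustrative only.

```shell
# An unset rank must still yield valid JSON for TF_CONFIG's task index:
# the :-0 parameter expansion substitutes 0 when the variable is unset or empty.
unset BACKENDAI_CLUSTER_LOCAL_RANK
task_json="{\"type\": \"worker\", \"index\": ${BACKENDAI_CLUSTER_LOCAL_RANK:-0}}"
echo "$task_json"
```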

Mount provisioners updated (all three)

The script is now registered in all container provisioning paths:

  • agent.py mount_krunner() — legacy Docker
  • krunner.py _prepare_default_mounts() — stage-based Docker
  • kubernetes/kernel.py copy_runner_files() — Kubernetes

All lint, type checks, and pre-commit hooks pass.


Labels

area:docs Documentations comp:agent Related to Agent component size:L 100~500 LoC

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add default environment variables for PyTorch/TensorFlow distributed training

3 participants