feat(BA-1214): Derive PyTorch/TensorFlow distributed training env vars at container startup#10726

Open
rapsealk wants to merge 8 commits into main from feature/distributed-training-env-vars-runner

Conversation

rapsealk (Member) commented Apr 1, 2026

Summary

Resolves #4243 (BA-1214) — alternative approach to PR #4244 (which was closed with changes requested).

Instead of computing PyTorch/TensorFlow distributed training environment variables in the manager registry, this PR derives them from the existing BACKENDAI_* cluster variables at container startup via a sourced shell script in the runner entrypoint.

  • No manager changes needed — addresses the reviewer feedback that the registry was getting too large
  • No fragile image-name matching — works for any cluster session regardless of image name
  • User-overridable — if a variable is already set (e.g., via session environ or image bootstrap), the user's value takes precedence
  • Configurable port — uses PyTorch's default 29500 instead of hardcoded 12345; overridable via BACKENDAI_DIST_MASTER_PORT
  • Only activates for cluster sessions — single-container sessions are unaffected (BACKENDAI_CLUSTER_SIZE <= 1)
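The precedence rule above can be sketched in POSIX sh; the pre-set value 10000 is a made-up user setting for illustration, not taken from the PR.

```shell
# Sketch of the "user-overridable" rule: only fill in a variable when the
# user (session environ / image bootstrap) has not already provided one.
export MASTER_PORT=10000    # pretend the user set this in the session environ
if [ -z "${MASTER_PORT:-}" ]; then
  export MASTER_PORT="${BACKENDAI_DIST_MASTER_PORT:-29500}"
fi
echo "$MASTER_PORT"         # the user's value is preserved
```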

Variables derived

  • MASTER_ADDR → first entry in BACKENDAI_CLUSTER_HOSTS
  • MASTER_PORT → BACKENDAI_DIST_MASTER_PORT (default: 29500)
  • TF_CONFIG → constructed from BACKENDAI_CLUSTER_HOSTS + BACKENDAI_CLUSTER_LOCAL_RANK

Note: WORLD_SIZE, RANK, and LOCAL_RANK are intentionally not pre-set. Launchers like torchrun set these per-process based on the number of GPUs per node. Pre-setting them at the container level would conflict with multi-GPU-per-node setups.
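As a rough illustration (not the PR's exact script), the derivation could look like the following POSIX-sh sketch; the BACKENDAI_* values at the top are sample assumptions for a two-node cluster session.

```shell
# Sample cluster values (assumptions for demonstration only).
unset MASTER_ADDR MASTER_PORT TF_CONFIG
BACKENDAI_CLUSTER_SIZE=2
BACKENDAI_CLUSTER_HOSTS="main1,sub1"
BACKENDAI_CLUSTER_LOCAL_RANK=1

if [ "${BACKENDAI_CLUSTER_SIZE:-1}" -gt 1 ]; then
  # User-set values take precedence; only fill in what is missing.
  if [ -z "${MASTER_ADDR:-}" ]; then
    export MASTER_ADDR="${BACKENDAI_CLUSTER_HOSTS%%,*}"   # first host in the list
  fi
  if [ -z "${MASTER_PORT:-}" ]; then
    export MASTER_PORT="${BACKENDAI_DIST_MASTER_PORT:-29500}"
  fi

  if [ -z "${TF_CONFIG:-}" ]; then
    _workers=""
    _OLD_IFS=$IFS
    IFS=','
    for _host in $BACKENDAI_CLUSTER_HOSTS; do
      if [ -n "$_workers" ]; then _workers="${_workers},"; fi
      _workers="${_workers}\"${_host}:${MASTER_PORT}\""
    done
    IFS=$_OLD_IFS
    export TF_CONFIG="{\"cluster\": {\"worker\": [${_workers}]}, \"task\": {\"type\": \"worker\", \"index\": ${BACKENDAI_CLUSTER_LOCAL_RANK:-0}}}"
    unset _workers _host _OLD_IFS   # avoid leaking temps into the environment
  fi
fi
echo "$MASTER_ADDR:$MASTER_PORT"
```

With the sample values this yields MASTER_ADDR=main1, MASTER_PORT=29500, and a TF_CONFIG whose worker list enumerates both hosts.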

Files changed

  • src/ai/backend/runner/setup_dist_environ.sh — new script sourced during container init
  • src/ai/backend/runner/entrypoint.sh — sources the new script (both root and non-root paths) with file-existence guard
  • src/ai/backend/agent/agent.py — mount the script in mount_krunner() (legacy Docker path)
  • src/ai/backend/agent/stage/kernel_lifecycle/docker/mount/krunner.py — mount the script in _prepare_default_mounts() (stage-based Docker path)
  • src/ai/backend/agent/kubernetes/kernel.py — include the script in copy_runner_files()
  • docs/concepts/networking.rst — documents the new variables

Test plan

  • Launch a single-container session → no distributed training vars are set
  • Launch a multi-container cluster session with a PyTorch image → MASTER_ADDR, MASTER_PORT are correctly set
  • Launch a multi-container cluster session with a TensorFlow image → TF_CONFIG is correctly set
  • Set MASTER_PORT manually in session environ → the manual value is preserved (not overridden)
  • Set BACKENDAI_DIST_MASTER_PORT=12345 → MASTER_PORT uses 12345 instead of 29500

🤖 Generated with Claude Code


📚 Documentation preview 📚: https://sorna--10726.org.readthedocs.build/en/10726/


📚 Documentation preview 📚: https://sorna-ko--10726.org.readthedocs.build/ko/10726/

…ner startup

Instead of computing PyTorch/TensorFlow environment variables in the manager
registry (as proposed in PR #4244), derive them from existing BACKENDAI_*
cluster variables at container startup via a sourced shell script. This keeps
the manager clean, avoids fragile image-name matching, and lets any image
benefit from the setup automatically.

Resolves #4243

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 1, 2026 11:12
@github-actions github-actions bot added size:M 30~100 LoC area:docs Documentations labels Apr 1, 2026
Copilot AI (Contributor) left a comment


Pull request overview

This PR adds container-startup logic in the runner to derive PyTorch/TensorFlow distributed training environment variables from existing BACKENDAI_* cluster session variables, avoiding manager-side registry changes and making the behavior image-agnostic and user-overridable.

Changes:

  • Add setup_dist_environ.sh to derive WORLD_SIZE, RANK, LOCAL_RANK, MASTER_ADDR, MASTER_PORT, and TF_CONFIG for multi-container cluster sessions.
  • Source the new script from the runner entrypoint.sh in both root and non-root execution paths.
  • Document the derived distributed-training variables and override behavior in the networking concepts doc.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

  • src/ai/backend/runner/setup_dist_environ.sh — new sourced script that derives framework env vars from BACKENDAI_* cluster variables
  • src/ai/backend/runner/entrypoint.sh — sources the new distributed-env setup script during container initialization
  • docs/concepts/networking.rst — documents the new derived env vars and precedence/activation rules


export RANK="$BACKENDAI_CLUSTER_LOCAL_RANK"
fi
if [ -z "$LOCAL_RANK" ]; then
export LOCAL_RANK="$BACKENDAI_CLUSTER_LOCAL_RANK"

Copilot AI Apr 1, 2026


LOCAL_RANK is being derived from BACKENDAI_CLUSTER_LOCAL_RANK, but this Backend.AI variable is documented as a global rank (and is assigned sequentially across all kernels). In PyTorch/TorchElastic conventions, LOCAL_RANK is typically node/container-local and is commonly used for device selection; setting it to the global rank can exceed the number of visible devices in a container (e.g., rank 3 with a single GPU) and break training. Consider not setting LOCAL_RANK by default (leave it to launchers like torchrun), or default it to 0 when running one process per container.

Suggested change
export LOCAL_RANK="$BACKENDAI_CLUSTER_LOCAL_RANK"
# BACKENDAI_CLUSTER_LOCAL_RANK is a global rank across all kernels.
# LOCAL_RANK should be node/container-local; default to 0 for one process
# per container and let launchers like torchrun override it if needed.
export LOCAL_RANK="0"

Comment on lines +45 to +52
IFS=','
for host in $BACKENDAI_CLUSTER_HOSTS; do
if [ -n "$TF_WORKER_LIST" ]; then
TF_WORKER_LIST="${TF_WORKER_LIST},"
fi
TF_WORKER_LIST="${TF_WORKER_LIST}\"${host}:${DIST_MASTER_PORT}\""
done
unset IFS

Copilot AI Apr 1, 2026


This script is sourced from the entrypoint, but it mutates IFS and then does unset IFS. If the caller had a non-default IFS, unsetting will not restore it, potentially impacting later shell logic in the entrypoint. Save the original IFS before changing it and restore it afterwards.
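The save/restore pattern the comment asks for might look like the sketch below; the colon IFS is a stand-in for whatever non-default value the sourcing entrypoint could be using.

```shell
# `unset IFS` would restore the *default* splitting behavior, not the
# caller's custom IFS. Save the original and restore it instead.
IFS=':'              # pretend the sourcing entrypoint relies on this
_OLD_IFS=$IFS        # save before mutating
IFS=','
hosts="main1,sub1"
joined=""
for h in $hosts; do  # comma-splitting of the expanded variable
  joined="${joined}${joined:+ }${h}"
done
IFS=$_OLD_IFS        # restore the caller's IFS (here ':')
unset _OLD_IFS
echo "$joined"
```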

Comment thread docs/concepts/networking.rst Outdated
- Zero-based global index of the current container within the cluster. Derived from ``BACKENDAI_CLUSTER_LOCAL_RANK``.
- ``0``
* - ``LOCAL_RANK``
- Zero-based local index of the current container. Derived from ``BACKENDAI_CLUSTER_LOCAL_RANK``.

Copilot AI Apr 1, 2026


The new docs describe BACKENDAI_CLUSTER_LOCAL_RANK as a global index (line 59), but this table says LOCAL_RANK is a local index while still being derived from BACKENDAI_CLUSTER_LOCAL_RANK. Please reconcile the terminology: either describe LOCAL_RANK as global in this context, or change the derivation/variable names so LOCAL_RANK matches framework expectations.

Suggested change
- Zero-based local index of the current container. Derived from ``BACKENDAI_CLUSTER_LOCAL_RANK``.
- Zero-based global index of the current container in this Backend.AI cluster (alias of ``RANK`` for compatibility). Derived from ``BACKENDAI_CLUSTER_LOCAL_RANK``.

Comment thread src/ai/backend/runner/entrypoint.sh Outdated
fi

# Set up distributed training environment variables from BACKENDAI_* cluster vars.
. /opt/kernel/setup_dist_environ.sh

Copilot AI Apr 1, 2026


The entrypoint unconditionally sources /opt/kernel/setup_dist_environ.sh. To avoid noisy startup errors (or failures if set -e is introduced later), consider guarding this with a file-existence check and emitting a clear warning only when expected.

Suggested change
. /opt/kernel/setup_dist_environ.sh
if [ -f "/opt/kernel/setup_dist_environ.sh" ]; then
. /opt/kernel/setup_dist_environ.sh
else
echo "WARNING: /opt/kernel/setup_dist_environ.sh not found; distributed training environment variables were not initialized."
fi

@rapsealk rapsealk changed the title feat: derive distributed training env vars from BACKENDAI_* at container startup feat: Derive distributed training env vars from BACKENDAI_* at container startup Apr 1, 2026
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added size:L 100~500 LoC and removed size:M 30~100 LoC labels Apr 1, 2026
@rapsealk rapsealk requested a review from a team April 1, 2026 11:22
@rapsealk rapsealk added this to the 26.4 milestone Apr 1, 2026
@rapsealk rapsealk changed the title feat: Derive distributed training env vars from BACKENDAI_* at container startup feat(BA-1214): Derive distributed training env vars from BACKENDAI_* at container startup Apr 1, 2026
@rapsealk rapsealk changed the title feat(BA-1214): Derive distributed training env vars from BACKENDAI_* at container startup feat: Derive PyTorch/TensorFlow distributed training env vars at container startup Apr 1, 2026
…hers

WORLD_SIZE, RANK, and LOCAL_RANK are intentionally not pre-set at the
container level — launchers like torchrun set these per-process based on
GPUs per node, and pre-setting them would conflict in multi-GPU setups.

Also assign unique ports per TF worker (base port + rank) to avoid
conflicts when multiple workers share a host.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added size:M 30~100 LoC and removed size:L 100~500 LoC labels Apr 1, 2026
rapsealk (Member, Author) commented Apr 1, 2026

Coverage for both frameworks

  • PyTorch: MASTER_ADDR + MASTER_PORT are the two variables that torchrun needs from the environment but cannot auto-discover. WORLD_SIZE, RANK, and LOCAL_RANK are set per-process by torchrun based on GPUs per node.
  • TensorFlow: TF_CONFIG is the standard mechanism for MultiWorkerMirroredStrategy — it's self-contained (includes full cluster topology and task index). TF does not use MASTER_ADDR/MASTER_PORT.

Both frameworks are fully covered with the current setup.
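For context, a hypothetical torchrun launch in a two-node session would then only need the derived address and port. The values below are sample assumptions, and the command is echoed rather than executed since torchrun may not be installed where this sketch runs.

```shell
# Sample derived environment (assumptions for demonstration only).
BACKENDAI_CLUSTER_SIZE=2
BACKENDAI_CLUSTER_LOCAL_RANK=0
MASTER_ADDR=main1
MASTER_PORT=29500
# Build the launch command; torchrun itself fills in WORLD_SIZE, RANK,
# and LOCAL_RANK per process based on --nproc-per-node.
cmd="torchrun --nnodes=${BACKENDAI_CLUSTER_SIZE} --node-rank=${BACKENDAI_CLUSTER_LOCAL_RANK} --master-addr=${MASTER_ADDR} --master-port=${MASTER_PORT} train.py"
echo "$cmd"
```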

@fregataa fregataa changed the title feat: Derive PyTorch/TensorFlow distributed training env vars at container startup feat(BA-1214): Derive PyTorch/TensorFlow distributed training env vars at container startup Apr 2, 2026
fregataa (Member) left a comment


Cannot create a single-kernel session; could you check once more? @rapsealk

rapsealk (Member, Author) left a comment


Missing mount breaks all session startups

entrypoint.sh unconditionally sources the new script:

. /opt/kernel/setup_dist_environ.sh

But setup_dist_environ.sh is not mounted into the container. The mount provisioner (src/ai/backend/agent/stage/kernel_lifecycle/docker/mount/krunner.py) explicitly lists every file that gets bind-mounted to /opt/kernel/, and this file is missing from _prepare_default_mounts().

In POSIX sh, sourcing a non-existent file with . is a fatal error — the shell exits immediately. This means every session startup crashes, not just single-kernel ones.

Fix

  1. Add the mount in _prepare_default_mounts():
self._parse_mount("runner/setup_dist_environ.sh", "/opt/kernel/setup_dist_environ.sh"),
  2. Guard the source in entrypoint.sh (both root and non-root paths) for robustness against older agent versions:
if [ -f /opt/kernel/setup_dist_environ.sh ]; then
  . /opt/kernel/setup_dist_environ.sh
fi

Minor: stale description & changelog

The PR description table and changes/10726.feature.md still list WORLD_SIZE, RANK, LOCAL_RANK as derived variables, but commit 14320d7 intentionally removed those. These should be updated to reflect the current behavior (only MASTER_ADDR, MASTER_PORT, and TF_CONFIG).

The new distributed training script was sourced unconditionally in
entrypoint.sh but never mounted into the container, causing all
session startups to fail. Add the missing bind-mount in the krunner
mount provisioner and guard both source calls with a file-existence
check for robustness against older agent versions.

Also update the changelog to reflect that WORLD_SIZE, RANK, and
LOCAL_RANK are no longer set (removed in 14320d7).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added size:L 100~500 LoC comp:agent Related to Agent component and removed size:M 30~100 LoC labels Apr 2, 2026
return [
self._parse_mount("runner/extract_dotfiles.py", "/opt/kernel/extract_dotfiles.py"),
self._parse_mount("runner/entrypoint.sh", "/opt/kernel/entrypoint.sh"),
self._parse_mount("runner/setup_dist_environ.sh", "/opt/kernel/setup_dist_environ.sh"),
fregataa (Member) commented Apr 2, 2026


Please update src/ai/backend/agent/agent.py mount_krunner() method

- Add setup_dist_environ.sh mount to agent.py mount_krunner() (fregataa)
- Add setup_dist_environ.sh to Kubernetes copy_runner_files()
- Add warning message when setup_dist_environ.sh is not found
- Eliminate leaked internal variables by using underscore-prefixed names
  and unsetting them after use
- Save/restore IFS instead of unsetting it to preserve caller state
- Use inline expressions instead of intermediate variables for
  MASTER_ADDR and MASTER_PORT to avoid polluting the environment
- Add fallback for BACKENDAI_CLUSTER_LOCAL_RANK in TF_CONFIG JSON

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
rapsealk (Member, Author) commented Apr 2, 2026

Changes since last review

Addressed all feedback from @fregataa, Copilot, and my own production-readiness review:

Fixes applied

  • @fregataa: missing mount in agent.py mount_krunner() → added setup_dist_environ.sh bind-mount (agent.py:672-675)
  • Review: Kubernetes copy_runner_files missing the script → added to target_files list (kubernetes/kernel.py:534)
  • Review: internal variables (BACKENDAI_DIST_MASTER_ADDR, TF_WORKER_*, etc.) leaked into container environment → eliminated intermediate vars for MASTER_ADDR/MASTER_PORT; underscore-prefixed + unset for TF_CONFIG temps
  • Review: BACKENDAI_CLUSTER_LOCAL_RANK fallback in TF_CONFIG → ${BACKENDAI_CLUSTER_LOCAL_RANK:-0} prevents malformed JSON
  • Copilot: IFS not restored after mutation → save/restore with _OLD_IFS instead of unset IFS
  • Copilot: no warning when script file not found → added else branch with warning message in both entrypoint paths
  • Review: PR description listed stale vars (WORLD_SIZE, RANK, LOCAL_RANK) → updated description and changelog to match actual behavior
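The ${BACKENDAI_CLUSTER_LOCAL_RANK:-0} fallback mentioned above can be illustrated with a short sketch; the variable name comes from the PR, but the surrounding snippet is illustrative only.

```shell
# An unset rank must still yield valid JSON for TF_CONFIG's task index:
# the :-0 parameter expansion substitutes 0 when the variable is unset or empty.
unset BACKENDAI_CLUSTER_LOCAL_RANK
task_json="{\"type\": \"worker\", \"index\": ${BACKENDAI_CLUSTER_LOCAL_RANK:-0}}"
echo "$task_json"
```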

Mount provisioners updated (all three)

The script is now registered in all container provisioning paths:

  • agent.py mount_krunner() — legacy Docker
  • krunner.py _prepare_default_mounts() — stage-based Docker
  • kubernetes/kernel.py copy_runner_files() — Kubernetes

All lint, type checks, and pre-commit hooks pass.


Labels

area:docs Documentations comp:agent Related to Agent component size:L 100~500 LoC

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add default environment variables for PyTorch/TensorFlow distributed training

3 participants