Skip to content

feat(BA-5670): Add optional dedicated SFTP agent in TUI dev installer#10971

Open
Yaminyam wants to merge 5 commits intomainfrom
feat/BA-5670-sftp-agent-installer
Open

feat(BA-5670): Add optional dedicated SFTP agent in TUI dev installer#10971
Yaminyam wants to merge 5 commits intomainfrom
feat/BA-5670-sftp-agent-installer

Conversation

@Yaminyam
Copy link
Copy Markdown
Member

Summary

Adds --with-sftp-agent to the dev-mode TUI installer, which provisions a dedicated SFTP agent alongside the regular compute agent using Backend.AI's multi-agent-per-node feature. The SFTP agent runs under a dedicated upload scaling group so the storage proxy can route SFTP upload sessions to it while regular compute sessions continue to run on the primary agent.

The multi-agent-per-node feature is new — this PR is the first installer-level consumer of it, so the test plan below is intentionally thorough.

How it works

  • TypesCliArgs/InstallVariable gain with_sftp_agent. ServiceConfig gains sftp_agent_enabled, sftp_agent_rpc_addr (6013), sftp_agent_watcher_addr (6015), sftp_agent_ipc_base_path, sftp_agent_var_base_path, sftp_agent_scaling_group = "upload". Defaults are chosen so that nothing collides with the primary agent's 6001/6009/6007/./agent.pid/30000-31000.
  • DevContext.configure_sftp_agent — copies the bundled agent.toml template to agent-sftp.toml and uses scoped regex replacements (via sed_in_place_multi) to override only the fields that would collide: rpc-listen-addr, agent-sock-port, pid-file, scaling-group, id, ipc-base-path, var-base-path, mount-path, port-range, watcher service-addr, and pyroscope app-name. The mount-path is shared with the primary agent so that uploaded files are visible to compute sessions.
  • DevContext.configure_sftp_agent_fixture — registers the upload scaling group and its sgroups_for_domains association via backend.ai mgr fixture populate using an in-memory fixture, wired to the same appproxy coordinator as default.
  • etcd volumes — when with_sftp_agent=True, the volumes payload gains "sftp_scaling_groups": "upload" so storage-proxy knows to route SFTP sessions there.
  • InstallReport UI — gains an SFTP Agent tab with the start command and rpc/watcher addresses when the option is enabled.
  • ./dev script — gains a sftp-agent service (./backend.ai ag start-server -f agent-sftp.toml --debug). Because the primary and SFTP agents share the same backend.ai: agent proctitle, process matching is disambiguated by the presence/absence of agent-sftp.toml in the cmdline, so ./dev stop agent never kills the SFTP agent and vice versa. sftp-agent is optional — it is NOT started by ./dev start all and only appears in ./dev status when agent-sftp.toml exists.

Edge cases addressed

  • agent-sock-port (default 6007) → 6017 for SFTP
  • pid-file (default ./agent.pid) → ./agent-sftp.pid
  • [container] port-range (default 30000-31000) → 31100-31200
  • pyroscope.app-namebackendai-half-sftp-agent
  • Agent id set to i-local-sftp so manager distinguishes heartbeats
  • pgrep patterns scoped to agent-sftp.toml cmdline to avoid cross-kill between primary and SFTP agents

Test plan

The multi-agent-per-node feature is brand new, so this is intentionally exhaustive:

  • Fresh install: ./backendai-install install --with-sftp-agent produces both agent.toml and agent-sftp.toml
  • ./dev start all brings up the primary agent only; ./dev status does NOT list sftp-agent yet because the config exists
  • Actually — verify that once agent-sftp.toml exists, ./dev status DOES list sftp-agent (but still stopped) as the current code shows it conditionally
  • ./dev start sftp-agent brings up the SFTP agent
  • Both agents register: ./bai admin agent search shows 2 agents, one in default and one in upload
  • ./bai admin etcd get volumes/sftp_scaling_groups returns upload
  • ./bai admin scaling-group search shows both default and upload
  • Compute session: ./bai session enqueue --scaling-group default ... runs on the primary agent
  • SFTP upload session via the web UI routes to the SFTP agent (check agent logs)
  • ./dev stop agent kills ONLY the primary agent; SFTP agent remains running
  • ./dev stop sftp-agent kills ONLY the SFTP agent; primary agent remains running
  • ./dev restart sftp-agent is idempotent across multiple invocations
  • Without --with-sftp-agent, the installer behaves exactly as before (no agent-sftp.toml, no sftp_scaling_groups etcd key, no upload scaling group)
  • pants lint/check on changed files (verified locally)

Non-goals

  • No changes to halfstack
  • No changes to the production (pyinfra) installer path — example-users.json.j2 already handles the upload scaling group for production
  • No CLI/SDK changes

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings April 10, 2026 08:46
@github-actions github-actions bot added the size:L 100~500 LoC label Apr 10, 2026
Yaminyam added a commit that referenced this pull request Apr 10, 2026
Comment thread src/ai/backend/install/context.py Dismissed
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an opt-in --with-sftp-agent path to the dev-mode TUI installer to provision a second, dedicated Backend.AI agent for SFTP uploads (via a separate upload scaling group) alongside the regular compute agent, plus supporting UI and ./dev helper updates.

Changes:

  • Extend installer args/config types and CLI with --with-sftp-agent.
  • Add dev installer steps to generate agent-sftp.toml, register the upload scaling group fixture, and (attempt to) configure storage routing for SFTP sessions.
  • Update InstallReport UI and ./dev script to expose and manage the optional sftp-agent process separately from the primary agent.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
src/ai/backend/install/types.py Adds with_sftp_agent and SFTP-agent-related fields to installer config models.
src/ai/backend/install/context.py Implements SFTP agent config generation, scaling-group fixture population, and etcd config tweak for SFTP routing.
src/ai/backend/install/cli.py Adds the --with-sftp-agent flag to the installer CLI.
src/ai/backend/install/app.py Adds an InstallReport tab with instructions and addresses for the optional SFTP agent.
dev Adds sftp-agent service handling and disambiguates process matching/killing between the two agents.
changes/BA-5670.feature.md Towncrier fragment documenting the new dev installer option.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +544 to 549
# When a dedicated SFTP agent is configured, point the storage proxy's
# sftp_scaling_groups at that agent's scaling group so that SFTP
# upload sessions get routed through it.
if service.sftp_agent_enabled:
data["volumes"]["sftp_scaling_groups"] = service.sftp_agent_scaling_group
await self.etcd_put_json("", data)
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sftp_scaling_groups is being written to volumes/sftp_scaling_groups, but the manager config schema and CLI tooling expect this setting under each proxy (e.g. volumes/proxies/<proxy>/sftp_scaling_groups). With the current key, storage-proxy routing for SFTP sessions likely won’t see the configured scaling group.

Copilot uses AI. Check for mistakes.
Comment thread src/ai/backend/install/context.py Outdated
Comment on lines +658 to +664
# Copy the bundled agent.toml to a distinct destination for the SFTP
# agent without clobbering the primary agent.toml that configure_agent
# has already written.
with self.resource_path("ai.backend.install.configs", "agent.toml") as src_path:
dst_path = Path.cwd() / "agent-sftp.toml"
shutil.copy(src_path, dst_path)

Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

configure_sftp_agent() copies the bundled agent.toml template instead of cloning the already-generated ./agent.toml from configure_agent(). This can easily drift from the primary agent config (e.g., mount-path/etcd settings/plugins) and undermines the stated goal of sharing settings between the two agents; consider copying Path.cwd()/agent.toml and then applying the collision overrides.

Copilot uses AI. Check for mistakes.
Comment thread src/ai/backend/install/context.py Outdated
Comment on lines +719 to +724
# [agent] mount-path — share the same vfolder root as the
# primary agent so that uploaded files are visible.
(
re.compile(r"^mount-path = .*", flags=re.MULTILINE),
f'mount-path = "{self.install_info.base_path / service.vfolder_relpath}"',
),
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SFTP agent’s mount-path is set to {base_path}/{vfolder_relpath}, but the primary agent config path is not derived from the same logic in this installer (its substitutions target mount_path, not mount-path). This makes it very likely that the two agents won’t actually share the same host filesystem path, so uploads may not be visible to compute sessions.

Copilot uses AI. Check for mistakes.
Comment thread dev Outdated
Comment on lines +78 to +88
_kill_matching() {
local signal=$1
if [ "$svc" = "agent" ]; then
pgrep -f "backend.ai: agent" 2>/dev/null | while read -r pid; do
if ! ps -p "$pid" -o command= 2>/dev/null | grep -q "agent-sftp.toml"; then
kill "-${signal}" "$pid" 2>/dev/null || true
fi
done
else
pkill "-${signal}" -f "$(_service_process "$svc")" 2>/dev/null || true
fi
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With set -euo pipefail, the pgrep | while ... pipeline in _kill_matching() (agent branch) can cause the whole script to exit if the agent process terminates between _is_running and _kill_matching (pgrep returns 1 under pipefail). Guard the pipeline (e.g., || true) or avoid pipefail-sensitive pipelines when stopping services.

Copilot uses AI. Check for mistakes.
Comment thread dev
Comment on lines 244 to 248
[ "$svc" = "all" ] && return 0
_service_cmd "$svc" > /dev/null 2>&1 || {
echo "$(_color red "Unknown service: $svc")"
echo "Valid services: $ALL_SERVICES all"
echo "Valid services: $ALL_SERVICES $OPTIONAL_SERVICES all"
exit 1
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sftp-agent is accepted as a valid service, but there’s no explicit check that ./agent-sftp.toml exists before attempting ./dev start/restart sftp-agent. Since cmd_status already hides the optional service unless the config exists, consider failing fast in _validate_service (or start path) with a clear message when the config is missing.

Copilot uses AI. Check for mistakes.
@Yaminyam Yaminyam requested a review from a team April 14, 2026 01:09
Yaminyam and others added 5 commits April 19, 2026 14:29
Adds a new --with-sftp-agent option to the dev-mode TUI installer that
provisions a dedicated SFTP agent process alongside the regular compute
agent, using Backend.AI's multi-agent-per-node feature.

Behavior:
- CliArgs/InstallVariable gain 'with_sftp_agent'. ServiceConfig gains
  sftp_agent_enabled + distinct rpc/watcher addresses, ipc/var paths,
  and scaling group name ('upload').
- DevContext.configure_sftp_agent() copies agent.toml to agent-sftp.toml
  and rewrites (via regex-scoped sed_in_place_multi) the rpc-listen-addr,
  agent-sock-port, pid-file, scaling-group, id, ipc/var paths, mount-path,
  container port-range (31100-31200 to avoid colliding with the primary
  agent's 30000-31000), watcher service-addr, and pyroscope app-name so
  that both agents can coexist on the same node without conflict.
- DevContext.configure_sftp_agent_fixture() registers an 'upload' scaling
  group associated with the default domain, wired to the same appproxy
  coordinator as the primary scaling group.
- The etcd 'volumes' payload gains a conditional 'sftp_scaling_groups:
  upload' entry so that storage-proxy routes SFTP upload sessions to the
  dedicated agent.
- DevContext.hydrate_install_info populates the new SFTP ServiceConfig
  fields (defaulting sftp_agent_enabled to the with_sftp_agent flag).
- DevContext.configure() invokes configure_sftp_agent /
  configure_sftp_agent_fixture only when with_sftp_agent is True; the
  default install flow is unchanged.

Lifecycle management via ./dev:
- New 'sftp-agent' service with command
  './backend.ai ag start-server -f agent-sftp.toml --debug'.
- Since the primary and SFTP agent share the same 'backend.ai: agent'
  proctitle, pgrep matching is done against the unique 'agent-sftp.toml'
  cmdline fragment for sftp-agent, and the primary 'agent' pgrep excludes
  processes whose cmdline contains agent-sftp.toml. This keeps
  './dev stop agent' from accidentally killing the SFTP agent (and vice
  versa).
- sftp-agent is an OPTIONAL service: './dev start all' and './dev status'
  intentionally do NOT start/list it unless agent-sftp.toml is present.
  It must always be started explicitly via './dev start sftp-agent'.

InstallReport post-install guide gains an SFTP Agent tab when
with_sftp_agent is True, showing the start command and the rpc/watcher
addresses.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…e tracking

Testing BA-5670 end-to-end surfaced several issues with the initial
implementation; this commit addresses them all.

Rather than mutating ``agent.toml`` with a long list of regex sed
replacements, ship a dedicated ``configs/agent/halfstack-sftp.toml``
sample (same pattern as ``halfstack.toml`` and ``app-proxy-*``). The
sample bakes in every SFTP-specific value:

* distinct rpc/watcher/metric/agent-sock/aiomonitor/metadata-server
  ports so two agent processes can coexist on one node
* ``id = "i-local-sftp"`` so the manager treats the SFTP agent as a
  separate heartbeat source from the primary agent
* ``scaling-group = "upload"``
* ``pid-file = "./agent-sftp.pid"`` / dedicated ipc/var base paths
* ``allow-compute-plugins = []`` — critical: without this the SFTP
  agent would attempt to import every discovered accelerator plugin
  (e.g. Tenstorrent's ``tt_tools_common``) and crash in dev envs
* differentiated pyroscope ``app-name``
* non-overlapping ``[container] port-range``

``DevContext.configure_sftp_agent`` now just copies the sample via
``self.copy_config("agent-sftp.toml")`` (a symlink into
``configs/agent/halfstack-sftp.toml`` using the same relative pattern
as the other service templates) and only rewrites the two values that
genuinely depend on the install environment: the etcd addr port and
the ``mount-path`` absolute path. No more ad-hoc regex list to
maintain.

setproctitle overwrites the argv of agent workers to literally
``backend.ai: agent``, which means the previous approach of
disambiguating primary vs SFTP agent via ``pgrep -f "agent-sftp.toml"``
does not work — neither the parent nor the worker's cmdline actually
contains that string. The only reliable signal is the pid-file that
each agent writes on its own (``./agent.pid`` / ``./agent-sftp.pid``).

Reworked ``_is_running`` / ``_stop_service`` / ``_status_services`` to
dispatch on service name:

* ``agent`` / ``sftp-agent`` → read pid from pid-file, ``kill -0`` to
  test liveness, ``kill`` to stop. On stop we also rm the pid-file so
  subsequent ``status`` checks don't report stale running state.
* everything else → original ``pgrep -f <proctitle>`` path.

This is much simpler than the earlier cmdline-filter scheme and
actually survives setproctitle.

With these changes and a full clean reinstall (halfstack volumes
wiped, fresh ``./backend.ai install --with-sftp-agent``), the
following are verified:

* etcd ``/sorna/local/volumes/sftp_scaling_groups`` = ``upload``
* ``scaling_groups`` DB table has both ``default`` and ``upload``
  with ``sgroups_for_domains`` rows for the default domain
* ``agents`` DB table shows two ALIVE rows:
  - ``i-MacBook-Pro.local`` / ``default`` / ``tcp://127.0.0.1:6011``
  - ``i-local-sftp``        / ``upload``  / ``tcp://127.0.0.1:6013``
* ``./dev start sftp-agent`` brings up the dedicated agent
* ``./dev stop agent`` only kills the primary; SFTP agent stays up
* ``./dev stop sftp-agent`` only kills the SFTP agent; primary stays up
* ``./dev start all`` does NOT auto-start sftp-agent (optional)
* ``./dev status`` correctly distinguishes the two via pid-files

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix sftp_scaling_groups etcd path: move from volumes/ to
  volumes/proxies/local/ per manager config schema
- Clone generated agent.toml instead of bundled template so
  environment-specific settings (etcd, mount-path, plugins) are
  automatically inherited from primary agent
- Add scaling_group, agent_sock_port, sftp_agent_sock_port,
  sftp_agent_ipc_base_path fields to ServiceConfig
- Remove unnecessary f-string without placeholders
- Add agent-sftp.toml existence check in ./dev _validate_service
- Fix pgrep pipeline already resolved via pidfile-based approach

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When configure_sftp_agent was changed from copying a pre-authored
template to cloning the generated agent.toml, several implicit port
defaults became collisions:

- service-addr port 6003 (metric API) → 6014
- metadata-server-port 40128 → 40129
- aiomonitor-termui-port 38200 → 38201
- aiomonitor-webui-port 39200 → 39201

Also fix allow-compute-plugins to use empty list [] instead of
comment, since commented-out value causes all plugins to load
(including unavailable accelerator plugins like tenstorrent).

Tested: both agents start and register with correct scaling groups.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Yaminyam Yaminyam force-pushed the feat/BA-5670-sftp-agent-installer branch from 1339634 to 4e7b5eb Compare April 19, 2026 05:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:L 100~500 LoC

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants