feat(BA-5670): Add optional dedicated SFTP agent in TUI dev installer#10971
feat(BA-5670): Add optional dedicated SFTP agent in TUI dev installer#10971
Conversation
There was a problem hiding this comment.
Pull request overview
Adds an opt-in --with-sftp-agent path to the dev-mode TUI installer to provision a second, dedicated Backend.AI agent for SFTP uploads (via a separate upload scaling group) alongside the regular compute agent, plus supporting UI and ./dev helper updates.
Changes:
- Extend installer args/config types and CLI with
--with-sftp-agent. - Add dev installer steps to generate
agent-sftp.toml, register theuploadscaling group fixture, and (attempt to) configure storage routing for SFTP sessions. - Update InstallReport UI and
./devscript to expose and manage the optionalsftp-agentprocess separately from the primary agent.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| src/ai/backend/install/types.py | Adds with_sftp_agent and SFTP-agent-related fields to installer config models. |
| src/ai/backend/install/context.py | Implements SFTP agent config generation, scaling-group fixture population, and etcd config tweak for SFTP routing. |
| src/ai/backend/install/cli.py | Adds the --with-sftp-agent flag to the installer CLI. |
| src/ai/backend/install/app.py | Adds an InstallReport tab with instructions and addresses for the optional SFTP agent. |
| dev | Adds sftp-agent service handling and disambiguates process matching/killing between the two agents. |
| changes/BA-5670.feature.md | Towncrier fragment documenting the new dev installer option. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # When a dedicated SFTP agent is configured, point the storage proxy's | ||
| # sftp_scaling_groups at that agent's scaling group so that SFTP | ||
| # upload sessions get routed through it. | ||
| if service.sftp_agent_enabled: | ||
| data["volumes"]["sftp_scaling_groups"] = service.sftp_agent_scaling_group | ||
| await self.etcd_put_json("", data) |
There was a problem hiding this comment.
sftp_scaling_groups is being written to volumes/sftp_scaling_groups, but the manager config schema and CLI tooling expect this setting under each proxy (e.g. volumes/proxies/<proxy>/sftp_scaling_groups). With the current key, storage-proxy routing for SFTP sessions likely won’t see the configured scaling group.
| # Copy the bundled agent.toml to a distinct destination for the SFTP | ||
| # agent without clobbering the primary agent.toml that configure_agent | ||
| # has already written. | ||
| with self.resource_path("ai.backend.install.configs", "agent.toml") as src_path: | ||
| dst_path = Path.cwd() / "agent-sftp.toml" | ||
| shutil.copy(src_path, dst_path) | ||
|
|
There was a problem hiding this comment.
configure_sftp_agent() copies the bundled agent.toml template instead of cloning the already-generated ./agent.toml from configure_agent(). This can easily drift from the primary agent config (e.g., mount-path/etcd settings/plugins) and undermines the stated goal of sharing settings between the two agents; consider copying Path.cwd()/agent.toml and then applying the collision overrides.
| # [agent] mount-path — share the same vfolder root as the | ||
| # primary agent so that uploaded files are visible. | ||
| ( | ||
| re.compile(r"^mount-path = .*", flags=re.MULTILINE), | ||
| f'mount-path = "{self.install_info.base_path / service.vfolder_relpath}"', | ||
| ), |
There was a problem hiding this comment.
The SFTP agent’s mount-path is set to {base_path}/{vfolder_relpath}, but the primary agent config path is not derived from the same logic in this installer (its substitutions target mount_path, not mount-path). This makes it very likely that the two agents won’t actually share the same host filesystem path, so uploads may not be visible to compute sessions.
| _kill_matching() { | ||
| local signal=$1 | ||
| if [ "$svc" = "agent" ]; then | ||
| pgrep -f "backend.ai: agent" 2>/dev/null | while read -r pid; do | ||
| if ! ps -p "$pid" -o command= 2>/dev/null | grep -q "agent-sftp.toml"; then | ||
| kill "-${signal}" "$pid" 2>/dev/null || true | ||
| fi | ||
| done | ||
| else | ||
| pkill "-${signal}" -f "$(_service_process "$svc")" 2>/dev/null || true | ||
| fi |
There was a problem hiding this comment.
With set -euo pipefail, the pgrep | while ... pipeline in _kill_matching() (agent branch) can cause the whole script to exit if the agent process terminates between _is_running and _kill_matching (pgrep returns 1 under pipefail). Guard the pipeline (e.g., || true) or avoid pipefail-sensitive pipelines when stopping services.
| [ "$svc" = "all" ] && return 0 | ||
| _service_cmd "$svc" > /dev/null 2>&1 || { | ||
| echo "$(_color red "Unknown service: $svc")" | ||
| echo "Valid services: $ALL_SERVICES all" | ||
| echo "Valid services: $ALL_SERVICES $OPTIONAL_SERVICES all" | ||
| exit 1 |
There was a problem hiding this comment.
sftp-agent is accepted as a valid service, but there’s no explicit check that ./agent-sftp.toml exists before attempting ./dev start/restart sftp-agent. Since cmd_status already hides the optional service unless the config exists, consider failing fast in _validate_service (or start path) with a clear message when the config is missing.
Adds a new --with-sftp-agent option to the dev-mode TUI installer that
provisions a dedicated SFTP agent process alongside the regular compute
agent, using Backend.AI's multi-agent-per-node feature.
Behavior:
- CliArgs/InstallVariable gain 'with_sftp_agent'. ServiceConfig gains
sftp_agent_enabled + distinct rpc/watcher addresses, ipc/var paths,
and scaling group name ('upload').
- DevContext.configure_sftp_agent() copies agent.toml to agent-sftp.toml
and rewrites (via regex-scoped sed_in_place_multi) the rpc-listen-addr,
agent-sock-port, pid-file, scaling-group, id, ipc/var paths, mount-path,
container port-range (31100-31200 to avoid colliding with the primary
agent's 30000-31000), watcher service-addr, and pyroscope app-name so
that both agents can coexist on the same node without conflict.
- DevContext.configure_sftp_agent_fixture() registers an 'upload' scaling
group associated with the default domain, wired to the same appproxy
coordinator as the primary scaling group.
- The etcd 'volumes' payload gains a conditional 'sftp_scaling_groups:
upload' entry so that storage-proxy routes SFTP upload sessions to the
dedicated agent.
- DevContext.hydrate_install_info populates the new SFTP ServiceConfig
fields (defaulting sftp_agent_enabled to the with_sftp_agent flag).
- DevContext.configure() invokes configure_sftp_agent /
configure_sftp_agent_fixture only when with_sftp_agent is True; the
default install flow is unchanged.
Lifecycle management via ./dev:
- New 'sftp-agent' service with command
'./backend.ai ag start-server -f agent-sftp.toml --debug'.
- Since the primary and SFTP agent share the same 'backend.ai: agent'
proctitle, pgrep matching is done against the unique 'agent-sftp.toml'
cmdline fragment for sftp-agent, and the primary 'agent' pgrep excludes
processes whose cmdline contains agent-sftp.toml. This keeps
'./dev stop agent' from accidentally killing the SFTP agent (and vice
versa).
- sftp-agent is an OPTIONAL service: './dev start all' and './dev status'
intentionally do NOT start/list it unless agent-sftp.toml is present.
It must always be started explicitly via './dev start sftp-agent'.
InstallReport post-install guide gains an SFTP Agent tab when
with_sftp_agent is True, showing the start command and the rpc/watcher
addresses.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…e tracking Testing BA-5670 end-to-end surfaced several issues with the initial implementation; this commit addresses them all. Rather than mutating ``agent.toml`` with a long list of regex sed replacements, ship a dedicated ``configs/agent/halfstack-sftp.toml`` sample (same pattern as ``halfstack.toml`` and ``app-proxy-*``). The sample bakes in every SFTP-specific value: * distinct rpc/watcher/metric/agent-sock/aiomonitor/metadata-server ports so two agent processes can coexist on one node * ``id = "i-local-sftp"`` so the manager treats the SFTP agent as a separate heartbeat source from the primary agent * ``scaling-group = "upload"`` * ``pid-file = "./agent-sftp.pid"`` / dedicated ipc/var base paths * ``allow-compute-plugins = []`` — critical: without this the SFTP agent would attempt to import every discovered accelerator plugin (e.g. Tenstorrent's ``tt_tools_common``) and crash in dev envs * differentiated pyroscope ``app-name`` * non-overlapping ``[container] port-range`` ``DevContext.configure_sftp_agent`` now just copies the sample via ``self.copy_config("agent-sftp.toml")`` (a symlink into ``configs/agent/halfstack-sftp.toml`` using the same relative pattern as the other service templates) and only rewrites the two values that genuinely depend on the install environment: the etcd addr port and the ``mount-path`` absolute path. No more ad-hoc regex list to maintain. setproctitle overwrites the argv of agent workers to literally ``backend.ai: agent``, which means the previous approach of disambiguating primary vs SFTP agent via ``pgrep -f "agent-sftp.toml"`` does not work — neither the parent nor the worker's cmdline actually contains that string. The only reliable signal is the pid-file that each agent writes on its own (``./agent.pid`` / ``./agent-sftp.pid``). Reworked ``_is_running`` / ``_stop_service`` / ``_status_services`` to dispatch on service name: * ``agent`` / ``sftp-agent`` → read pid from pid-file, ``kill -0`` to test liveness, ``kill`` to stop. On stop we also rm the pid-file so subsequent ``status`` checks don't report stale running state. * everything else → original ``pgrep -f <proctitle>`` path. This is much simpler than the earlier cmdline-filter scheme and actually survives setproctitle. With these changes and a full clean reinstall (halfstack volumes wiped, fresh ``./backend.ai install --with-sftp-agent``), the following are verified: * etcd ``/sorna/local/volumes/sftp_scaling_groups`` = ``upload`` * ``scaling_groups`` DB table has both ``default`` and ``upload`` with ``sgroups_for_domains`` rows for the default domain * ``agents`` DB table shows two ALIVE rows: - ``i-MacBook-Pro.local`` / ``default`` / ``tcp://127.0.0.1:6011`` - ``i-local-sftp`` / ``upload`` / ``tcp://127.0.0.1:6013`` * ``./dev start sftp-agent`` brings up the dedicated agent * ``./dev stop agent`` only kills the primary; SFTP agent stays up * ``./dev stop sftp-agent`` only kills the SFTP agent; primary stays up * ``./dev start all`` does NOT auto-start sftp-agent (optional) * ``./dev status`` correctly distinguishes the two via pid-files Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix sftp_scaling_groups etcd path: move from volumes/ to volumes/proxies/local/ per manager config schema - Clone generated agent.toml instead of bundled template so environment-specific settings (etcd, mount-path, plugins) are automatically inherited from primary agent - Add scaling_group, agent_sock_port, sftp_agent_sock_port, sftp_agent_ipc_base_path fields to ServiceConfig - Remove unnecessary f-string without placeholders - Add agent-sftp.toml existence check in ./dev _validate_service - Fix pgrep pipeline already resolved via pidfile-based approach Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When configure_sftp_agent was changed from copying a pre-authored template to cloning the generated agent.toml, several implicit port defaults became collisions: - service-addr port 6003 (metric API) → 6014 - metadata-server-port 40128 → 40129 - aiomonitor-termui-port 38200 → 38201 - aiomonitor-webui-port 39200 → 39201 Also fix allow-compute-plugins to use empty list [] instead of comment, since commented-out value causes all plugins to load (including unavailable accelerator plugins like tenstorrent). Tested: both agents start and register with correct scaling groups. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1339634 to
4e7b5eb
Compare
Summary
Adds
--with-sftp-agentto the dev-mode TUI installer, which provisions a dedicated SFTP agent alongside the regular compute agent using Backend.AI's multi-agent-per-node feature. The SFTP agent runs under a dedicateduploadscaling group so the storage proxy can route SFTP upload sessions to it while regular compute sessions continue to run on the primary agent.The multi-agent-per-node feature is new — this PR is the first installer-level consumer of it, so the test plan below is intentionally thorough.
How it works
CliArgs/InstallVariablegainwith_sftp_agent.ServiceConfiggainssftp_agent_enabled,sftp_agent_rpc_addr(6013),sftp_agent_watcher_addr(6015),sftp_agent_ipc_base_path,sftp_agent_var_base_path,sftp_agent_scaling_group = "upload". Defaults are chosen so that nothing collides with the primary agent's 6001/6009/6007/./agent.pid/30000-31000.agent.tomltemplate toagent-sftp.tomland uses scoped regex replacements (viased_in_place_multi) to override only the fields that would collide:rpc-listen-addr,agent-sock-port,pid-file,scaling-group,id,ipc-base-path,var-base-path,mount-path,port-range, watcherservice-addr, and pyroscopeapp-name. Themount-pathis shared with the primary agent so that uploaded files are visible to compute sessions.uploadscaling group and itssgroups_for_domainsassociation viabackend.ai mgr fixture populateusing an in-memory fixture, wired to the same appproxy coordinator asdefault.with_sftp_agent=True, thevolumespayload gains"sftp_scaling_groups": "upload"so storage-proxy knows to route SFTP sessions there../devscript — gains asftp-agentservice (./backend.ai ag start-server -f agent-sftp.toml --debug). Because the primary and SFTP agents share the samebackend.ai: agentproctitle, process matching is disambiguated by the presence/absence ofagent-sftp.tomlin the cmdline, so./dev stop agentnever kills the SFTP agent and vice versa.sftp-agentis optional — it is NOT started by./dev start alland only appears in./dev statuswhenagent-sftp.tomlexists.Edge cases addressed
agent-sock-port(default 6007) →6017for SFTPpid-file(default./agent.pid) →./agent-sftp.pid[container] port-range(default30000-31000) →31100-31200pyroscope.app-name→backendai-half-sftp-agentidset toi-local-sftpso manager distinguishes heartbeatspgreppatterns scoped toagent-sftp.tomlcmdline to avoid cross-kill between primary and SFTP agentsTest plan
The multi-agent-per-node feature is brand new, so this is intentionally exhaustive:
./backendai-install install --with-sftp-agentproduces bothagent.tomlandagent-sftp.toml./dev start allbrings up the primary agent only;./dev statusdoes NOT listsftp-agentyet because the config existsagent-sftp.tomlexists,./dev statusDOES listsftp-agent(but still stopped) as the current code shows it conditionally./dev start sftp-agentbrings up the SFTP agent./bai admin agent searchshows 2 agents, one indefaultand one inupload./bai admin etcd get volumes/sftp_scaling_groupsreturnsupload./bai admin scaling-group searchshows bothdefaultandupload./bai session enqueue --scaling-group default ...runs on the primary agent./dev stop agentkills ONLY the primary agent; SFTP agent remains running./dev stop sftp-agentkills ONLY the SFTP agent; primary agent remains running./dev restart sftp-agentis idempotent across multiple invocations--with-sftp-agent, the installer behaves exactly as before (noagent-sftp.toml, nosftp_scaling_groupsetcd key, nouploadscaling group)pants lint/checkon changed files (verified locally)Non-goals
example-users.json.j2already handles the upload scaling group for production🤖 Generated with Claude Code