fix(runner): prevent ghost sessions from orphaned spawn webhooks#440
fix(runner): prevent ghost sessions from orphaned spawn webhooks#440BigKunLun wants to merge 1 commit intotiann:mainfrom
Conversation
spawnSession() registers a tracked session entry before the child's "Session started" webhook arrives. When the 15s webhook timeout fires, only pidToAwaiter / pidToErrorAwaiter are cleared; the pidToTrackedSession entry is left in place and the detached child keeps running. If the child eventually starts and reports its webhook, onHappySessionWebhook() still finds the stale tracking entry and promotes the orphan into a normal runner-managed session, surfacing a message-less "ghost session" in the web UI. This is easy to reproduce on opus[1m] --resume: observed real-world case where four rapid "resume" clicks produced four orphan children that all reported their webhooks ~60 minutes later, creating four ghost sessions on the dashboard. Three-part fix: 1. Make the webhook timeout configurable via HAPI_RUNNER_WEBHOOK_TIMEOUT_MS (default unchanged at 15_000). Users on slow models / large resumes can raise the ceiling so the timeout never fires in the first place. 2. On timeout, also delete the pidToTrackedSession entry and SIGTERM the child, so a late webhook cannot promote the orphan. 3. Defence in depth: if onHappySessionWebhook() receives a webhook from a PID that is not tracked but whose payload claims startedBy: 'runner', ignore it and SIGTERM the child. Genuine terminal-launched children correctly report startedBy: 'terminal', so this branch cannot false-positive on them.
There was a problem hiding this comment.
Findings
- [Major] Timeout/orphan cleanup only signals the wrapper process, not the session process tree — both new branches use a direct
SIGTERM(process.kill(pid, 'SIGTERM')/happyProcess?.kill('SIGTERM')), but runner-owned sessions already usekillProcessByChildProcess()because the actual agent subprocesses can outlive the top-level HAPI child. That means a timed-out/orphaned session can keep running after the runner has already reported failure and dropped tracking, which still leaves stray agent processes mutating the worktree. Evidencecli/src/runner/run.ts:208,cli/src/runner/run.ts:567; related existing handlingcli/src/runner/run.ts:644,cli/src/utils/process.ts:77.
Suggested fix:const orphan = pidToTrackedSession.get(pid); if (orphan?.childProcess) { void killProcessByChildProcess(orphan.childProcess); }
- [Major] Timed-out worktree spawns can now leak their temporary worktree — the new timeout branch resolves immediately after sending
SIGTERM, butmaybeCleanupWorktree('spawn-error')only removes the worktree if the child is already gone. In practice the child is usually still alive in that same tick, so cleanup is skipped andonChildExited()never retries it. A slowworktreesession that times out will now leave the created worktree behind. Evidencecli/src/runner/run.ts:567,cli/src/runner/run.ts:575; related cleanup gatecli/src/runner/run.ts:329and exit pathcli/src/runner/run.ts:670.
Suggested fix:let cleanupWorktreeAfterTimeout = false; const timeout = setTimeout(() => { cleanupWorktreeAfterTimeout = true; void killProcessByChildProcess(happyProcess!); resolve({ type: 'error', errorMessage: buildWebhookFailureMessage('timeout') }); }, webhookTimeoutMs); happyProcess.on('exit', () => { if (cleanupWorktreeAfterTimeout) { void cleanupWorktree(); } });
Summary
Review mode: initial
Two correctness regressions in the new timeout/orphan cleanup path: the new termination logic bypasses the existing tree-kill helper, and the timeout path can skip worktree cleanup permanently after returning an error.
Testing
Not run (automation/security policy). Suggested coverage: a fake-timer runner test that times out a worktree spawn and asserts both the child process tree and the created worktree are cleaned up.
HAPI Bot
| `[RUNNER RUN] Ignoring late webhook from orphaned runner-spawned PID ${pid} (session ${sessionId}). Terminating child.` | ||
| ); | ||
| try { | ||
| process.kill(pid, 'SIGTERM'); |
There was a problem hiding this comment.
[MAJOR] This only signals the top-level HAPI child. Runner-owned sessions already use killProcessByChildProcess() elsewhere because Claude/Codex/Cursor descendants can outlive the wrapper, so this branch can still leave the real agent subtree running after tracking has been dropped. Suggested fix:
const orphan = pidToTrackedSession.get(pid);
if (orphan?.childProcess) {
void killProcessByChildProcess(orphan.childProcess);
}| // to clean up; even if the signal is ignored we already removed | ||
| // their tracking entry above, so they cannot be promoted. | ||
| try { | ||
| happyProcess?.kill('SIGTERM'); |
There was a problem hiding this comment.
[MAJOR] Same tree-kill issue here, and this now races worktree cleanup: maybeCleanupWorktree(spawn-error) only removes the worktree if the PID is already dead, but this branch resolves immediately after sending SIGTERM, so timed-out worktree sessions will usually skip cleanup and never retry it. Suggested fix: kill the full process tree here and trigger cleanupWorktree() after the child has actually exited.
Summary
spawnSession()incli/src/runner/run.tscurrently registers a tracked session entry before the child'sSession startedwebhook arrives. If the 15-second webhook timeout fires, onlypidToAwaiter/pidToErrorAwaiterare cleaned; thepidToTrackedSessionentry is left in place and the detached child keeps running. When the child eventually starts and reports its webhook,onHappySessionWebhook()finds the stale tracking entry and silently promotes the orphan into a normal runner-managed session — a message-less "ghost session" shows up on the web UI, and the user has no idea where it came from.Reproduction
Seen repeatedly on
opus[1m] --resumeon a macOS machine running the current Homebrew build (0.16.6). Concretely:POST /api/sessions/:id/resume 500after ~15s.In one user's dashboard this produced four ghost
Obsidian1210sessions from a single troubled session, all rooted at the same--resume <sessionId>.Relevant log excerpt:
Note the ~60-minute gap between
Waiting for session webhookand the children actually reporting in — far beyond the current hard-coded 15s ceiling.Fix
Three small, composable changes in
cli/src/runner/run.ts:Make the webhook timeout configurable via
HAPI_RUNNER_WEBHOOK_TIMEOUT_MS(default unchanged at15_000). Users on slow models / large resumes can raise the ceiling so the timeout never fires in the first place. Zero behaviour change for existing users.On timeout, clean
pidToTrackedSessionandSIGTERMthe child. Without this, the orphan's late webhook can still be silently promoted into a live session byonHappySessionWebhook().Defence in depth in
onHappySessionWebhook(): if a webhook arrives for a PID that has no tracked session but its payload claimsstartedBy: 'runner', treat it as an orphan — log it,SIGTERMthe child, and return without registering the session. Genuine terminal-launched children correctly reportstartedBy: 'terminal'(seecommands/claude.ts,codex/loop.ts,opencode/runOpencode.ts), so this branch cannot false-positive on them.The original hard-coded timeout even carried a
// I have seen timeouts on 10 seconds even though session was still created successfully in ~2 more secondscomment — the race it describes is the same race exploited in (2) and (3).Out of scope / follow-ups
runner.integration.test.tscovers most of this flow but exercising the 60-min late-webhook window requires wall-clock mocking. Happy to add a regression test if reviewers prefer — just say the word and I'll wire one in withvi.useFakeTimers().cli/src/runner/README.mdordocs/; let me know if you'd likeHAPI_RUNNER_WEBHOOK_TIMEOUT_MSdocumented there.Verification
Manually reviewed against:
spawnSession()(spawn → awaiter Promise)onHappySessionWebhook()(webhook dispatch)stopSession()/onChildExited()(cleanup paths — no new leaks)controlServer.ts(webhook payload shape —startedByis always present)Built locally is not currently possible on the submitter's machine (
bunnot installed), but each change is small enough to be reviewed by eye and the CI will catch any typecheck slip. I can provide a local smoke-test log from a Homebrew-replaced binary after merge if useful.Why this matters
Until the fix ships, every slow
--resumeon a slow-starting model produces a compounding pile of ghost sessions that the user can only clean up manually viaDELETE /api/sessions/:id(and only after firstPOST .../archiveto clear theactiveflag). The underlying bug has been latent sincedetached: truewas introduced; it only becomes visible when the model's boot time crosses the 15s ceiling, whichopus[1m]does routinely.Thanks for maintaining HAPI 🙏