fix: remove agent from running set on AgentSessionNotFound #2330
When a session resume fails with "No conversation found", the daemon emits AgentSessionNotFound. Previously this event cleared session_id but left the agent in the running set. On event replay (daemon restart), these agents were reconstructed as "running" and the daemon tried to interact with them in an infinite loop. Now AgentSessionNotFound also removes the agent from the running set, so stale agents are properly marked as stopped.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
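The replay behavior described above can be illustrated with a small, self-contained sketch. It is not the project's actual code: AgentIndex, AgentEvent, Agent, the AgentStarted variant, and the apply and replay methods are hypothetical names chosen for illustration. Only AgentSessionNotFound, AgentStopped, session_id, and the running set come from this PR.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical event type; only the variants relevant to this PR are modeled.
#[derive(Debug, Clone)]
pub enum AgentEvent {
    AgentStarted { agent_id: String, session_id: String },
    AgentSessionNotFound { agent_id: String },
    AgentStopped { agent_id: String },
}

/// Hypothetical per-agent record.
#[derive(Debug, Default, Clone)]
pub struct Agent {
    pub session_id: Option<String>,
}

/// Hypothetical index that is rebuilt from the event log on daemon restart.
#[derive(Debug, Default)]
pub struct AgentIndex {
    pub agents: HashMap<String, Agent>,
    pub running: HashSet<String>,
}

impl AgentIndex {
    /// Apply a single event to the index.
    pub fn apply(&mut self, event: &AgentEvent) {
        match event {
            AgentEvent::AgentStarted { agent_id, session_id } => {
                let agent = self.agents.entry(agent_id.clone()).or_default();
                agent.session_id = Some(session_id.clone());
                self.running.insert(agent_id.clone());
            }
            AgentEvent::AgentSessionNotFound { agent_id } => {
                if let Some(agent) = self.agents.get_mut(agent_id) {
                    agent.session_id = None;
                }
                // The fix: without this line, replay rebuilds the agent as
                // "running" even though its session no longer exists.
                self.running.remove(agent_id);
            }
            AgentEvent::AgentStopped { agent_id } => {
                self.running.remove(agent_id);
            }
        }
    }

    /// Rebuild the index from scratch by applying every logged event in order.
    pub fn replay(events: &[AgentEvent]) -> Self {
        let mut idx = Self::default();
        for event in events {
            idx.apply(event);
        }
        idx
    }

    /// True if the agent is currently in the running set.
    pub fn is_running(&self, agent_id: &str) -> bool {
        self.running.contains(agent_id)
    }
}
```

In this sketch, replaying an AgentStarted event followed by AgentSessionNotFound for the same agent leaves the agent with no session_id and outside the running set, so a daemon that consults is_running would not try to resume it again.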
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #2330 +/- ##
==========================================
- Coverage 63.70% 63.68% -0.02%
==========================================
Files 100 100
Lines 38582 38585 +3
==========================================
- Hits 24578 24574 -4
- Misses 14004 14011 +7
btucker
left a comment
Reviewed: fix/session-not-found-running
The fix is correct. AgentSessionNotFound was clearing session_id but leaving the agent in the running set, which meant on event replay the daemon would reconstruct these agents as "running" and loop trying to interact with a gone session. Adding self.running.remove(&agent_id) here matches the AgentStopped handler at line 116 — good symmetry.
One thing to tighten up:
Test gap — session_not_found_clears_session_id_on_replay (agents_tests.rs:280) only asserts that session_id is None after the event. It doesn't verify the agent was removed from the running set, which is the actual bug this PR fixes. Worth adding:
assert!(
    !idx.is_running("a1"), // or !idx.running.contains("a1")
    "AgentSessionNotFound must remove agent from running set"
);
Without that, the test would still pass even if the running.remove line were reverted.
Otherwise LGTM — the fix is correct, scoped, and matches the existing pattern.
🌃 Co-built with Midtown
btucker
left a comment
Reviewed — the fix is correct and directly addresses the crash loop. Approve.
Two optional follow-ups (not blocking):
- Test gap: session_not_found_clears_session_id_on_replay only asserts session_id is cleared. Worth adding an assertion that the agent is removed from the running set, since that's the actual bug this fixes.
- Minor inconsistency with AgentStopped: AgentStopped also clears pid and sets stopped_at. AgentSessionNotFound doesn't. The agent is effectively stopped but retains a stale pid and has no stopped_at timestamp. Probably fine in practice, but could show misleading info in status output.
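One way to avoid the inconsistency is to route both handlers through a shared helper. The sketch below assumes this shape and is not taken from the repository: mark_stopped, on_agent_stopped, on_session_not_found, and the struct layouts are hypothetical. It uses std::time::SystemTime passed in by the caller, rather than the chrono Utc::now() call mentioned in the reviews, so that the example needs only the standard library.

```rust
use std::collections::{HashMap, HashSet};
use std::time::SystemTime;

/// Hypothetical agent record with the fields named in the review.
#[derive(Debug, Default, Clone)]
pub struct Agent {
    pub session_id: Option<String>,
    pub pid: Option<u32>,
    pub stopped_at: Option<SystemTime>,
}

/// Hypothetical index holding agent records and the running set.
#[derive(Debug, Default)]
pub struct AgentIndex {
    pub agents: HashMap<String, Agent>,
    pub running: HashSet<String>,
}

impl AgentIndex {
    /// Shared "this agent is no longer running" bookkeeping (hypothetical helper).
    fn mark_stopped(&mut self, agent_id: &str, at: SystemTime) {
        self.running.remove(agent_id);
        if let Some(agent) = self.agents.get_mut(agent_id) {
            agent.pid = None;
            agent.stopped_at = Some(at);
        }
    }

    /// Handler for AgentStopped: only the shared bookkeeping is needed.
    pub fn on_agent_stopped(&mut self, agent_id: &str, at: SystemTime) {
        self.mark_stopped(agent_id, at);
    }

    /// Handler for AgentSessionNotFound: clear the dead session, then
    /// perform the same bookkeeping as AgentStopped.
    pub fn on_session_not_found(&mut self, agent_id: &str, at: SystemTime) {
        if let Some(agent) = self.agents.get_mut(agent_id) {
            agent.session_id = None;
        }
        self.mark_stopped(agent_id, at);
    }
}
```

With a helper like this, any field added later to the stopped state only has to be updated in one place, and both events leave the Agent record in the same shape.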
🌃 Co-built with Midtown
Review
Fix is correct. The one-line addition
Nit (non-blocking): The existing test
LGTM: nice catch on the root cause from the 4 stale btucker-reviewer records.
🌃 Co-built with Midtown
btucker
left a comment
Verdict: Approve with minor suggestions
The fix is correct and well-targeted. AgentSessionNotFound clearing session_id without removing from running is a clear bug — on event replay, the agent reconstructs as "running" with no session, causing the infinite resume loop described. Adding self.running.remove(&agent_id) breaks that cycle.
Two minor suggestions:
- Consider also clearing pid and setting stopped_at: AgentStopped does both (agent.pid = None; agent.stopped_at = Some(Utc::now())). Without these, an agent removed from running via AgentSessionNotFound will still have a stale pid and no stopped_at timestamp. This shouldn't cause functional issues since decision functions check running.contains(), but it leaves the Agent record in an inconsistent state that could confuse debugging or future code that inspects pid/stopped_at directly.
- The existing test should assert on running state: session_not_found_clears_session_id_on_replay doesn't verify the agent was removed from the running set, which is the actual bug this PR fixes. Adding assert!(!idx.running.contains("a1")) would pin the new behavior.
Neither blocks merge — the crash loop fix is the important part and it's correct.
🌃 Co-built with Midtown
Summary
When a session resume fails with "No conversation found", AgentSessionNotFound cleared session_id but left the agent in the running set. On daemon restart (event replay), these agents reconstructed as "running" and the daemon tried to interact with them in an infinite crash loop.
Observed: btucker-reviewer had 4 "running" records with stale session IDs, each cycling through resume → "No conversation found" → clear session_id → resume again, spamming the daemon log every few seconds.
Fix: AgentSessionNotFound now also removes the agent from the running set, matching the behavior of AgentStopped.
Test plan
- session_not_found_clears_session_id_on_replay test passes
- cargo clippy clean
🤖 Generated with Claude Code