Skip to content

fix: remove agent from running set on AgentSessionNotFound#2330

Open
btucker wants to merge 1 commit intomainfrom
fix/session-not-found-running
Open

fix: remove agent from running set on AgentSessionNotFound#2330
btucker wants to merge 1 commit intomainfrom
fix/session-not-found-running

Conversation

@btucker
Copy link
Copy Markdown
Owner

@btucker btucker commented Apr 5, 2026

Summary

When a session resume fails with "No conversation found", AgentSessionNotFound cleared session_id but left the agent in the running set. On daemon restart (event replay), these agents reconstructed as "running" and the daemon tried to interact with them in an infinite crash loop.

Observed: btucker-reviewer had 4 "running" records with stale session IDs, each cycling through resume → "No conversation found" → clear session_id → resume again, spamming the daemon log every few seconds.

Fix: AgentSessionNotFound now also removes the agent from the running set, matching the behavior of AgentStopped.

Test plan

  • Existing session_not_found_clears_session_id_on_replay test passes
  • Full test suite: 1536 lib tests, 0 failures
  • cargo clippy clean

🤖 Generated with Claude Code

When a session resume fails with "No conversation found", the daemon
emits AgentSessionNotFound which cleared session_id but left the agent
in the running set. On event replay (daemon restart), these agents
reconstructed as "running" and the daemon tried to interact with them
in an infinite loop.

Now AgentSessionNotFound also removes the agent from the running set,
so stale agents are properly marked as stopped.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@codecov
Copy link
Copy Markdown

codecov bot commented Apr 5, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 63.68%. Comparing base (05897b0) to head (d11b83e).

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2330      +/-   ##
==========================================
- Coverage   63.70%   63.68%   -0.02%     
==========================================
  Files         100      100              
  Lines       38582    38585       +3     
==========================================
- Hits        24578    24574       -4     
- Misses      14004    14011       +7     
Files with missing lines Coverage Δ
src/daemon_v2/projections/agents.rs 98.63% <100.00%> (+0.02%) ⬆️

... and 4 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown
Owner Author

@btucker btucker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed: fix/session-not-found-running

The fix is correct. AgentSessionNotFound was clearing session_id but leaving the agent in the running set, which meant on event replay the daemon would reconstruct these agents as "running" and loop trying to interact with a gone session. Adding self.running.remove(&agent_id) here matches the AgentStopped handler at line 116 — good symmetry.

One thing to tighten up:

Test gapsession_not_found_clears_session_id_on_replay (agents_tests.rs:280) only asserts that session_id is None after the event. It doesn't verify the agent was removed from the running set, which is the actual bug this PR fixes. Worth adding:

assert!(
    idx.is_running("a1") == false,  // or !idx.running.contains("a1")
    "AgentSessionNotFound must remove agent from running set"
);

Without that, the test would still pass even if the running.remove line were reverted.

Otherwise LGTM — the fix is correct, scoped, and matches the existing pattern.

🌃 Co-built with Midtown

Copy link
Copy Markdown
Owner Author

@btucker btucker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed — the fix is correct and directly addresses the crash loop. Approve.

Two optional follow-ups (not blocking):

  1. Test gap: session_not_found_clears_session_id_on_replay only asserts session_id is cleared. Worth adding an assertion that the agent is removed from the running set, since that's the actual bug this fixes.

  2. Minor inconsistency with AgentStopped: AgentStopped also clears pid and sets stopped_at. AgentSessionNotFound doesn't. The agent is effectively stopped but retains a stale pid and has no stopped_at timestamp. Probably fine in practice, but could show misleading info in status output.

🌃 Co-built with Midtown

@btucker
Copy link
Copy Markdown
Owner Author

btucker commented Apr 7, 2026

Review

Fix is correct. AgentSessionNotFound clearing session_id without removing from running meant event replay would reconstruct these agents as running (no session_id, but in running set), causing the infinite resume loop described.

The one-line addition self.running.remove(&agent_id) mirrors what AgentStopped does at line 116.

Nit (non-blocking): The existing test session_not_found_clears_session_id_on_replay should also assert !idx.running.contains("a1") to cover the new behavior. Right now the test only checks session_id is cleared but not the running set removal — so a future regression could remove this line and tests would still pass.

LGTM — nice catch on the root cause from the 4 stale btucker-reviewer records.

🌃 Co-built with Midtown

Copy link
Copy Markdown
Owner Author

@btucker btucker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verdict: Approve with minor suggestions

The fix is correct and well-targeted. AgentSessionNotFound clearing session_id without removing from running is a clear bug — on event replay, the agent reconstructs as "running" with no session, causing the infinite resume loop described. Adding self.running.remove(&agent_id) breaks that cycle.

Two minor suggestions:

  1. Consider also clearing pid and setting stopped_atAgentStopped does both (agent.pid = None; agent.stopped_at = Some(Utc::now())). Without these, an agent removed from running via AgentSessionNotFound will still have a stale pid and no stopped_at timestamp. This shouldn't cause functional issues since decision functions check running.contains(), but it leaves the Agent record in an inconsistent state that could confuse debugging or future code that inspects pid/stopped_at directly.

  2. The existing test should assert on running statesession_not_found_clears_session_id_on_replay doesn't verify the agent was removed from the running set, which is the actual bug this PR fixes. Adding assert!(!idx.running.contains("a1")) would pin the new behavior.

Neither blocks merge — the crash loop fix is the important part and it's correct.

🌃 Co-built with Midtown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant