fix: retry metadata/agent-state writes on transient network failure#159
fix: retry metadata/agent-state writes on transient network failure#159eusip wants to merge 2 commits intohappier-dev:devfrom
Conversation
updateMetadataBestEffort and updateAgentStateBestEffort previously made a single fire-and-forget attempt. On flaky or mobile connections a transient error would silently swallow the write, leaving the session without a persisted vendorResumeId (e.g. claudeSessionId) and making the session permanently non-resumable. Change both helpers to retry up to 3 times (immediate → +1 s → +2 s) before giving up. Failures on intermediate attempts are logged at debug level; only the final failure is flagged. Behaviour on success is unchanged.
WalkthroughIntroduces a shared retry mechanism ( Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Greptile SummaryThis PR wraps both The implementation is correct for the current constants: the two-element delay array is indexed only for non-final attempts (indices 0 and 1), so no out-of-bounds access occurs today. Two minor P2 notes: (1) the delay array size is implicitly coupled to Confidence Score: 5/5Safe to merge — all findings are P2 style/maintainability concerns with no present-defect risk. Single file change, core retry logic is correct, happy path unchanged, both helpers behave consistently. Remaining comments are non-blocking quality suggestions. No files require special attention.
|
| Filename | Overview |
|---|---|
| apps/cli/src/api/session/sessionWritesBestEffort.ts | Adds a withRetry helper (3 attempts, 1 s / 2 s backoff) and wires both best-effort write helpers into it; logic is correct for current constants, two minor P2 maintainability concerns around delay-array sizing and blanket retry on all errors. |
Sequence Diagram
sequenceDiagram
participant C as Caller
participant W as withRetry
participant S as session.update*
C->>W: void withRetry(fn, onFailure)
loop attempt 1..3
W->>S: fn() [updateAgentState / updateMetadata]
alt success
S-->>W: resolved
W-->>C: return (fire-and-forget)
else failure
S-->>W: throws / rejects
W->>W: onFailure(error, attempt, isFinal)
alt not final attempt
W->>W: await delay (1 s then 2 s)
end
end
end
Note over W: after 3 failures: log final debug, give up
Reviews (1): Last reviewed commit: "fix: retry metadata/agent-state writes o..." | Re-trigger Greptile
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
apps/cli/src/api/session/sessionWritesBestEffort.ts (1)
4-5: Couple the delay schedule to the attempt count to avoid silent drift.
BEST_EFFORT_RETRY_DELAYS_MSmust have exactlyBEST_EFFORT_MAX_ATTEMPTS - 1entries for the indexing at line 20 to remain correct. If someone later bumpsBEST_EFFORT_MAX_ATTEMPTSto 4,BEST_EFFORT_RETRY_DELAYS_MS[attempt - 1]silently becomesundefinedandsetTimeout(resolve, undefined)collapses to a ~1 ms wait rather than the intended backoff — a subtle regression. Derive one from the other, or assert the invariant.♻️ Suggested refactor
-const BEST_EFFORT_MAX_ATTEMPTS = 3; -const BEST_EFFORT_RETRY_DELAYS_MS = [1_000, 2_000]; +const BEST_EFFORT_RETRY_DELAYS_MS = [1_000, 2_000] as const; +const BEST_EFFORT_MAX_ATTEMPTS = BEST_EFFORT_RETRY_DELAYS_MS.length + 1;🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/cli/src/api/session/sessionWritesBestEffort.ts` around lines 4 - 5, BEST_EFFORT_RETRY_DELAYS_MS is brittle because BEST_EFFORT_MAX_ATTEMPTS can change while the fixed array length must be exactly BEST_EFFORT_MAX_ATTEMPTS - 1 for the code that uses BEST_EFFORT_RETRY_DELAYS_MS[attempt - 1]; replace the hardcoded array with a derived array (e.g., generate an exponential/backoff list of length BEST_EFFORT_MAX_ATTEMPTS - 1 from a base delay) or add a runtime assert that throws if BEST_EFFORT_RETRY_DELAYS_MS.length !== BEST_EFFORT_MAX_ATTEMPTS - 1; update the constants at the top (BEST_EFFORT_MAX_ATTEMPTS and BEST_EFFORT_RETRY_DELAYS_MS) so the generated delays are always in sync with the attempts used by the retry logic that indexes by attempt - 1.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apps/cli/src/api/session/sessionWritesBestEffort.ts`:
- Around line 7-25: withRetry currently leaves Node timers referenced and has no
cancellation, so pending retries can keep the process alive and waste attempts;
modify withRetry to accept an optional AbortSignal parameter (e.g.,
withRetry(fn, onFailure, signal?: AbortSignal)), check signal.aborted before
each attempt and treat an abort as an immediate stop (call onFailure with
isFinal true or simply return), and when scheduling the retry use a reference to
the timer (const t = setTimeout(...)) and call t.unref() so retries don't keep
the event loop alive; additionally, if a signal is provided, listen for
signal.onabort to clearTimeout(t) and short-circuit remaining retries.
---
Nitpick comments:
In `@apps/cli/src/api/session/sessionWritesBestEffort.ts`:
- Around line 4-5: BEST_EFFORT_RETRY_DELAYS_MS is brittle because
BEST_EFFORT_MAX_ATTEMPTS can change while the fixed array length must be exactly
BEST_EFFORT_MAX_ATTEMPTS - 1 for the code that uses
BEST_EFFORT_RETRY_DELAYS_MS[attempt - 1]; replace the hardcoded array with a
derived array (e.g., generate an exponential/backoff list of length
BEST_EFFORT_MAX_ATTEMPTS - 1 from a base delay) or add a runtime assert that
throws if BEST_EFFORT_RETRY_DELAYS_MS.length !== BEST_EFFORT_MAX_ATTEMPTS - 1;
update the constants at the top (BEST_EFFORT_MAX_ATTEMPTS and
BEST_EFFORT_RETRY_DELAYS_MS) so the generated delays are always in sync with the
attempts used by the retry logic that indexes by attempt - 1.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 2baa2312-53a6-4124-bc02-83e993bc6c2b
📒 Files selected for processing (1)
apps/cli/src/api/session/sessionWritesBestEffort.ts
…tempts - Derive BEST_EFFORT_MAX_ATTEMPTS from BEST_EFFORT_RETRY_DELAYS_MS.length + 1 so the two constants can't silently drift out of sync - unref() the retry timer so pending best-effort retries never hold the Node process open past daemon shutdown - Add JSDoc to withRetry, updateAgentStateBestEffort, updateMetadataBestEffort to satisfy docstring coverage requirement; document intentional retry-all behaviour
Problem
updateMetadataBestEffortandupdateAgentStateBestEffortmake a single fire-and-forget API call. On mobile or flaky connections a transient network error silently drops the write. The most visible consequence is thatclaudeSessionId(and equivalent vendor session IDs for other agents) may never be persisted in the session metadata, making the session permanently non-resumable — the eligibility check returnsreasonCode: 'vendor_resume_id_missing'and the UI correctly hides the resume option.Fix
Introduce a shared
withRetryhelper that retries up to 3 times (immediate → +1 s → +2 s backoff) before giving up. BothupdateMetadataBestEffortandupdateAgentStateBestEffortnow use it.debuglevel with attempt countdebuglevel with total attempts, same as beforeWhy both functions
updateAgentStateBestEfforthas the same fire-and-forget pattern and would benefit from the same resilience. Fixing both keeps the two helpers consistent and prevents the same class of bug from being rediscovered later.Scope
Single file change (
sessionWritesBestEffort.ts). No new dependencies.Summary by CodeRabbit