fix: retry metadata/agent-state writes on transient network failure #159

Open
eusip wants to merge 2 commits into happier-dev:dev from eusip:fix/metadata-write-retry

Conversation

@eusip

@eusip eusip commented Apr 23, 2026

Problem

updateMetadataBestEffort and updateAgentStateBestEffort make a single fire-and-forget API call. On mobile or flaky connections a transient network error silently drops the write. The most visible consequence is that claudeSessionId (and equivalent vendor session IDs for other agents) may never be persisted in the session metadata, making the session permanently non-resumable — the eligibility check returns reasonCode: 'vendor_resume_id_missing' and the UI correctly hides the resume option.

Fix

Introduce a shared withRetry helper that makes up to 3 attempts (immediate → +1 s → +2 s backoff) before giving up. Both updateMetadataBestEffort and updateAgentStateBestEffort now use it.

  • Intermediate failures log at debug level with attempt count
  • Final failure logs at debug level with total attempts, same as before
  • Success path is unchanged — no extra delay on the happy path
  • No schema or protocol changes
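
A minimal sketch of what such a helper could look like, reconstructed from the description and review comments in this thread rather than copied from the diff. The `delaysMs` parameter, the `sleep` helper, and the `OnFailure` type name are assumptions added so the example is self-contained and testable; the `onFailure(error, attempt, isFinal)` callback shape follows the sequence diagram in the Greptile review.

```typescript
// Callback invoked after every failed attempt (shape taken from the review thread).
type OnFailure = (error: unknown, attempt: number, isFinal: boolean) => void;

// Delays between attempts: 1 s before attempt 2, 2 s before attempt 3.
const BEST_EFFORT_RETRY_DELAYS_MS: readonly number[] = [1_000, 2_000];

// Hypothetical helper: resolve after `ms` milliseconds.
const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

async function withRetry(
  fn: () => Promise<void>,
  onFailure: OnFailure,
  // Assumption: delays are injectable here only to keep the sketch testable.
  delaysMs: readonly number[] = BEST_EFFORT_RETRY_DELAYS_MS,
): Promise<void> {
  const maxAttempts = delaysMs.length + 1;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await fn();
      return; // success: the happy path never waits
    } catch (error) {
      const isFinal = attempt === maxAttempts;
      onFailure(error, attempt, isFinal);
      if (isFinal) return; // best effort: give up without throwing
      await sleep(delaysMs[attempt - 1] ?? 0);
    }
  }
}
```

A caller would keep the fire-and-forget shape, for example `void withRetry(() => write(), (err, attempt, isFinal) => log(err, attempt, isFinal))`, where `write` and `log` stand in for the real session update and debug logger.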

Why both functions

updateAgentStateBestEffort has the same fire-and-forget pattern and would benefit from the same resilience. Fixing both keeps the two helpers consistent and prevents the same class of bug from being rediscovered later.

Scope

Single file change (sessionWritesBestEffort.ts). No new dependencies.

Summary by CodeRabbit

  • Bug Fixes
    • Improved resilience of best-effort write operations with automatic retry logic. Operations now attempt up to three times with delays between failed attempts, increasing the likelihood of successful completion under temporary failures.
    • Enhanced debugging information to better distinguish between intermediate retry failures and final failures after exhausting all retry attempts, improving troubleshooting capabilities.

updateMetadataBestEffort and updateAgentStateBestEffort previously made
a single fire-and-forget attempt. On flaky or mobile connections a
transient error would silently swallow the write, leaving the session
without a persisted vendorResumeId (e.g. claudeSessionId) and making
the session permanently non-resumable.

Change both helpers to retry up to 3 times (immediate → +1 s → +2 s)
before giving up. Failures on intermediate attempts are logged at debug
level; only the final failure is flagged. Behaviour on success is
unchanged.
@coderabbitai

coderabbitai Bot commented Apr 23, 2026

Walkthrough

Introduces a shared retry mechanism (withRetry) to replace immediate try/catch handling in session write best-effort operations. The mechanism makes up to 3 attempts (2 retries) with delays taken from BEST_EFFORT_RETRY_DELAYS_MS between attempts, and improves logging to distinguish intermediate failures from the final failure after retries are exhausted.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Best-Effort Write Retry Logic (apps/cli/src/api/session/sessionWritesBestEffort.ts) | Refactors updateAgentStateBestEffort and updateMetadataBestEffort to use a new withRetry wrapper. Makes up to 3 attempts, with delays from BEST_EFFORT_RETRY_DELAYS_MS between failures. Enhanced debug logging differentiates intermediate retry failures from final exhaustion of retries. Exported function signatures remain unchanged. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: adding retry logic for metadata and agent-state writes to handle transient network failures. |
| Description check | ✅ Passed | The description includes comprehensive sections covering Problem, Fix, Why both functions, and Scope, addressing the template's Summary and Why requirements thoroughly, though formal How to test steps are not explicitly numbered. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@greptile-apps
Contributor

greptile-apps Bot commented Apr 23, 2026

Greptile Summary

This PR wraps both updateAgentStateBestEffort and updateMetadataBestEffort in a new withRetry helper that makes up to 3 attempts (immediately, then after 1 s and 2 s delays) before giving up, addressing silent write drops on flaky connections that could leave claudeSessionId unpersisted and sessions permanently non-resumable.

The implementation is correct for the current constants: the two-element delay array is indexed only for non-final attempts (indices 0 and 1), so no out-of-bounds access occurs today. Two minor P2 notes: (1) the delay array size is implicitly coupled to BEST_EFFORT_MAX_ATTEMPTS — if one changes without the other, retries silently become 0 ms; (2) all error types are retried, not just transient ones, adding up to 3 s of unnecessary delay on permanent failures.
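
To illustrate note (2): a retry helper could consult a predicate like the one below before scheduling another attempt. This is a hypothetical sketch, not part of the PR, which retries on every error by design. The function name `isLikelyTransient` and the `status` field are assumptions; the string codes are standard Node.js system error codes for connection-level failures.

```typescript
// Node.js system error codes that usually indicate a connection-level problem.
const TRANSIENT_CODES: ReadonlySet<string> = new Set([
  "ECONNRESET",
  "ECONNREFUSED",
  "ETIMEDOUT",
  "EAI_AGAIN",
  "EPIPE",
]);

// Hypothetical predicate: true when retrying the same request might succeed.
function isLikelyTransient(error: unknown): boolean {
  if (typeof error !== "object" || error === null) return false;
  const { code, status } = error as { code?: unknown; status?: unknown };
  if (typeof code === "string" && TRANSIENT_CODES.has(code)) return true;
  // Assumption: some HTTP clients attach the response status to the error.
  return typeof status === "number" && (status === 429 || status >= 500);
}
```

With a check like this, a permanent failure such as a 4xx validation error would stop after the first attempt instead of waiting through the full backoff schedule. The trade-off is that any error shape the predicate does not recognise is treated as permanent, which is one reason to prefer retrying everything in a best-effort path.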

Confidence Score: 5/5

Safe to merge — all findings are P2 style/maintainability concerns with no present-defect risk.

Single file change, core retry logic is correct, happy path unchanged, both helpers behave consistently. Remaining comments are non-blocking quality suggestions.

No files require special attention.

Important Files Changed

| Filename | Overview |
|---|---|
| apps/cli/src/api/session/sessionWritesBestEffort.ts | Adds a withRetry helper (3 attempts, 1 s / 2 s backoff) and wires both best-effort write helpers into it; logic is correct for current constants, two minor P2 maintainability concerns around delay-array sizing and blanket retry on all errors. |

Sequence Diagram

sequenceDiagram
    participant C as Caller
    participant W as withRetry
    participant S as session.update*

    C->>W: void withRetry(fn, onFailure)
    loop attempt 1..3
        W->>S: fn() [updateAgentState / updateMetadata]
        alt success
            S-->>W: resolved
            W-->>C: return (fire-and-forget)
        else failure
            S-->>W: throws / rejects
            W->>W: onFailure(error, attempt, isFinal)
            alt not final attempt
                W->>W: await delay (1 s then 2 s)
            end
        end
    end
    Note over W: after 3 failures: log final debug, give up

Reviews (1): Last reviewed commit: "fix: retry metadata/agent-state writes o..."

Comment thread apps/cli/src/api/session/sessionWritesBestEffort.ts Outdated
Comment thread apps/cli/src/api/session/sessionWritesBestEffort.ts
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
apps/cli/src/api/session/sessionWritesBestEffort.ts (1)

4-5: Couple the delay schedule to the attempt count to avoid silent drift.

BEST_EFFORT_RETRY_DELAYS_MS must have exactly BEST_EFFORT_MAX_ATTEMPTS - 1 entries for the indexing at line 20 to remain correct. If someone later bumps BEST_EFFORT_MAX_ATTEMPTS to 4, BEST_EFFORT_RETRY_DELAYS_MS[attempt - 1] silently becomes undefined and setTimeout(resolve, undefined) collapses to a ~1 ms wait rather than the intended backoff — a subtle regression. Derive one from the other, or assert the invariant.

♻️ Suggested refactor
-const BEST_EFFORT_MAX_ATTEMPTS = 3;
-const BEST_EFFORT_RETRY_DELAYS_MS = [1_000, 2_000];
+const BEST_EFFORT_RETRY_DELAYS_MS = [1_000, 2_000] as const;
+const BEST_EFFORT_MAX_ATTEMPTS = BEST_EFFORT_RETRY_DELAYS_MS.length + 1;
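
The other direction mentioned in this comment, deriving the delay list from the attempt count, could look roughly like the sketch below. `buildBackoffDelays` and its doubling schedule are assumptions for illustration; with a 1 s base and 3 attempts it reproduces the 1 s / 2 s schedule used in this PR.

```typescript
// Hypothetical helper: one delay per retry, doubling from a base delay.
function buildBackoffDelays(maxAttempts: number, baseDelayMs: number): number[] {
  if (!Number.isInteger(maxAttempts) || maxAttempts < 1) {
    throw new RangeError("maxAttempts must be a positive integer");
  }
  // maxAttempts attempts need maxAttempts - 1 waits between them.
  return Array.from({ length: maxAttempts - 1 }, (_, i) => baseDelayMs * 2 ** i);
}

const BEST_EFFORT_MAX_ATTEMPTS = 3;
const BEST_EFFORT_RETRY_DELAYS_MS = buildBackoffDelays(BEST_EFFORT_MAX_ATTEMPTS, 1_000);
// BEST_EFFORT_RETRY_DELAYS_MS is [1000, 2000]
```

Either direction removes the hidden coupling: bumping the attempt count to 4 here yields a third delay of 4 s automatically instead of an undefined array slot.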
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/cli/src/api/session/sessionWritesBestEffort.ts` around lines 4 - 5,
BEST_EFFORT_RETRY_DELAYS_MS is brittle because BEST_EFFORT_MAX_ATTEMPTS can
change while the fixed array length must be exactly BEST_EFFORT_MAX_ATTEMPTS - 1
for the code that uses BEST_EFFORT_RETRY_DELAYS_MS[attempt - 1]; replace the
hardcoded array with a derived array (e.g., generate an exponential/backoff list
of length BEST_EFFORT_MAX_ATTEMPTS - 1 from a base delay) or add a runtime
assert that throws if BEST_EFFORT_RETRY_DELAYS_MS.length !==
BEST_EFFORT_MAX_ATTEMPTS - 1; update the constants at the top
(BEST_EFFORT_MAX_ATTEMPTS and BEST_EFFORT_RETRY_DELAYS_MS) so the generated
delays are always in sync with the attempts used by the retry logic that indexes
by attempt - 1.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/cli/src/api/session/sessionWritesBestEffort.ts`:
- Around line 7-25: withRetry currently leaves Node timers referenced and has no
cancellation, so pending retries can keep the process alive and waste attempts;
modify withRetry to accept an optional AbortSignal parameter (e.g.,
withRetry(fn, onFailure, signal?: AbortSignal)), check signal.aborted before
each attempt and treat an abort as an immediate stop (call onFailure with
isFinal true or simply return), and when scheduling the retry use a reference to
the timer (const t = setTimeout(...)) and call t.unref() so retries don't keep
the event loop alive; additionally, if a signal is provided, listen for
signal.onabort to clearTimeout(t) and short-circuit remaining retries.
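
A hedged sketch of the delay step this comment describes, separate from the retry loop itself. `cancellableSleep` and `SleepResult` are hypothetical names; `unref()` on the timer handle is a Node.js API, and the cast keeps the snippet type-checking in environments where `setTimeout` returns a number.

```typescript
type SleepResult = "elapsed" | "aborted";

// Waits `ms` milliseconds without keeping the Node.js event loop alive,
// and stops early if the optional signal is aborted.
function cancellableSleep(ms: number, signal?: AbortSignal): Promise<SleepResult> {
  if (signal?.aborted) return Promise.resolve<SleepResult>("aborted");
  return new Promise<SleepResult>((resolve) => {
    const onAbort = (): void => {
      clearTimeout(timer);
      resolve("aborted");
    };
    const timer = setTimeout(() => {
      signal?.removeEventListener("abort", onAbort);
      resolve("elapsed");
    }, ms);
    // In Node.js, unref() lets the process exit while this timer is pending.
    (timer as unknown as { unref?: () => void }).unref?.();
    signal?.addEventListener("abort", onAbort, { once: true });
  });
}
```

A retry loop could then check the result after each wait and stop when it is `"aborted"`. Because the timer is unreferenced, a process with no other pending work exits without waiting for the remaining retries.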

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 2baa2312-53a6-4124-bc02-83e993bc6c2b

📥 Commits

Reviewing files that changed from the base of the PR and between 4ee9219 and 64c2286.

📒 Files selected for processing (1)
  • apps/cli/src/api/session/sessionWritesBestEffort.ts

Comment thread apps/cli/src/api/session/sessionWritesBestEffort.ts
…tempts

- Derive BEST_EFFORT_MAX_ATTEMPTS from BEST_EFFORT_RETRY_DELAYS_MS.length + 1
  so the two constants can't silently drift out of sync
- unref() the retry timer so pending best-effort retries never hold the Node
  process open past daemon shutdown
- Add JSDoc to withRetry, updateAgentStateBestEffort, updateMetadataBestEffort
  to satisfy docstring coverage requirement; document intentional retry-all
  behaviour