fix: retry metadata/agent-state writes on transient network failure by eusip · Pull Request #159 · happier-dev/happier

eusip · 2026-04-23T15:48:14Z

Problem

updateMetadataBestEffort and updateAgentStateBestEffort make a single fire-and-forget API call. On mobile or flaky connections a transient network error silently drops the write. The most visible consequence is that claudeSessionId (and equivalent vendor session IDs for other agents) may never be persisted in the session metadata, making the session permanently non-resumable — the eligibility check returns reasonCode: 'vendor_resume_id_missing' and the UI correctly hides the resume option.

Fix

Introduce a shared withRetry helper that makes up to 3 attempts (immediate → +1 s → +2 s backoff), i.e. one initial call plus at most two retries, before giving up. Both updateMetadataBestEffort and updateAgentStateBestEffort now use it.

- Intermediate failures log at debug level with the attempt count.
- The final failure logs at debug level with the total number of attempts, same as before.
- Success path is unchanged: no extra delay on the happy path.
- No schema or protocol changes.
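The diff itself is not reproduced on this page, so the following is only a minimal sketch of a helper that would match the description above. The constant names and the `onFailure(error, attempt, isFinal)` callback shape come from the review comments further down; the function body and the injectable `sleep` parameter are assumptions added for illustration and may differ from the actual change.

```typescript
// Sketch only: names follow the PR discussion, bodies are assumptions.
const BEST_EFFORT_MAX_ATTEMPTS = 3;
const BEST_EFFORT_RETRY_DELAYS_MS: readonly number[] = [1_000, 2_000];

type OnFailure = (error: unknown, attempt: number, isFinal: boolean) => void;
type Sleep = (ms: number) => Promise<void>;

const defaultSleep: Sleep = (ms) =>
  new Promise<void>((resolve) => {
    setTimeout(resolve, ms);
  });

async function withRetry(
  fn: () => Promise<void>,
  onFailure: OnFailure,
  sleep: Sleep = defaultSleep, // hypothetical injection point, handy for tests
): Promise<void> {
  for (let attempt = 1; attempt <= BEST_EFFORT_MAX_ATTEMPTS; attempt++) {
    try {
      await fn();
      return; // success: the happy path never waits
    } catch (error) {
      const isFinal = attempt === BEST_EFFORT_MAX_ATTEMPTS;
      onFailure(error, attempt, isFinal);
      if (isFinal) return; // best effort: swallow the error and give up
      // Delay index 0 follows attempt 1, index 1 follows attempt 2.
      await sleep(BEST_EFFORT_RETRY_DELAYS_MS[attempt - 1] ?? 0);
    }
  }
}
```

A caller would stay fire-and-forget, for example `void withRetry(() => session.updateMetadata(next), logFailure)`, where `session.updateMetadata` and `logFailure` stand in for whatever the real helpers call.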

Why both functions

updateAgentStateBestEffort has the same fire-and-forget pattern and would benefit from the same resilience. Fixing both keeps the two helpers consistent and prevents the same class of bug from being rediscovered later.

Scope

Single file change (sessionWritesBestEffort.ts). No new dependencies.

Summary by CodeRabbit

Bug Fixes
- Improved resilience of best-effort write operations with automatic retry logic. Operations now attempt up to three times with delays between failed attempts, increasing the likelihood of successful completion under temporary failures.
- Enhanced debugging information to better distinguish between intermediate retry failures and final failures after exhausting all retry attempts, improving troubleshooting capabilities.

updateMetadataBestEffort and updateAgentStateBestEffort previously made a single fire-and-forget attempt. On flaky or mobile connections a transient error would silently swallow the write, leaving the session without a persisted vendorResumeId (e.g. claudeSessionId) and making the session permanently non-resumable. Change both helpers to retry up to 3 times (immediate → +1 s → +2 s) before giving up. Failures on intermediate attempts are logged at debug level; only the final failure is flagged. Behaviour on success is unchanged.

coderabbitai · 2026-04-23T15:48:41Z

Walkthrough

Introduces a shared retry mechanism (withRetry) to replace immediate try/catch handling in session write best-effort operations. The mechanism makes up to 3 attempts (one initial call plus up to two retries) with delays between attempts, and improves logging to distinguish intermediate failures from the final failure after retries are exhausted.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Best-Effort Write Retry Logic `apps/cli/src/api/session/sessionWritesBestEffort.ts` | Refactors `updateAgentStateBestEffort` and `updateMetadataBestEffort` to use a new `withRetry` wrapper. Implements up to 3 attempts with delays from `BEST_EFFORT_RETRY_DELAYS_MS` between failed attempts. Enhanced debug logging differentiates intermediate retry failures from final exhaustion of retries. Exported function signatures remain unchanged. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: adding retry logic for metadata and agent-state writes to handle transient network failures. |
| Description check | ✅ Passed | The description includes comprehensive sections covering Problem, Fix, Why both functions, and Scope, addressing the template's Summary and Why requirements thoroughly, though formal How to test steps are not explicitly numbered. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


greptile-apps · 2026-04-23T15:51:30Z

Greptile Summary

This PR wraps both updateAgentStateBestEffort and updateMetadataBestEffort in a new withRetry helper that retries up to 3 times (0 s → 1 s → 2 s backoff) before giving up, addressing silent write drops on flaky connections that could leave claudeSessionId un-persisted and sessions permanently non-resumable.

The implementation is correct for the current constants: the two-element delay array is indexed only for non-final attempts (indices 0 and 1), so no out-of-bounds access occurs today. Two minor P2 notes: (1) the delay array size is implicitly coupled to BEST_EFFORT_MAX_ATTEMPTS — if one changes without the other, retries silently become 0 ms; (2) all error types are retried, not just transient ones, adding up to 3 s of unnecessary delay on permanent failures.
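To illustrate note (2), here is a hedged sketch of how a retry loop could tell transient failures from permanent ones. Everything in it is an assumption: the PR does not show what errors the API client throws, so the `code` and `status` fields and the name `isLikelyTransient` are hypothetical. The listed codes are standard Node.js system error codes.

```typescript
// Hypothetical error shape: the project's API client may expose neither field.
interface MaybeNetworkError {
  code?: unknown; // Node.js system error code, e.g. "ECONNRESET"
  status?: unknown; // HTTP status, if the client attaches one (assumption)
}

const TRANSIENT_CODES = new Set<string>([
  "ECONNRESET",
  "ECONNREFUSED",
  "ETIMEDOUT",
  "EAI_AGAIN",
  "EPIPE",
]);

function isLikelyTransient(error: unknown): boolean {
  if (typeof error !== "object" || error === null) return false;
  const { code, status } = error as MaybeNetworkError;
  // Socket-level and DNS failures are usually worth retrying.
  if (typeof code === "string" && TRANSIENT_CODES.has(code)) return true;
  // Rate limiting and server-side errors may clear up; other 4xx will not.
  if (typeof status === "number") return status === 429 || status >= 500;
  return false;
}
```

This sketch treats unrecognised errors as permanent. For a best-effort write the opposite default (retry unless the error is known to be permanent) is equally defensible, which is one reason a blanket retry can be a reasonable choice here.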

Confidence Score: 5/5

Safe to merge — all findings are P2 style/maintainability concerns with no present-defect risk.

Single file change, core retry logic is correct, happy path unchanged, both helpers behave consistently. Remaining comments are non-blocking quality suggestions.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| apps/cli/src/api/session/sessionWritesBestEffort.ts | Adds a `withRetry` helper (3 attempts, 1 s / 2 s backoff) and wires both best-effort write helpers into it; logic is correct for current constants, with two minor P2 maintainability concerns around delay-array sizing and blanket retry on all errors. |

Sequence Diagram

sequenceDiagram
    participant C as Caller
    participant W as withRetry
    participant S as session.update*

    C->>W: void withRetry(fn, onFailure)
    loop attempt 1..3
        W->>S: fn() [updateAgentState / updateMetadata]
        alt success
            S-->>W: resolved
            W-->>C: return (fire-and-forget)
        else failure
            S-->>W: throws / rejects
            W->>W: onFailure(error, attempt, isFinal)
            alt not final attempt
                W->>W: await delay (1 s then 2 s)
            end
        end
    end
    Note over W: after 3 failures: log final debug, give up


coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

apps/cli/src/api/session/sessionWritesBestEffort.ts (1)
4-5: Couple the delay schedule to the attempt count to avoid silent drift.

BEST_EFFORT_RETRY_DELAYS_MS must have exactly BEST_EFFORT_MAX_ATTEMPTS - 1 entries for the indexing at line 20 to remain correct. If someone later bumps BEST_EFFORT_MAX_ATTEMPTS to 4, BEST_EFFORT_RETRY_DELAYS_MS[attempt - 1] silently becomes undefined and setTimeout(resolve, undefined) collapses to a ~1 ms wait rather than the intended backoff — a subtle regression. Derive one from the other, or assert the invariant.
♻️ Suggested refactor
-const BEST_EFFORT_MAX_ATTEMPTS = 3;
-const BEST_EFFORT_RETRY_DELAYS_MS = [1_000, 2_000];
+const BEST_EFFORT_RETRY_DELAYS_MS = [1_000, 2_000] as const;
+const BEST_EFFORT_MAX_ATTEMPTS = BEST_EFFORT_RETRY_DELAYS_MS.length + 1;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/cli/src/api/session/sessionWritesBestEffort.ts` around lines 4 - 5,
BEST_EFFORT_RETRY_DELAYS_MS is brittle because BEST_EFFORT_MAX_ATTEMPTS can
change while the fixed array length must be exactly BEST_EFFORT_MAX_ATTEMPTS - 1
for the code that uses BEST_EFFORT_RETRY_DELAYS_MS[attempt - 1]; replace the
hardcoded array with a derived array (e.g., generate an exponential/backoff list
of length BEST_EFFORT_MAX_ATTEMPTS - 1 from a base delay) or add a runtime
assert that throws if BEST_EFFORT_RETRY_DELAYS_MS.length !==
BEST_EFFORT_MAX_ATTEMPTS - 1; update the constants at the top
(BEST_EFFORT_MAX_ATTEMPTS and BEST_EFFORT_RETRY_DELAYS_MS) so the generated
delays are always in sync with the attempts used by the retry logic that indexes
by attempt - 1.
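The two alternatives named in the prompt above (a derived delay list, or a runtime assert) could look roughly like this. Both function names are hypothetical, and the doubling schedule is an assumption chosen because it reproduces the existing 1 s / 2 s delays for three attempts.

```typescript
// Hypothetical helper: derive the delays from the attempt count and a base delay.
function buildRetryDelays(maxAttempts: number, baseDelayMs: number): number[] {
  const retries = Math.max(0, maxAttempts - 1);
  // Doubling schedule: base, 2 * base, 4 * base, ...
  return Array.from({ length: retries }, (_, i) => baseDelayMs * 2 ** i);
}

// Hypothetical guard: fail loudly at module load if the constants drift apart.
function assertRetrySchedule(
  maxAttempts: number,
  delaysMs: readonly number[],
): void {
  if (delaysMs.length !== maxAttempts - 1) {
    throw new Error(
      `Expected ${maxAttempts - 1} retry delays for ${maxAttempts} attempts, ` +
        `got ${delaysMs.length}`,
    );
  }
}

const BEST_EFFORT_MAX_ATTEMPTS = 3;
const BEST_EFFORT_RETRY_DELAYS_MS = buildRetryDelays(BEST_EFFORT_MAX_ATTEMPTS, 1_000);
assertRetrySchedule(BEST_EFFORT_MAX_ATTEMPTS, BEST_EFFORT_RETRY_DELAYS_MS);
```

Deriving the list removes the possibility of drift entirely, while the assert keeps the delays hand-tunable at the cost of a startup check.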

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/cli/src/api/session/sessionWritesBestEffort.ts`:
- Around line 7-25: withRetry currently leaves Node timers referenced and has no
cancellation, so pending retries can keep the process alive and waste attempts;
modify withRetry to accept an optional AbortSignal parameter (e.g.,
withRetry(fn, onFailure, signal?: AbortSignal)), check signal.aborted before
each attempt and treat an abort as an immediate stop (call onFailure with
isFinal true or simply return), and when scheduling the retry use a reference to
the timer (const t = setTimeout(...)) and call t.unref() so retries don't keep
the event loop alive; additionally, if a signal is provided, listen for
signal.onabort to clearTimeout(t) and short-circuit remaining retries.
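As a rough illustration of the timer handling described in this comment, the sketch below shows a delay that neither keeps the Node.js event loop alive nor outlives an abort. The helper name `abortableDelay` and its boolean return value are assumptions; `setTimeout`, `Timeout.unref()`, `clearTimeout` and `AbortSignal` are standard Node.js APIs.

```typescript
// Hypothetical helper: resolves true if the delay elapsed, false if aborted.
function abortableDelay(ms: number, signal?: AbortSignal): Promise<boolean> {
  return new Promise<boolean>((resolve) => {
    if (signal?.aborted) {
      resolve(false); // already cancelled: do not schedule anything
      return;
    }

    const onAbort = (): void => {
      clearTimeout(timer);
      resolve(false);
    };

    const timer = setTimeout(() => {
      signal?.removeEventListener("abort", onAbort);
      resolve(true);
    }, ms);

    // In Node.js the timer object has unref(); in other runtimes it is a
    // plain number, so the optional call is a no-op there.
    (timer as unknown as { unref?: () => void }).unref?.();

    signal?.addEventListener("abort", onAbort, { once: true });
  });
}
```

Inside a retry loop, a `false` result would be the cue to stop retrying. Note that an unref'd timer lets the process exit while a retry is still pending, which fits a best-effort write but means the last attempts may never run during shutdown.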



ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 2baa2312-53a6-4124-bc02-83e993bc6c2b

📥 Commits

Reviewing files that changed from the base of the PR and between 4ee9219 and 64c2286.

📒 Files selected for processing (1)

apps/cli/src/api/session/sessionWritesBestEffort.ts

…tempts
- Derive BEST_EFFORT_MAX_ATTEMPTS from BEST_EFFORT_RETRY_DELAYS_MS.length + 1 so the two constants can't silently drift out of sync
- unref() the retry timer so pending best-effort retries never hold the Node process open past daemon shutdown
- Add JSDoc to withRetry, updateAgentStateBestEffort, updateMetadataBestEffort to satisfy docstring coverage requirement; document intentional retry-all behaviour

greptile-apps Bot reviewed Apr 23, 2026

Comment thread apps/cli/src/api/session/sessionWritesBestEffort.ts Outdated

Comment thread apps/cli/src/api/session/sessionWritesBestEffort.ts

coderabbitai Bot requested changes Apr 23, 2026

Comment thread apps/cli/src/api/session/sessionWritesBestEffort.ts

coderabbitai Bot approved these changes Apr 23, 2026

eusip wants to merge 2 commits into happier-dev:dev from eusip:fix/metadata-write-retry
