
fix(worker): recover stale DB connections after lock acquisition #753

Draft

thomasrockhu-codecov wants to merge 4 commits into main from fix/finisher-lock-timeout-and-stale-db

Conversation

@thomasrockhu-codecov
Contributor

Summary

After waiting up to 30s for a Redis lock (FINISHER_BLOCKING_TIMEOUT_SECONDS), PgBouncer or the DB server may have closed the now-idle connection. The previous approach, an unconditional db_session.rollback() after lock acquisition, broke the session by expunging objects, causing "Instance is not persistent within this Session" errors in CI and likely in production.

This PR adds _ping_db() — a lightweight SELECT 1 after acquiring each lock. If the connection is dead, it catches OperationalError, rolls back (discarding the dead connection), and lets the next query get a fresh one from the pool. If the connection is healthy (the common case, and always in tests), it's a no-op.

What changed

Site | Before | After
--- | --- | ---
_process_reports_with_lock | db_session.refresh(commit) crashes with psycopg2.OperationalError if the connection is stale | _ping_db() then db_session.refresh(commit) recovers gracefully
_handle_finisher_lock | finish_reports_processing() crashes with psycopg2.OperationalError | _ping_db() then finish_reports_processing() recovers gracefully

Why the previous fix failed

An unconditional db_session.rollback() after lock acquisition expunges not-yet-committed objects (e.g., test-created objects) from the SQLAlchemy session. This caused "Instance '<Commit at ...>' is not persistent within this Session" errors in CI tests and potentially in production.

Why this fix works

  • SELECT 1 surfaces a dead connection without side effects on a healthy one
  • OperationalError catch is narrow — only triggers on actual connection failures
  • rollback() only runs when the connection is already dead, so no healthy session objects are expunged
  • In tests, connections are always fresh, so the except block never executes

Test plan

  • test_ping_db_recovers_stale_connection — simulates OperationalError on execute, verifies rollback is called
  • test_ping_db_noop_on_healthy_connection — verifies no rollback on healthy connection
  • Existing tests pass (no session expunge issues)
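The two new tests could be sketched with a mocked session roughly as follows (OperationalError is again a stand-in for sqlalchemy.exc.OperationalError, and _ping_db is re-declared inline so the sketch is self-contained):

```python
from unittest import mock

class OperationalError(Exception):
    """Stand-in for sqlalchemy.exc.OperationalError (sketch only)."""

def _ping_db(db_session):
    # Minimal inline copy of the helper under test (sketch).
    try:
        db_session.execute("SELECT 1")
    except OperationalError:
        db_session.rollback()

def test_ping_db_recovers_stale_connection():
    session = mock.Mock()
    session.execute.side_effect = OperationalError("server closed the connection")
    _ping_db(session)
    session.rollback.assert_called_once()

def test_ping_db_noop_on_healthy_connection():
    session = mock.Mock()
    _ping_db(session)
    session.rollback.assert_not_called()
```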

Made with Cursor

…quisition

Replace blocking_timeout=None with DEFAULT_BLOCKING_TIMEOUT_SECONDS (5s)
in both LockManager instantiations in upload_finisher. This activates the
existing LockRetry backoff mechanism that was dead code -- tasks now fail
fast and retry via Celery instead of blocking worker threads indefinitely
(up to the 600s soft time limit).

Remove max_retries from locked() calls to disable the shared Redis
counter that incorrectly kills first-attempt tasks when concurrent
tasks fail. Per-task retry limiting via self._has_exceeded_max_attempts()
remains as the sole guard.

Add db_session.rollback() after each lock acquisition to force SQLAlchemy
to release potentially stale connections before any DB operations. This
fixes psycopg2.OperationalError crashes from idle connections that go
stale during the lock wait.

Made-with: Cursor
rollback() expunges objects from the session, causing
"Instance is not persistent within this Session" errors on the
subsequent refresh/query calls. With blocking_timeout=5s, connections
won't go stale during the short lock wait, so the rollback is
unnecessary.
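The expunge behavior can be reproduced in isolation (a demo against an in-memory SQLite database with a stand-in Commit model, not the worker's real one): SQLAlchemy rolls back the INSERT of a flushed-but-uncommitted object and expunges it from the session, after which refresh() raises InvalidRequestError.

```python
from sqlalchemy import create_engine, Column, Integer
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class Commit(Base):  # stand-in model, not the worker's real Commit
    __tablename__ = "commits"
    id = Column(Integer, primary_key=True)

engine = create_engine("sqlite://")  # in-memory DB for the demo
Base.metadata.create_all(engine)

session = Session(engine)
commit = Commit()
session.add(commit)
session.flush()               # INSERT issued; object is persistent in this txn
assert commit in session
session.rollback()            # INSERT rolled back; object expunged
assert commit not in session  # refresh(commit) would now raise InvalidRequestError
```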

Made-with: Cursor
…own=10s

The default 5s blocking_timeout is shorter than typical lock hold times
(20-60s for large repos), causing unnecessary retry bounces that add
100-200s latency. 30s lets the next waiter pick up the lock immediately
in the common case.

The default 200s base_retry_countdown is far too slow when the lock is
genuinely contested. 10s matches bundle-analysis settings and keeps
recovery fast.
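As a back-of-the-envelope check on these latency numbers, here is a rough model of a single waiter (the bounce-and-retry behavior below is an assumption about how the LockRetry/Celery interaction plays out, not taken from the code):

```python
def worst_case_extra_wait(hold_s, blocking_timeout_s, retry_countdown_s):
    """Rough model (assumption) of one waiter's total wait in seconds:
    if the holder releases within blocking_timeout the waiter proceeds
    immediately; otherwise it bounces, sleeping retry_countdown before
    trying again."""
    waited = 0.0
    while True:
        remaining = hold_s - waited
        if remaining <= blocking_timeout_s:
            return waited + max(remaining, 0.0)
        waited += blocking_timeout_s + retry_countdown_s

# Typical 25s lock hold, old defaults: one bounce eats a full 200s countdown.
old = worst_case_extra_wait(25, 5, 200)   # 205.0
# Same hold with a 30s timeout: the waiter picks up the lock as it frees.
new = worst_case_extra_wait(25, 30, 10)   # 25.0
```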

Made-with: Cursor
After waiting up to 30s for a Redis lock, PgBouncer may have closed the
idle DB connection. Add _ping_db() that does SELECT 1 after acquiring
the lock — if the connection is dead, rollback discards it and the next
query gets a fresh one from the pool. Unlike the previous
db_session.rollback() approach, this only rolls back when the connection
is actually stale, so it never expunges healthy session objects.

Made-with: Cursor