
fix(worker): recover stale DB connections after lock acquisition #753

Draft

thomasrockhu-codecov wants to merge 4 commits into main from fix/finisher-lock-timeout-and-stale-db

Conversation

@thomasrockhu-codecov
Contributor

Summary

After waiting up to 30s for a Redis lock (FINISHER_BLOCKING_TIMEOUT_SECONDS), PgBouncer or the DB server may have closed the now-idle connection. The previous approach, an unconditional db_session.rollback() after lock acquisition, broke the session by expunging objects, causing "Instance is not persistent within this Session" errors in CI and likely in production.

This PR adds _ping_db() — a lightweight SELECT 1 after acquiring each lock. If the connection is dead, it catches OperationalError, rolls back (discarding the dead connection), and lets the next query get a fresh one from the pool. If the connection is healthy (the common case, and always in tests), it's a no-op.

What changed

Site | Before | After
--- | --- | ---
_process_reports_with_lock | db_session.refresh(commit) crashes with psycopg2.OperationalError if the connection is stale | _ping_db() then db_session.refresh(commit) recovers gracefully
_handle_finisher_lock | finish_reports_processing() crashes with psycopg2.OperationalError | _ping_db() then finish_reports_processing() recovers gracefully

Why the previous fix failed

An unconditional db_session.rollback() after lock acquisition expunges not-yet-committed objects (e.g., test-created objects) from the SQLAlchemy session. This caused "Instance '<Commit at ...>' is not persistent within this Session" errors in CI tests and potentially in production.

Why this fix works

  • SELECT 1 surfaces a dead connection without side effects on a healthy one
  • OperationalError catch is narrow — only triggers on actual connection failures
  • rollback() only runs when the connection is already dead, so no healthy session objects are expunged
  • In tests, connections are always fresh, so the except block never executes

Test plan

  • test_ping_db_recovers_stale_connection — simulates OperationalError on execute, verifies rollback is called
  • test_ping_db_noop_on_healthy_connection — verifies no rollback on healthy connection
  • Existing tests pass (no session expunge issues)
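The two new tests could be sketched with a mocked session roughly as follows (OperationalError is again a stand-in for sqlalchemy.exc.OperationalError, and _ping_db is re-declared inline so the sketch is self-contained):

```python
from unittest import mock

class OperationalError(Exception):
    """Stand-in for sqlalchemy.exc.OperationalError (sketch only)."""

def _ping_db(db_session):
    # Minimal inline copy of the helper under test (sketch).
    try:
        db_session.execute("SELECT 1")
    except OperationalError:
        db_session.rollback()

def test_ping_db_recovers_stale_connection():
    session = mock.Mock()
    session.execute.side_effect = OperationalError("server closed the connection")
    _ping_db(session)
    session.rollback.assert_called_once()

def test_ping_db_noop_on_healthy_connection():
    session = mock.Mock()
    _ping_db(session)
    session.rollback.assert_not_called()
```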

Made with Cursor

…quisition

Replace blocking_timeout=None with DEFAULT_BLOCKING_TIMEOUT_SECONDS (5s)
in both LockManager instantiations in upload_finisher. This activates the
existing LockRetry backoff mechanism that was dead code -- tasks now fail
fast and retry via Celery instead of blocking worker threads indefinitely
(up to the 600s soft time limit).

Remove max_retries from locked() calls to disable the shared Redis
counter that incorrectly kills first-attempt tasks when concurrent
tasks fail. Per-task retry limiting via self._has_exceeded_max_attempts()
remains as the sole guard.

Add db_session.rollback() after each lock acquisition to force SQLAlchemy
to release potentially stale connections before any DB operations. This
fixes psycopg2.OperationalError crashes from idle connections that go
stale during the lock wait.

Made-with: Cursor
rollback() expunges objects from the session, causing
"Instance is not persistent within this Session" errors on the
subsequent refresh/query calls. With blocking_timeout=5s, connections
won't go stale during the short lock wait, so the rollback is
unnecessary.
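The expunge behavior can be reproduced in isolation (a demo against an in-memory SQLite database with a stand-in Commit model, not the worker's real one): SQLAlchemy rolls back the INSERT of a flushed-but-uncommitted object and expunges it from the session, after which refresh() raises InvalidRequestError.

```python
from sqlalchemy import create_engine, Column, Integer
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class Commit(Base):  # stand-in model, not the worker's real Commit
    __tablename__ = "commits"
    id = Column(Integer, primary_key=True)

engine = create_engine("sqlite://")  # in-memory DB for the demo
Base.metadata.create_all(engine)

session = Session(engine)
commit = Commit()
session.add(commit)
session.flush()               # INSERT issued; object is persistent in this txn
assert commit in session
session.rollback()            # INSERT rolled back; object expunged
assert commit not in session  # refresh(commit) would now raise InvalidRequestError
```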

Made-with: Cursor
…own=10s

The default 5s blocking_timeout is shorter than typical lock hold times
(20-60s for large repos), causing unnecessary retry bounces that add
100-200s latency. 30s lets the next waiter pick up the lock immediately
in the common case.

The default 200s base_retry_countdown is far too slow when the lock is
genuinely contested. 10s matches bundle-analysis settings and keeps
recovery fast.
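As a back-of-the-envelope check on these latency numbers, here is a rough model of a single waiter (the bounce-and-retry behavior below is an assumption about how the LockRetry/Celery interaction plays out, not taken from the code):

```python
def worst_case_extra_wait(hold_s, blocking_timeout_s, retry_countdown_s):
    """Rough model (assumption) of one waiter's total wait in seconds:
    if the holder releases within blocking_timeout the waiter proceeds
    immediately; otherwise it bounces, sleeping retry_countdown before
    trying again."""
    waited = 0.0
    while True:
        remaining = hold_s - waited
        if remaining <= blocking_timeout_s:
            return waited + max(remaining, 0.0)
        waited += blocking_timeout_s + retry_countdown_s

# Typical 25s lock hold, old defaults: one bounce eats a full 200s countdown.
old = worst_case_extra_wait(25, 5, 200)   # 205.0
# Same hold with a 30s timeout: the waiter picks up the lock as it frees.
new = worst_case_extra_wait(25, 30, 10)   # 25.0
```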

Made-with: Cursor
After waiting up to 30s for a Redis lock, PgBouncer may have closed the
idle DB connection. Add _ping_db() that does SELECT 1 after acquiring
the lock — if the connection is dead, rollback discards it and the next
query gets a fresh one from the pool. Unlike the previous
db_session.rollback() approach, this only rolls back when the connection
is actually stale, so it never expunges healthy session objects.

Made-with: Cursor