fix(worker): recover stale DB connections after lock acquisition#753
Draft
thomasrockhu-codecov wants to merge 4 commits into main from
Conversation
fix(worker): recover stale DB connections after lock acquisition

Replace `blocking_timeout=None` with `DEFAULT_BLOCKING_TIMEOUT_SECONDS` (5s) in both `LockManager` instantiations in `upload_finisher`. This activates the existing `LockRetry` backoff mechanism, which was dead code: tasks now fail fast and retry via Celery instead of blocking worker threads indefinitely (up to the 600s soft time limit).

Remove `max_retries` from `locked()` calls to disable the shared Redis counter that incorrectly kills first-attempt tasks when concurrent tasks fail. Per-task retry limiting via `self._has_exceeded_max_attempts()` remains as the sole guard.

Add `db_session.rollback()` after each lock acquisition to force SQLAlchemy to release potentially stale connections before any DB operations. This fixes `psycopg2.OperationalError` crashes from idle connections that go stale during the lock wait.

Made-with: Cursor
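The fail-fast behavior this commit enables can be sketched as follows. This is an illustrative stand-in, not the worker's actual code: `try_acquire` and `run_with_lock` are hypothetical names, and only `LockRetry` and `DEFAULT_BLOCKING_TIMEOUT_SECONDS` are named in the commit message.

```python
DEFAULT_BLOCKING_TIMEOUT_SECONDS = 5


class LockRetry(Exception):
    """Raised when the lock is busy; carries the Celery retry countdown."""

    def __init__(self, countdown):
        super().__init__(f"lock busy, retry in {countdown}s")
        self.countdown = countdown


def run_with_lock(try_acquire, critical_section,
                  blocking_timeout=DEFAULT_BLOCKING_TIMEOUT_SECONDS,
                  base_retry_countdown=200):
    # try_acquire(timeout) returns True if the Redis lock was obtained
    # within `timeout` seconds. With blocking_timeout=None the old code
    # would block here until the 600s soft time limit instead.
    if not try_acquire(blocking_timeout):
        # Fail fast: hand control back to Celery, which re-schedules the
        # task after `countdown` seconds, freeing the worker thread now.
        raise LockRetry(countdown=base_retry_countdown)
    return critical_section()
```

With `blocking_timeout=None` the `raise LockRetry` branch was unreachable, which is why the commit describes the backoff mechanism as dead code.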
`rollback()` expunges objects from the session, causing "Instance is not persistent within this Session" errors on the subsequent refresh/query calls. With `blocking_timeout=5s`, connections won't go stale during the short lock wait, so the rollback is unnecessary.

Made-with: Cursor
…own=10s

The default 5s `blocking_timeout` is shorter than typical lock hold times (20-60s for large repos), causing unnecessary retry bounces that add 100-200s of latency. 30s lets the next waiter pick up the lock immediately in the common case. The default 200s `base_retry_countdown` is far too slow when the lock is genuinely contested; 10s matches the bundle-analysis settings and keeps recovery fast.

Made-with: Cursor
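As a config fragment, the tuning described above might look like this. `FINISHER_BLOCKING_TIMEOUT_SECONDS` appears in the PR summary; the countdown constant's name is an assumption, since the commit only names the `base_retry_countdown=10s` parameter.

```python
# Wait up to 30s for the Redis lock: long enough to outlast typical
# 20-60s lock holds in the common case, unlike the 5s default.
FINISHER_BLOCKING_TIMEOUT_SECONDS = 30

# When the lock is genuinely contested, retry via Celery after 10s
# (matching the bundle-analysis settings) instead of the 200s default.
# NOTE: this constant name is illustrative, not from the PR diff.
FINISHER_BASE_RETRY_COUNTDOWN = 10
```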
After waiting up to 30s for a Redis lock, PgBouncer may have closed the idle DB connection. Add `_ping_db()`, which issues `SELECT 1` after acquiring the lock; if the connection is dead, `rollback()` discards it and the next query gets a fresh one from the pool. Unlike the previous `db_session.rollback()` approach, this only rolls back when the connection is actually stale, so it never expunges healthy session objects.

Made-with: Cursor
Summary
After waiting up to 30s for a Redis lock (`FINISHER_BLOCKING_TIMEOUT_SECONDS`), PgBouncer or the DB server may have closed the idle connection. The previous approach (`db_session.rollback()` unconditionally after lock acquisition) broke the session by expunging objects, causing `Instance is not persistent within this Session` errors in CI and likely in production.

This PR adds `_ping_db()`, a lightweight `SELECT 1` after acquiring each lock. If the connection is dead, it catches `OperationalError`, rolls back (discarding the dead connection), and lets the next query get a fresh one from the pool. If the connection is healthy (the common case, and always in tests), it's a no-op.

What changed
| Call site | Before | After |
| --- | --- | --- |
| `_process_reports_with_lock` | `db_session.refresh(commit)` crashes with `psycopg2.OperationalError` if the connection is stale | `_ping_db()` → `db_session.refresh(commit)` recovers gracefully |
| `_handle_finisher_lock` | `finish_reports_processing()` crashes with `psycopg2.OperationalError` | `_ping_db()` → `finish_reports_processing()` recovers gracefully |

Why the previous fix failed
Calling `db_session.rollback()` unconditionally after lock acquisition expunges objects that aren't yet committed from the SQLAlchemy session (e.g., test-created objects). This caused `Instance '<Commit at ...>' is not persistent within this Session` errors in CI tests and potentially in production.

Why this fix works
- `SELECT 1` surfaces a dead connection without side effects on a healthy one
- The `OperationalError` catch is narrow and only triggers on actual connection failures
- `rollback()` only runs when the connection is already dead, so no healthy session objects are expunged

Test plan
- `test_ping_db_recovers_stale_connection`: simulates `OperationalError` on execute, verifies `rollback()` is called
- `test_ping_db_noop_on_healthy_connection`: verifies no rollback on a healthy connection

Made with Cursor
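The two tests named in the plan could be sketched with `unittest.mock` along these lines. This is a condensed stand-in for the PR's actual tests: `OperationalError` and `_ping_db` are redefined locally so the sketch is self-contained, where the real tests would import them from the worker.

```python
from unittest.mock import MagicMock


class OperationalError(Exception):
    """Stand-in for sqlalchemy.exc.OperationalError (assumption)."""


def _ping_db(db_session):
    # Condensed copy of the helper under test; see the PR diff for the real one.
    try:
        db_session.execute("SELECT 1")
    except OperationalError:
        db_session.rollback()


def test_ping_db_recovers_stale_connection():
    session = MagicMock()
    # Simulate a connection closed by PgBouncer during the lock wait.
    session.execute.side_effect = OperationalError("connection closed")
    _ping_db(session)
    session.rollback.assert_called_once()


def test_ping_db_noop_on_healthy_connection():
    session = MagicMock()
    _ping_db(session)
    session.rollback.assert_not_called()
```

Using a mock session keeps both tests independent of a live database while still asserting the only observable contract: rollback on a dead connection, no rollback otherwise.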