fix(worker): cap retry countdown to visibility timeout across all tasks #747
Open
drazisil-codecov wants to merge 1 commit into main
Refs CODECOV-59, WORKER-XPY, ECDN-WORKER-E65
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #747 +/- ##
=======================================
Coverage 92.25% 92.25%
=======================================
Files 1303 1303
Lines 47917 47929 +12
Branches 1628 1628
=======================================
+ Hits 44204 44216 +12
Misses 3404 3404
Partials 309 309
Problem
Fixes CODECOV-59, WORKER-XPY, ECDN-WORKER-E65.
When a task calls `self.retry(countdown=N)` where `N` exceeds `TASK_VISIBILITY_TIMEOUT_SECONDS` (900s in production), Redis redelivers the message from `unacked` before the ETA fires. This causes duplicate task executions that compound exponentially under load, which is the root cause of the bundle analysis queue explosions in CODECOV-59.

A second, related bug (WORKER-XPY, ECDN-WORKER-E65): enterprise tasks in the `upload` config group receive `hard_timelimit=2450s` via `apply_async(time_limit=N)`. The previous fix computed `safe_window = 900 - 2450 = -1550`, which fell below the `TASK_RETRY_MIN_SAFE_WINDOW_SECONDS` threshold and raised `MaxRetriesExceededError`, killing legitimate retries for `Upload` and `UploadProcessor` tasks.

Solution
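As a rough illustration of the clamping rule this section describes, the sketch below reimplements it as a standalone function. It is not the PR's code: the free function, its `hard_time_limit` parameter, and the 480s example limit are assumptions made for illustration, and capping the normal case at `TASK_RETRY_COUNTDOWN_MAX_SECONDS` as well is a guess. The constants use the values quoted in this description.

```python
# Values quoted in the PR description (production settings).
TASK_VISIBILITY_TIMEOUT_SECONDS = 900
TASK_RETRY_COUNTDOWN_MAX_SECONDS = TASK_VISIBILITY_TIMEOUT_SECONDS - 30  # 870
TASK_RETRY_MIN_SAFE_WINDOW_SECONDS = 30


def clamp_retry_countdown(countdown: int, hard_time_limit: int) -> int:
    """Hypothetical standalone version of the clamping rule.

    `hard_time_limit` stands in for the task's hard time limit in seconds.
    """
    # Time left in the visibility window if the task ran for its full limit.
    safe_window = TASK_VISIBILITY_TIMEOUT_SECONDS - hard_time_limit

    if safe_window < TASK_RETRY_MIN_SAFE_WINDOW_SECONDS:
        # Hard limit is close to or beyond the visibility timeout
        # (e.g. 2450s vs 900s): fall back to the global cap instead of raising.
        cap = TASK_RETRY_COUNTDOWN_MAX_SECONDS
    else:
        # Assumption: the normal case is also bounded by the global cap.
        cap = min(safe_window, TASK_RETRY_COUNTDOWN_MAX_SECONDS)

    return min(countdown, cap)


# Hypothetical task with a 480s hard limit: safe window is 900 - 480 = 420.
print(clamp_retry_countdown(3600, hard_time_limit=480))   # 420
print(clamp_retry_countdown(60, hard_time_limit=480))     # 60 (already safe)
# Enterprise upload case from this PR: 900 - 2450 = -1550, so fall back to 870.
print(clamp_retry_countdown(3600, hard_time_limit=2450))  # 870
```

With this shape, a countdown that is already inside the window passes through unchanged, so only retries that would outlive the visibility timeout are shortened.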
`BaseCodecovTask.clamp_retry_countdown(countdown)`
A new method on `BaseCodecovTask` caps any retry countdown to the safe window before the visibility timeout expires. A task may have been running for up to `hard_time_limit_task` seconds when it calls `retry()`, so the countdown must fit within the remaining visibility window to avoid redelivery.

Enterprise task fallback: when the safe window is smaller than `TASK_RETRY_MIN_SAFE_WINDOW_SECONDS` (e.g. `hard_timelimit=2450s >> visibility_timeout=900s`), the method falls back to `TASK_RETRY_COUNTDOWN_MAX_SECONDS` (870s) instead of raising. This restores retries for enterprise upload tasks while still keeping countdowns within the visibility window.

`BaseCodecovTask.retry()` override
Clamping is applied automatically in the `retry()` override whenever a `countdown` is explicitly provided. No call sites need to be updated.

`LockManager`
The hardcoded 5-hour cap (`MAX_RETRY_COUNTDOWN_SECONDS = 60 * 60 * 5`) is replaced with the shared `TASK_RETRY_COUNTDOWN_MAX_SECONDS` constant, so LockRetry countdowns are also bounded at the source.

New constants (`shared/celery_config.py`)

| Constant | Value |
| --- | --- |
| `TASK_RETRY_COUNTDOWN_MAX_SECONDS` | `TASK_VISIBILITY_TIMEOUT_SECONDS - 30` (870s) |
| `TASK_RETRY_MIN_SAFE_WINDOW_SECONDS` | `30` |

Files changed
- `libs/shared/shared/celery_config.py`: two new constants
- `apps/worker/tasks/base.py`: `hard_time_limit_task` return type, `clamp_retry_countdown`, `retry()` override
- `apps/worker/services/lock_manager.py`: replace local 5-hour cap with shared constant
- `apps/worker/services/tests/test_lock_manager.py`: update cap assertions
- `apps/worker/tasks/tests/unit/test_base.py`: `TestClampRetryCountdown` covering normal cap, fallback, enterprise tasks, `retry()` integration
- remaining task files: formatting only (`self.retry()` → multi-line)

Note
Medium Risk
Changes core retry behavior for all `BaseCodecovTask` tasks by clamping `countdown`, which could alter retry timing under load or for long-running tasks; includes explicit fallback behavior and new unit coverage to reduce regression risk.

Overview
Adds a global retry-delay ceiling derived from `TASK_VISIBILITY_TIMEOUT_SECONDS` (`TASK_RETRY_COUNTDOWN_MAX_SECONDS`) plus a minimum-safe-window threshold, and uses these to prevent Redis from redelivering ETA retries before they fire.

Updates `BaseCodecovTask.retry()` to automatically clamp any explicit `countdown` based on the task hard time limit (with a fallback when hard limits exceed the visibility window), and aligns `LockManager`'s lock-acquisition backoff cap to the same shared limit. Tests are expanded/updated to validate clamping, misconfiguration fallback, and the new lock backoff cap; remaining task-file changes are formatting-only around `self.retry()` calls.

Written by Cursor Bugbot for commit 18d106e.
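The `retry()` override summarized above can be pictured with a small stand-in. This is a sketch under stated assumptions, not the code in `apps/worker/tasks/base.py`: `StubTask` replaces Celery's task base class (the real `retry()` raises a `Retry` exception rather than returning), `ClampingTask` is a hypothetical name, and the clamp here is simplified to the global cap and ignores the per-task hard time limit.

```python
from typing import Any, Optional

TASK_RETRY_COUNTDOWN_MAX_SECONDS = 870  # value quoted in this PR


class StubTask:
    """Stand-in for the Celery task base class (assumption for this sketch).

    It returns the kwargs it received so the effect of the override is visible.
    """

    def retry(self, **kwargs: Any) -> dict:
        return kwargs


class ClampingTask(StubTask):
    """Hypothetical task base that clamps explicit countdowns in retry()."""

    def clamp_retry_countdown(self, countdown: int) -> int:
        # Simplified clamp: only the global ceiling is applied here.
        return min(countdown, TASK_RETRY_COUNTDOWN_MAX_SECONDS)

    def retry(self, countdown: Optional[int] = None, **kwargs: Any) -> dict:
        # Clamp only when the caller passed a countdown explicitly;
        # otherwise leave the arguments untouched.
        if countdown is not None:
            kwargs["countdown"] = self.clamp_retry_countdown(countdown)
        return super().retry(**kwargs)


task = ClampingTask()
print(task.retry(countdown=3600))               # {'countdown': 870}
print(task.retry(countdown=60, max_retries=5))  # {'countdown': 60, 'max_retries': 5}
print(task.retry())                             # {}
```

Putting the clamp in the override is what lets existing `self.retry(countdown=...)` call sites stay as they are.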