Skip to content

fix(worker): cap retry countdown to visibility timeout across all tasks#747

Open
drazisil-codecov wants to merge 1 commit intomainfrom
joebecher/fix-enterprise-task-retry-clamping
Open

fix(worker): cap retry countdown to visibility timeout across all tasks#747
drazisil-codecov wants to merge 1 commit intomainfrom
joebecher/fix-enterprise-task-retry-clamping

Conversation

@drazisil-codecov
Copy link
Contributor

@drazisil-codecov drazisil-codecov commented Mar 10, 2026

Problem

Fixes CODECOV-59, WORKER-XPY, ECDN-WORKER-E65.

When a task calls self.retry(countdown=N) where N exceeds TASK_VISIBILITY_TIMEOUT_SECONDS (900s in production), Redis redelivers the message from unacked before the ETA fires. This causes duplicate task executions that compound exponentially under load — the root cause of the bundle analysis queue explosions in CODECOV-59.

A second, related bug (WORKER-XPY, ECDN-WORKER-E65): enterprise tasks in the upload config group receive hard_timelimit=2450s via apply_async(time_limit=N). The previous fix computed safe_window = 900 - 2450 = -1550, which fell below the TASK_RETRY_MIN_SAFE_WINDOW_SECONDS threshold and raised MaxRetriesExceededError, killing legitimate retries for Upload and UploadProcessor tasks.

Solution

BaseCodecovTask.clamp_retry_countdown(countdown)

A new method on BaseCodecovTask caps any retry countdown to the safe window before the visibility timeout expires:

safe_ceiling = TASK_VISIBILITY_TIMEOUT_SECONDS - hard_time_limit_task

A task may have been running for up to hard_time_limit_task seconds when it calls retry(), so the countdown must fit within the remaining visibility window to avoid redelivery.

Enterprise task fallback: when the safe window is smaller than TASK_RETRY_MIN_SAFE_WINDOW_SECONDS (e.g. hard_timelimit=2450s >> visibility_timeout=900s), the method falls back to TASK_RETRY_COUNTDOWN_MAX_SECONDS (870s) instead of raising. This restores retries for enterprise upload tasks while still keeping countdowns within the visibility window.

BaseCodecovTask.retry() override

Clamping is applied automatically in the retry() override whenever a countdown is explicitly provided. No call sites need to be updated.

LockManager

The hardcoded 5-hour cap (MAX_RETRY_COUNTDOWN_SECONDS = 60 * 60 * 5) is replaced with the shared TASK_RETRY_COUNTDOWN_MAX_SECONDS constant, so LockRetry countdowns are also bounded at the source.

New constants (shared/celery_config.py)

Constant Value Purpose
TASK_RETRY_COUNTDOWN_MAX_SECONDS TASK_VISIBILITY_TIMEOUT_SECONDS - 30 (870s) Fallback ceiling for tasks with no hard limit or an oversized one
TASK_RETRY_MIN_SAFE_WINDOW_SECONDS 30 Threshold below which we fall back to the global max

Files changed

  • libs/shared/shared/celery_config.py — two new constants
  • apps/worker/tasks/base.pyhard_time_limit_task return type, clamp_retry_countdown, retry() override
  • apps/worker/services/lock_manager.py — replace local 5-hour cap with shared constant
  • apps/worker/services/tests/test_lock_manager.py — update cap assertions
  • apps/worker/tasks/tests/unit/test_base.pyTestClampRetryCountdown covering normal cap, fallback, enterprise tasks, retry() integration
  • 6 task files — formatting only (single-line self.retry() → multi-line)

Note

Medium Risk
Changes core retry behavior for all BaseCodecovTask tasks by clamping countdown, which could alter retry timing under load or for long-running tasks; includes explicit fallback behavior and new unit coverage to reduce regression risk.

Overview
Adds a global retry-delay ceiling derived from TASK_VISIBILITY_TIMEOUT_SECONDS (TASK_RETRY_COUNTDOWN_MAX_SECONDS) plus a minimum-safe-window threshold, and uses these to prevent Redis from redelivering ETA retries before they fire.

Updates BaseCodecovTask.retry() to automatically clamp any explicit countdown based on the task hard time limit (with a fallback when hard limits exceed the visibility window), and aligns LockManager’s lock-acquisition backoff cap to the same shared limit. Tests are expanded/updated to validate clamping, misconfiguration fallback, and the new lock backoff cap; remaining task-file changes are formatting-only around self.retry() calls.

Written by Cursor Bugbot for commit 18d106e. This will update automatically on new commits. Configure here.

Refs CODECOV-59, WORKER-XPY, ECDN-WORKER-E65

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@linear
Copy link

linear bot commented Mar 10, 2026

@sentry
Copy link

sentry bot commented Mar 10, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.25%. Comparing base (255bc1c) to head (18d106e).
⚠️ Report is 8 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #747   +/-   ##
=======================================
  Coverage   92.25%   92.25%           
=======================================
  Files        1303     1303           
  Lines       47917    47929   +12     
  Branches     1628     1628           
=======================================
+ Hits        44204    44216   +12     
  Misses       3404     3404           
  Partials      309      309           
Flag Coverage Δ
apiunit 96.36% <ø> (ø)
sharedintegration 37.00% <100.00%> (+<0.01%) ⬆️
sharedunit 84.90% <100.00%> (+<0.01%) ⬆️
workerintegration 58.56% <72.72%> (+0.01%) ⬆️
workerunit 90.34% <90.90%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@codecov-notifications
Copy link

codecov-notifications bot commented Mar 10, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.

📢 Thoughts on this report? Let us know!

@codspeed-hq
Copy link

codspeed-hq bot commented Mar 10, 2026

Merging this PR will not alter performance

✅ 9 untouched benchmarks


Comparing joebecher/fix-enterprise-task-retry-clamping (18d106e) with main (3b362b0)1

Open in CodSpeed

Footnotes

  1. No successful run was found on main (255bc1c) during the generation of this report, so 3b362b0 was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant