Skip to content

feat(worker): schedule one upload_finisher via redis gate#757

Merged
thomasrockhu-codecov merged 26 commits intotomhu/finisher-source-of-truthfrom
tomhu/single-finisher-gate
Mar 11, 2026
Merged

feat(worker): schedule one upload_finisher via redis gate#757
thomasrockhu-codecov merged 26 commits intotomhu/finisher-source-of-truthfrom
tomhu/single-finisher-gate

Conversation

@thomasrockhu-codecov
Copy link
Contributor

@thomasrockhu-codecov thomasrockhu-codecov commented Mar 11, 2026

Summary

  • add commit-level Redis gate (SET NX EX) in processor completion so only one finisher task is enqueued per commit
  • keep processor behavior idempotent by skipping enqueue when gate already exists
  • switch coverage scheduling away from chord finisher callback fanout to direct processor scheduling

Test plan

  • ruff check --fix apps/worker/services/processing/processing.py apps/worker/services/tests/test_processing.py apps/worker/tasks/upload.py apps/worker/tasks/tests/unit/test_upload_task.py
  • ruff format apps/worker/services/processing/processing.py apps/worker/services/tests/test_processing.py apps/worker/tasks/upload.py apps/worker/tasks/tests/unit/test_upload_task.py
  • run worker pytest suite in CI

Stacked on #756.

Made with Cursor


Note

Medium Risk
Changes core upload processing/finishing orchestration (Redis gating, task scheduling, and finisher retry behavior), which can affect merge/notification timing and failure modes if the gate or sweep logic is incorrect.

Overview
Adds a commit-level Redis SET NX EX gate in process_upload so only the first worker that finishes the last needed upload enqueues upload_finisher, passing through checkpoint kwargs when present.

Refactors coverage upload scheduling in tasks/upload.py to stop using a Celery chord callback and instead apply_async each upload_processor task directly, relying on the new gated finisher trigger.

Enhances UploadFinisherTask to manage the gate lifecycle and handle stragglers: it can reschedule itself as a timed “sweep” when coverage uploads are still pending (up to FINISHER_MAX_SWEEP_ATTEMPTS), and ensures the gate key is deleted on completion, lock exhaustion, timeout, or unexpected errors; tests are updated accordingly.

Written by Cursor Bugbot for commit 9add94b. This will update automatically on new commits. Configure here.

@thomasrockhu-codecov thomasrockhu-codecov force-pushed the tomhu/single-finisher-gate branch from bebfe63 to c07692a Compare March 11, 2026 14:21
@thomasrockhu-codecov thomasrockhu-codecov force-pushed the tomhu/single-finisher-gate branch from c07692a to fea2836 Compare March 11, 2026 14:28
@thomasrockhu-codecov thomasrockhu-codecov force-pushed the tomhu/single-finisher-gate branch from 73b8ff6 to 44de224 Compare March 11, 2026 14:56
@thomasrockhu-codecov thomasrockhu-codecov force-pushed the tomhu/single-finisher-gate branch from 44de224 to bf5da1c Compare March 11, 2026 15:06
@thomasrockhu-codecov thomasrockhu-codecov force-pushed the tomhu/single-finisher-gate branch from 5651fd6 to b508cb4 Compare March 11, 2026 15:35
@thomasrockhu-codecov thomasrockhu-codecov force-pushed the tomhu/single-finisher-gate branch from 7e12f77 to c1c5f19 Compare March 11, 2026 15:48
@thomasrockhu-codecov thomasrockhu-codecov force-pushed the tomhu/single-finisher-gate branch from c1c5f19 to e9dbaec Compare March 11, 2026 15:51
Use a commit-level SET NX gate to enqueue at most one upload_finisher task from processors and remove chord callback fanout in coverage scheduling.

Made-with: Cursor
Only attempt finisher enqueue when merge conditions are met, and normalize schedule_task return handling so non-coverage task paths no longer call as_tuple on AsyncResult.

Made-with: Cursor
Restore processor-side gate acquisition behavior and use explicit kwargs dispatch for coverage processor apply_async to keep task scheduling and tests consistent.

Made-with: Cursor
Return processor task IDs as tuples directly and simplify task-id normalization logic to avoid a single-use wrapper class.

Made-with: Cursor
Restore the cleanup branch state and normalize scheduled task IDs through an as_tuple path when available before applying generic fallbacks.

Made-with: Cursor
Remove the helper wrapper for scheduled task ID normalization and use direct as_tuple/id handling at call sites for clearer control flow.

Made-with: Cursor
Keep scheduled_tasks_list and scheduled_tasks handling in the original call-site shape while preserving direct as_tuple/id normalization without the removed helper.

Made-with: Cursor
Use the same direct as_tuple handling as main for chunked and non-chunked scheduled tasks in UploadTask.run_impl.

Made-with: Cursor
Use a Celery group result in coverage scheduling so run_impl's as_tuple handling matches main and avoids tuple-as_tuple runtime failures.

Made-with: Cursor
Keep direct processor apply_async dispatch while returning a Celery ResultSet so existing as_tuple callsites and tests both work.

Made-with: Cursor
@thomasrockhu-codecov thomasrockhu-codecov force-pushed the tomhu/single-finisher-gate branch from 9bce158 to 2d657c4 Compare March 11, 2026 16:14
Keep direct per-upload apply_async calls and return an as_tuple-compatible object of task IDs to match existing run_impl behavior.

Made-with: Cursor
Use group(...).apply_async().results to collect processor task IDs while preserving as_tuple-compatible return shape and avoiding per-task direct apply_async dispatch in run path.

Made-with: Cursor
Pass finisher kwargs through UploadFlow.save_to_kwargs when processor-side gate enqueues the finisher so downstream flow checkpoint timing can be preserved.

Made-with: Cursor
@thomasrockhu-codecov thomasrockhu-codecov force-pushed the tomhu/single-finisher-gate branch from c7a3851 to 0f81b4d Compare March 11, 2026 16:36
Propagate checkpoint logger kwargs from processor arguments to finisher without calling UploadFlow.save_to_kwargs in processor context.

Made-with: Cursor
…uling

Only attempt finisher gate acquisition once all uploads for the commit are eligible to merge, and restore per-task upload processor apply_async kwargs scheduling so existing call-site/test expectations remain intact.

Made-with: Cursor
Use _kwargs_key(UploadFlow) for checkpoint passthrough from processor arguments so finisher receives the expected serialized flow payload.

Made-with: Cursor
Mock finisher enqueue in upload processing task tests and align gate-exists test expectation with should_perform_merge gating so processor-unit coverage does not depend on task router DB tables.

Made-with: Cursor
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

When coverage uploads are still pending, upload_finisher now schedules a delayed sweep instead of exiting, and it clears the commit gate key once finishing work is complete.

Made-with: Cursor
Reuse the canonical finisher gate key helper, cap sweep reschedules, preserve flow breadcrumbs during deferred sweeps, and clear gate keys on terminal failures.

Made-with: Cursor
Delete the gate and return an accurate sweep_scheduled flag when max sweep attempts are exhausted, so commits are not blocked behind stale gate keys.

Made-with: Cursor
Only clear the gate after successful finisher-lock completion so terminal lock failures do not silently drop pending commit post-processing.

Made-with: Cursor
Extend the commit gate expiration whenever a sweep is scheduled so the gate remains valid throughout long-running sweep recovery loops.

Made-with: Cursor
…ustion

Delete the finisher gate even when _handle_finisher_lock returns None (terminal lock exhaustion) and add coverage to prevent gate stalling regressions.

Made-with: Cursor
Update process_upload gate tests so redis set assertions match should_perform_merge behavior for both merge-eligible and non-eligible upload counts.

Made-with: Cursor
@sentry
Copy link

sentry bot commented Mar 11, 2026

Codecov Report

❌ Patch coverage is 85.24590% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.25%. Comparing base (b4a0fcd) to head (9add94b).
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
apps/worker/tasks/upload_finisher.py 77.14% 8 Missing ⚠️
apps/worker/services/processing/processing.py 95.23% 1 Missing ⚠️

❌ Your patch check has failed because the patch coverage (85.24%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@                        Coverage Diff                         @@
##           tomhu/finisher-source-of-truth     #757      +/-   ##
==================================================================
- Coverage                           92.26%   92.25%   -0.01%     
==================================================================
  Files                                1304     1304              
  Lines                               47923    47973      +50     
  Branches                             1628     1628              
==================================================================
+ Hits                                44214    44259      +45     
- Misses                               3400     3405       +5     
  Partials                              309      309              
Flag Coverage Δ
workerintegration 58.70% <62.29%> (+0.01%) ⬆️
workerunit 90.36% <85.24%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@codecov-notifications
Copy link

codecov-notifications bot commented Mar 11, 2026

Codecov Report

❌ Patch coverage is 85.24590% with 9 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
apps/worker/tasks/upload_finisher.py 77.14% 8 Missing ⚠️
apps/worker/services/processing/processing.py 95.23% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

for task_result in scheduled_tasks
if task_result and task_result.id
]
return SimpleNamespace(as_tuple=lambda: tuple(scheduled_task_ids))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this just be celery group?

Comment on lines +59 to +60
FINISHER_SWEEP_COUNTDOWN_SECONDS = 30
FINISHER_MAX_SWEEP_ATTEMPTS = 20
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alphabetize

"commitid": commitid,
"commit_yaml": commit_yaml.to_dict(),
"trigger": "sweep",
"sweep_attempt": sweep_attempt + 1,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sweep attempt shouldn't be necessary. If sweep fails, we have watchdog


repoid = int(repoid)
commit_yaml = UserYaml(commit_yaml)
sweep_attempt = int(kwargs.get("sweep_attempt", 0))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not necessary

# During rolling deploys, older workers can still persist "processed".
# Treat it as terminal here to avoid duplicate merges.
all_already_finalized = all(
upload.state in ("processed", "merged", "error")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's remove processed here, it won't make sense in the new code paths

Comment on lines +103 to +107
checkpoint_kwargs_key = _kwargs_key(UploadFlow)
if checkpoint_kwargs_key in arguments:
finisher_kwargs[checkpoint_kwargs_key] = arguments[
checkpoint_kwargs_key
]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this necessary?

for task_result in scheduled_tasks
if task_result and task_result.id
]
return SimpleNamespace(as_tuple=lambda: tuple(scheduled_task_ids))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this have to exist? why not use celery group?

@thomasrockhu-codecov thomasrockhu-codecov merged commit 9add94b into tomhu/finisher-source-of-truth Mar 11, 2026
35 of 37 checks passed
@thomasrockhu-codecov thomasrockhu-codecov deleted the tomhu/single-finisher-gate branch March 11, 2026 18:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant