Skip to content

fix(cron): cancel orphan coroutine on delivery timeout before standalone fallback#13495

Closed
VTRiot wants to merge 2 commits intoNousResearch:mainfrom
VTRiot:fix/cron-cancel-orphan-coroutine
Closed

fix(cron): cancel orphan coroutine on delivery timeout before standalone fallback#13495
VTRiot wants to merge 2 commits intoNousResearch:mainfrom
VTRiot:fix/cron-cancel-orphan-coroutine

Conversation

@VTRiot
Copy link
Copy Markdown
Contributor

@VTRiot VTRiot commented Apr 21, 2026

What does this PR do?

In cron/scheduler.py, two places schedule a coroutine on the gateway event loop via asyncio.run_coroutine_threadsafe() and then block with future.result(timeout=N):

  1. _deliver_result() — the main cron delivery path (timeout=60), which falls back to a standalone delivery path on timeout.
  2. _send_media_via_adapter() — the media send path (timeout=30).

When future.result() raises TimeoutError, the original coroutine is not cancelled — it may still complete on the event loop afterward.

For _deliver_result(), this causes a concrete duplicate-delivery bug: the live-adapter send times out, the standalone fallback runs and succeeds, and the original coroutine eventually also succeeds → the user receives the same cron output twice.

This PR wraps both future.result(timeout=N) calls with a try/except TimeoutError: future.cancel(); raise block. The existing error handling that calls the standalone fallback is untouched — raise lets the TimeoutError propagate to the surrounding except Exception clause exactly as before, but with the orphan coroutine now properly cancelled.

Why this matters more now

After #13021 (fix(cron): run due jobs in parallel to prevent serial tick starvation), multiple cron jobs run concurrently through ThreadPoolExecutor. The chance of multiple in-flight delivery coroutines sharing the same event loop has increased, which also increases the exposure to this orphan-coroutine bug. Cancelling the future on timeout is a small but proactive fix that complements the parallelisation.

Changes Made

  • cron/scheduler.py
    • _deliver_result() (timeout=60): wrap future.result() in try/except TimeoutError: future.cancel(); raise.
    • _send_media_via_adapter() (timeout=30): same pattern for orphan-coroutine prevention on the media send path.
  • tests/cron/test_scheduler.py
    • Add TestDeliverResultTimeoutCancelsFuture and TestSendMediaTimeoutCancelsFuture that exercise the try/except/cancel/raise pattern and assert future.cancel() is invoked.
  • scripts/release.py
    • Register author entry in AUTHOR_MAP (separate commit chore: register VTRiot in AUTHOR_MAP).

How to Test

Reproduction scenario (conceptual)

  1. Configure a cron job with Telegram (or any live-adapter) delivery.
  2. Introduce high latency (>60s) on the delivery path (e.g. network throttling, slow Bot API).
  3. The live-adapter future.result(timeout=60) raises TimeoutError.
  4. The standalone fallback runs and delivers successfully.
  5. The original coroutine eventually completes on the event loop and also delivers.
  6. Result before this PR: the user receives the cron output twice.

After this fix

  • On TimeoutError, future.cancel() aborts the in-flight coroutine before the standalone fallback runs.
  • Only the standalone fallback delivers the message.
  • No duplicate.

Automated tests

pytest tests/cron/test_scheduler.py::TestDeliverResultTimeoutCancelsFuture \
       tests/cron/test_scheduler.py::TestSendMediaTimeoutCancelsFuture -v

Both tests pass locally. Full tests/cron/test_scheduler.py suite: 82 passed / 3 pre-existing failures unrelated to this change (environment-dependent run_job integration tests requiring inference provider credentials).

Implementation detail

The pattern mirrors the existing precedent in tools/mcp_tool.py (the only other in-tree use of future.cancel() on a run_coroutine_threadsafe future):

future = asyncio.run_coroutine_threadsafe(coro, loop)
try:
    result = future.result(timeout=N)
except TimeoutError:
    future.cancel()
    raise

concurrent.futures.Future.cancel() on a future returned by asyncio.run_coroutine_threadsafe() calls asyncio.Task.cancel() on the underlying task. cancel() is idempotent: if the coroutine is already completed, it is a no-op. The re-raise preserves the outer control flow unchanged.

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix (plus AUTHOR_MAP registration as a separate commit)
  • I've run pytest tests/cron/test_scheduler.py -v and the suite shows no regression
  • I've added tests for my changes
  • I've tested on my platform: macOS (Apple Silicon), Python 3.14

Documentation & Housekeeping

  • N/A for all documentation items

VTRiot added 2 commits April 20, 2026 12:54
…one fallback

When the live adapter delivery path (_deliver_result) or media send path
(_send_media_via_adapter) times out at future.result(timeout=N), the
underlying coroutine scheduled via asyncio.run_coroutine_threadsafe can
still complete on the event loop, causing a duplicate send after the
standalone fallback runs.

Cancel the future on TimeoutError before re-raising, so the standalone
fallback is the sole delivery path.

Adds TestDeliverResultTimeoutCancelsFuture and
TestSendMediaTimeoutCancelsFuture.
@teknium1
Copy link
Copy Markdown
Contributor

Merged via PR #13517 — your commits (3cc4d737 chore: register VTRiot in AUTHOR_MAP, 18e7fd83 fix(cron): cancel orphan coroutine on delivery timeout before standalone fallback) landed on main with your authorship preserved via rebase-merge.

The only change from your PR: the two timeout tests were rewritten to invoke _deliver_result and _send_media_via_adapter directly (using a real concurrent.futures.Future whose .result() raises TimeoutError) instead of replicating the try/except pattern inline against a mocked future. Mutation-verified: both rewritten tests fail when the fix is reverted. Same fix, genuinely guarded by the tests now.

Thanks @VTRiot!

@VTRiot
Copy link
Copy Markdown
Contributor Author

VTRiot commented Apr 21, 2026

Thanks for the quick merge and the test refactor — calling the actual
_deliver_result / _send_media_via_adapter via Future replacement is
cleaner than my inline reproduction. I'll use that pattern going forward.
Appreciated!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants