
fix: improve healthcheck #349

Merged
lindesvard merged 1 commit into main from fix/healthchecks
Apr 24, 2026

Conversation

@lindesvard
Contributor

@lindesvard lindesvard commented Apr 23, 2026

Summary by CodeRabbit

Release Notes

  • New Features

    • Added Kubernetes-compatible health endpoints (/healthz/live, /healthz/ready) for better deployment integration
    • Enhanced health checks with concurrent dependency verification and granular failure reporting
    • Added heartbeat monitoring for event consumer processes
  • Bug Fixes

    • Health checks now gracefully tolerate individual dependency failures and provide detailed per-dependency status
    • Improved graceful shutdown handling for worker processes
  • Chores

    • Updated dependency versions

@coderabbitai
Contributor

coderabbitai Bot commented Apr 23, 2026

📝 Walkthrough

Walkthrough

This PR introduces comprehensive health check improvements across the API and worker services. It adds concurrent dependency health checks (Redis, Database, ClickHouse) with individual failure tolerance in the API healthcheck controller, implements Kubernetes-compatible health endpoints (/healthz/live, /healthz/ready) in the worker service with heartbeat monitoring for events consumers, and creates new utility modules for graceful shutdown state management and events consumer heartbeat tracking.

Changes

Cohort / File(s) — Summary
Health Check Endpoints
apps/api/src/controllers/healthcheck.controller.ts, apps/worker/src/index.ts
Reworked /healthcheck to concurrently evaluate Redis, DB, and ClickHouse with per-dependency error capture; added Kubernetes endpoints /healthz/live (always 200) and /healthz/ready (shutdown-aware with events heartbeat monitoring). Both endpoints now use tryCatch for graceful failure handling and return structured dependency results.
Graceful Shutdown & Heartbeat Utilities
apps/worker/src/utils/graceful-shutdown.ts, apps/worker/src/utils/worker-heartbeat.ts, apps/worker/src/boot-workers.ts
Introduced two new utility modules: graceful-shutdown managing shutdown state flags and worker-heartbeat tracking events consumer activity timestamps. Boot workers now conditionally enable heartbeat on events shard startup and update it on BullMQ/groupmq lifecycle events; error and termination handlers mark process as shutting down before graceful exit.
Common Try-Catch Utility
packages/common/src/try-catch.ts, packages/common/index.ts
Added typed discriminated union TryCatchResult<T, E> and generic tryCatch helper supporting both thunks and direct promises, returning { ok, data, error } structure for safe error handling across the codebase.
Dependency Upgrade
pnpm-workspace.yaml
Updated groupmq from 2.0.0-next.3 to 2.0.0-next.4 in workspace catalog.
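The tryCatch helper summarized above can be sketched roughly as follows. This is a sketch based only on the shape described in the summary ({ ok, data, error } discriminated union, accepting both thunks and direct promises); the actual code in packages/common/src/try-catch.ts may differ in detail:

```typescript
// Sketch of the described helper; exact implementation may differ.
type TryCatchResult<T, E = Error> =
  | { ok: true; data: T; error: null }
  | { ok: false; data: null; error: E };

async function tryCatch<T, E = Error>(
  input: Promise<T> | (() => Promise<T>),
): Promise<TryCatchResult<T, E>> {
  try {
    // Support both a thunk and a direct promise, per the summary.
    const data = await (typeof input === 'function' ? input() : input);
    return { ok: true, data, error: null };
  } catch (error) {
    return { ok: false, data: null, error: error as E };
  }
}
```

Callers can then branch on `ok` without try/catch at each call site.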

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant API as API Healthcheck
    participant Redis as Redis
    participant DB as Database
    participant CH as ClickHouse

    Client->>API: GET /healthcheck
    par Concurrent Checks
        API->>Redis: PING
        API->>DB: SELECT 1
        API->>CH: SELECT 1
    and
        rect rgba(255, 0, 0, 0.5)
            Note over Redis: Executes async
        end
        rect rgba(255, 0, 0, 0.5)
            Note over DB: Executes async
        end
        rect rgba(255, 0, 0, 0.5)
            Note over CH: Executes async
        end
    end
    
    alt All Dependencies Succeed
        Redis-->>API: Connected
        DB-->>API: Responsive
        CH-->>API: Available
        API->>Client: 200 OK {ready: true, dependencies: {...}}
    else Any Dependency Fails
        Redis-->>API: Error
        DB-->>API: Error/Connected
        CH-->>API: Error/Available
        API->>Client: 503 Service Unavailable {ready: false, dependencies: {...}, errors: {...}}
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • feat: graceful shutdown #205: Modifies API healthcheck controller behavior and introduces/consumes graceful-shutdown utilities, with parallel rework of health check dependencies and shutdown handling patterns.

Poem

🐰 Hark, the services now beat as one,
Health checks dance, concurrent and fun,
Redis, Database, ClickHouse aligned,
Graceful shutdown, no stray threads left behind,
A heartbeat keeps the workers alive!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 18.18%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: improving healthcheck functionality across the API and worker services with concurrent dependency checks, better error handling, and Kubernetes-compatible endpoints.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (3)
apps/worker/src/index.ts (2)

29-29: Consider making the stale threshold configurable via env.

EVENTS_HEARTBEAT_STALE_MS is hard-coded to 60s. Since it's tightly coupled to EVENT_BLOCKING_TIMEOUT_SEC (which is already env-configurable), operators who tune the blocking timeout upward will silently get false-positive 503s. An env knob (or derivation from EVENT_BLOCKING_TIMEOUT_SEC * N) would avoid this.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/worker/src/index.ts` at line 29, EVENTS_HEARTBEAT_STALE_MS is hard-coded
to 60_000 and can become inconsistent with the env-configured
EVENT_BLOCKING_TIMEOUT_SEC; make the stale threshold configurable by reading an
env var (e.g., EVENTS_HEARTBEAT_STALE_MS) or derive it from
EVENT_BLOCKING_TIMEOUT_SEC (e.g., multiply by a safety factor) in the same
module where EVENTS_HEARTBEAT_STALE_MS is defined so operators can tune it via
environment and avoid false-positive 503s when EVENT_BLOCKING_TIMEOUT_SEC is
increased.
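The suggested env-driven threshold could be shaped roughly like this. A sketch only: the ×3 safety factor and the 20-second fallback are illustrative assumptions, not values from the PR:

```typescript
// Sketch: prefer an explicit env override, otherwise derive the stale
// threshold from EVENT_BLOCKING_TIMEOUT_SEC with a safety factor.
// The factor of 3 and the 20s fallback are illustrative assumptions.
function resolveStaleMs(env: Record<string, string | undefined>): number {
  const explicit = Number(env.EVENTS_HEARTBEAT_STALE_MS);
  if (Number.isFinite(explicit) && explicit > 0) return explicit;
  const blockingSec = Number(env.EVENT_BLOCKING_TIMEOUT_SEC) || 20;
  return blockingSec * 3 * 1000;
}
```

With this, raising EVENT_BLOCKING_TIMEOUT_SEC automatically raises the stale threshold unless an operator pins it explicitly.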

78-120: DRY: this /healthcheck handler is a near-verbatim duplicate of the API's healthcheck controller.

The same Redis/DB/CH probe logic, the same dependencies/dependencyErrors/failed|workingDependencies shape, and the same 200/503 decision live in apps/api/src/controllers/healthcheck.controller.ts (lines 12–58). Drift between the two is inevitable. Consider extracting a shared helper (e.g. probeDependencies() returning { status, body, log }) into @openpanel/common or a small @openpanel/health package, and calling it from both endpoints.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/worker/src/index.ts` around lines 78 - 120, The worker healthcheck
duplicates the API controller's probe logic; extract the shared probe into a new
helper (e.g. probeDependencies) in a common package (like `@openpanel/common` or
`@openpanel/health`) that performs the Promise.all probes (using
getRedisCache().ping(), db.$executeRaw`SELECT 1`, chQuery('SELECT 1')), computes
dependencies and dependencyErrors, determines status (200/503), builds
failedDependencies and workingDependencies, and returns a structured result
(suggested shape: { status, body, log }) so both the worker's handler in
apps/worker/src/index.ts and the API's healthcheck.controller.ts can call
probeDependencies(), use the returned status to set response status/body (ready,
dependencies, failedDependencies, workingDependencies), and log via logger.warn
using the returned log payload.
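One way the suggested probeDependencies helper could be shaped — a sketch only. Probe functions are injected here to keep the example self-contained; the real helper would wire in getRedisCache().ping(), db.$executeRaw, and chQuery as the comment describes:

```typescript
type Probe = () => Promise<unknown>;

// Sketch of the suggested shared helper. Both the API controller and the
// worker endpoint would call this with their dependency probes.
async function probeDependencies(probes: Record<string, Probe>) {
  const results = await Promise.all(
    Object.keys(probes).map(async (name) => {
      try {
        await probes[name]!();
        return { name, ok: true, error: null as string | null };
      } catch (e) {
        return { name, ok: false, error: e instanceof Error ? e.message : String(e) };
      }
    }),
  );
  const dependencies = Object.fromEntries(results.map((r) => [r.name, r.ok]));
  const failedDependencies = results.filter((r) => !r.ok).map((r) => r.name);
  const dependencyErrors = Object.fromEntries(
    results.filter((r) => !r.ok).map((r) => [r.name, r.error]),
  );
  const status = failedDependencies.length === 0 ? 200 : 503;
  return {
    status,
    body: { ready: status === 200, dependencies, failedDependencies },
    log: dependencyErrors, // payload for logger.warn on failure
  };
}
```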
packages/common/src/try-catch.ts (1)

12-14: Unsafe cast of caught value to E.

catch (error) can receive non-Error values (strings, objects, null), but this blindly casts to E (default Error). Current callers use error?.message with optional chaining so it degrades to undefined rather than crashing, but a safer default would be to normalize via error instanceof Error ? error : new Error(String(error)) when E defaults to Error. Not blocking — flagging for future hardening.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/common/src/try-catch.ts` around lines 12 - 14, The catch block
currently unsafely casts the caught value to E (catch (error) { return { ok:
false, data: null, error: error as E }; }), which can be non-Error; instead
normalize the caught value before returning: when the generic E is the default
Error type, convert non-Error values to an Error (e.g., error instanceof Error ?
error : new Error(String(error))) and then return that normalizedError (cast to
E only as a last step); update the catch in try-catch.ts to build a safe
normalizedError and return { ok: false, data: null, error: normalizedError as E
} so callers can reliably access .message without runtime surprises.
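The normalization the comment proposes could look roughly like this — a hedged sketch, assuming E defaults to Error as in the PR:

```typescript
// Sketch of the suggested hardening: never hand callers a non-Error.
function normalizeError(caught: unknown): Error {
  return caught instanceof Error ? caught : new Error(String(caught));
}

async function tryCatchSafe<T>(fn: () => Promise<T>) {
  try {
    return { ok: true as const, data: await fn(), error: null };
  } catch (caught) {
    // Rejections with strings/objects/null become real Error instances.
    return { ok: false as const, data: null, error: normalizeError(caught) };
  }
}
```

This keeps `error.message` safe to read at every call site without optional chaining.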
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/worker/src/boot-workers.ts`:
- Around line 157-158: Remove the non-existent drained listener (the
worker.on('drained', markEventsActivity) line) and instead implement a periodic
idle heartbeat: set up a timer (e.g., setInterval) that calls markEventsActivity
at a safe interval shorter than the readiness timeout (e.g., 30s) so idle
workers stay healthy; start this timer when the Worker (groupmq Worker) is
initialized and ensure you clear it on terminal events such as 'closed', 'error'
or 'graceful-timeout' to avoid leaks, leaving the existing
worker.on('completed', markEventsActivity) intact for job completion updates.

In `@apps/worker/src/index.ts`:
- Around line 131-152: The readiness endpoint /healthz/ready can return 200
before workers are booted because app.listen(PORT, ...) starts the HTTP server
prior to bootWorkers()/bootCron() calling enableEventsHeartbeat(); change the
startup sequencing so readiness is false until boot complete: either delay
app.listen until after bootWorkers()/bootCron() resolve or add an explicit
booted flag (e.g., bootComplete = false -> true after bootWorkers()/bootCron()
finish) and update the /healthz/ready handler (which currently uses
isShuttingDown() and getEventsHeartbeat()/EVENTS_HEARTBEAT_STALE_MS) to return
503 when bootComplete is false; ensure the flag is set only after
enableEventsHeartbeat() is active so the endpoint never reports ready before
workers are running.

---

Nitpick comments:
In `@apps/worker/src/index.ts`:
- Line 29: EVENTS_HEARTBEAT_STALE_MS is hard-coded to 60_000 and can become
inconsistent with the env-configured EVENT_BLOCKING_TIMEOUT_SEC; make the stale
threshold configurable by reading an env var (e.g., EVENTS_HEARTBEAT_STALE_MS)
or derive it from EVENT_BLOCKING_TIMEOUT_SEC (e.g., multiply by a safety factor)
in the same module where EVENTS_HEARTBEAT_STALE_MS is defined so operators can
tune it via environment and avoid false-positive 503s when
EVENT_BLOCKING_TIMEOUT_SEC is increased.
- Around line 78-120: The worker healthcheck duplicates the API controller's
probe logic; extract the shared probe into a new helper (e.g. probeDependencies)
in a common package (like `@openpanel/common` or `@openpanel/health`) that performs
the Promise.all probes (using getRedisCache().ping(), db.$executeRaw`SELECT 1`,
chQuery('SELECT 1')), computes dependencies and dependencyErrors, determines
status (200/503), builds failedDependencies and workingDependencies, and returns
a structured result (suggested shape: { status, body, log }) so both the
worker's handler in apps/worker/src/index.ts and the API's
healthcheck.controller.ts can call probeDependencies(), use the returned status
to set response status/body (ready, dependencies, failedDependencies,
workingDependencies), and log via logger.warn using the returned log payload.

In `@packages/common/src/try-catch.ts`:
- Around line 12-14: The catch block currently unsafely casts the caught value
to E (catch (error) { return { ok: false, data: null, error: error as E }; }),
which can be non-Error; instead normalize the caught value before returning:
when the generic E is the default Error type, convert non-Error values to an
Error (e.g., error instanceof Error ? error : new Error(String(error))) and then
return that normalizedError (cast to E only as a last step); update the catch in
try-catch.ts to build a safe normalizedError and return { ok: false, data: null,
error: normalizedError as E } so callers can reliably access .message without
runtime surprises.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 19defc40-3643-456f-a719-f80a0e74c3f8

📥 Commits

Reviewing files that changed from the base of the PR and between 7cb7299 and 4955767.

⛔ Files ignored due to path filters (1)
  • pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
📒 Files selected for processing (8)
  • apps/api/src/controllers/healthcheck.controller.ts
  • apps/worker/src/boot-workers.ts
  • apps/worker/src/index.ts
  • apps/worker/src/utils/graceful-shutdown.ts
  • apps/worker/src/utils/worker-heartbeat.ts
  • packages/common/index.ts
  • packages/common/src/try-catch.ts
  • pnpm-workspace.yaml

Comment on lines +157 to +158
worker.on('completed', markEventsActivity);
worker.on('drained', markEventsActivity);

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Inspect groupmq's GroupWorker implementation for emitted events.
fd -t d '^groupmq$' node_modules 2>/dev/null | head -5
rg -nP "emit\s*\(\s*['\"](completed|drained)" --glob '**/groupmq/**' -C2 2>/dev/null || \
  rg -nP "emit\s*\(\s*['\"](completed|drained)" -C2

Repository: Openpanel-dev/openpanel

Length of output: 49


🌐 Web query:

groupmq 2.0.0-next.4 Worker events completed drained

💡 Result:

For “groupmq 2.0.0-next.4 Worker events completed drained” there isn’t enough evidence in the sources I found to answer what GroupMQ emits for “completed” and “drained” worker events in version 2.0.0-next.4 specifically. What I could verify: 1) GroupMQ is a Redis-backed per-group FIFO queue for Node/TypeScript, with a BullMQ-compatible API (site + README). It provides a Worker class and worker options (including concurrency), but the sources I found do not describe any “worker events” named “completed” or “drained”, nor do they document event payload semantics. 2) The only “completed/drained” event behavior I found is for BullMQ (different library). In BullMQ, there is a “drained” event and it has been discussed/changed regarding when it emits (e.g., only once when the worker is idle; and older versions emitting more often). This does not establish GroupMQ’s 2.0.0-next.4 behavior. 3) I located GroupMQ releases up to 2.0.0-next.1 (2026-03-12). I did not find a 2.0.0-next.4 release page/changelog entry in the searched results, so I couldn’t corroborate “2.0.0-next.4” content. Conclusion: I cannot confidently state what “completed” and “drained” worker events are in groupmq 2.0.0-next.4, or whether “drained” exists / how it fires. If you want, paste the exact code snippet where you attach the listeners (e.g., worker.on('completed'...) and worker.on('drained'...)) or share the groupmq 2.0.0-next.4 changelog / worker source file path, and I can re-run a targeted search against the exact implementation and produce a definitive, version-specific answer.


🏁 Script executed:

#!/bin/bash
# 1. Find package files and check groupmq version
fd -type f -name "package*.json" | head -20 | xargs grep -l "groupmq" 2>/dev/null | head -3

# 2. Extract groupmq version from lock/package files
fd -type f -name "package*.json" | xargs cat 2>/dev/null | grep -A 2 '"groupmq"' | head -15

Repository: Openpanel-dev/openpanel

Length of output: 408


🏁 Script executed:

#!/bin/bash
# 3. Read the boot-workers.ts file mentioned in review to see context
cat -n apps/worker/src/boot-workers.ts | sed -n '140,170p'

Repository: Openpanel-dev/openpanel

Length of output: 1416


🏁 Script executed:

#!/bin/bash
# 4. Search for groupmq package in node_modules and find its main export
find node_modules -maxdepth 2 -type d -name groupmq 2>/dev/null | head -1

Repository: Openpanel-dev/openpanel

Length of output: 49


🏁 Script executed:

#!/bin/bash
# 5. Look for GroupWorker class definition in repository
rg -n "class GroupWorker" --type ts --type tsx --type js --type jsx

Repository: Openpanel-dev/openpanel

Length of output: 94


🏁 Script executed:

#!/bin/bash
# Check for groupmq in package locks and version
cat package-lock.json 2>/dev/null | jq '.packages | to_entries[] | select(.key | contains("groupmq")) | {key, version: .value.version}' 2>/dev/null || \
cat pnpm-lock.yaml 2>/dev/null | grep -A 2 "groupmq" | head -10 || \
grep "groupmq" yarn.lock 2>/dev/null | head -5

Repository: Openpanel-dev/openpanel

Length of output: 49


🏁 Script executed:

#!/bin/bash
# Find groupmq location (likely in node_modules)
find node_modules -maxdepth 2 -name "groupmq" -type d 2>/dev/null

Repository: Openpanel-dev/openpanel

Length of output: 49


🏁 Script executed:

#!/bin/bash
# Search for GroupWorker class definition - use correct rg syntax
rg -n "class GroupWorker" -t ts -t js

Repository: Openpanel-dev/openpanel

Length of output: 49


🏁 Script executed:

#!/bin/bash
# Search for groupmq mentions anywhere in the repo
rg "groupmq" -i | head -20

Repository: Openpanel-dev/openpanel

Length of output: 2333


🏁 Script executed:

#!/bin/bash
# Check package.json structure
ls -la package.json 2>/dev/null && head -30 package.json

Repository: Openpanel-dev/openpanel

Length of output: 1319


🏁 Script executed:

#!/bin/bash
# Find where GroupWorker is imported from
rg "GroupWorker" -B 2 -A 2 | head -40

Repository: Openpanel-dev/openpanel

Length of output: 1407


🏁 Script executed:

#!/bin/bash
# Check if groupmq is a workspace package (internal) or external dependency
ls -la packages/groupmq 2>/dev/null || echo "Not found in packages"
cat pnpm-workspace.yaml | grep -A 20 "packages:"

Repository: Openpanel-dev/openpanel

Length of output: 385


🏁 Script executed:

#!/bin/bash
# Check the actual groupmq source - look for it in node_modules or as workspace package
find . -path ./node_modules -prune -o -type f -name "*.ts" -print | xargs grep -l "class GroupWorker\|export.*GroupWorker" 2>/dev/null | head -5

Repository: Openpanel-dev/openpanel

Length of output: 49


🌐 Web query:

Openpanel-dev groupmq 2.0.0-next.4 Worker class events completed drained emit

💡 Result:

For Openpanel-dev/groupmq, the “drained” and “completed” worker/job lifecycle events you mention are not found in the official GroupMQ docs/pages returned by the research. The only worker events explicitly documented in the Openpanel-dev/groupmq README/docs are: ready, completed, failed, error, closed, and graceful-timeout. Additionally, GroupMQ’s documented job completion flow includes moving a successfully processed job into a :completed retention sorted set and unlocking the group lock, but this is internal state management—not an exposed “drained” event. So, based on the sources located for this query, there is no evidence of a “Worker class events … drained … completed … emit” sequence in groupmq 2.0.0-next.4. What is evidenced: - Worker emits documented events including completed (and other events), and drained is not listed. - Job completion flow marks the job as completed in the queue’s :completed sorted set and unlocks the group. If you can share the exact file/snippet from groupmq 2.0.0-next.4 you’re looking at (or the expected event names), I can re-check precisely whether drained exists under a different name, is part of QueueEvents (not Worker), or was removed/changed in the 2.0.0-next.x line.


🏁 Script executed:

#!/bin/bash
# Look at node_modules to find groupmq source
ls -la node_modules/groupmq 2>/dev/null | head -20
# If it exists, check main entry point
cat node_modules/groupmq/package.json 2>/dev/null | jq '.main, .exports'

Repository: Openpanel-dev/openpanel

Length of output: 49


🏁 Script executed:

#!/bin/bash
# Check the actual event listeners being used in boot-workers.ts
grep -n "worker.on\|markEventsActivity" apps/worker/src/boot-workers.ts | head -20

Repository: Openpanel-dev/openpanel

Length of output: 193


🏁 Script executed:

#!/bin/bash
# Try to find groupmq Worker implementation in node_modules
find node_modules/groupmq -name "*.js" -o -name "*.ts" 2>/dev/null | head -20

Repository: Openpanel-dev/openpanel

Length of output: 49


🌐 Web query:

site:github.com/Openpanel-dev/groupmq Worker class emit events implementation

💡 Result:

In libraries like GroupMQ (a BullMQ-inspired Redis job queue for Node.js/TypeScript), the Worker class emits events such as 'ready', 'completed', 'failed', 'error', 'closed', and 'graceful-timeout'. These are listened to using standard Node.js EventEmitter syntax: worker.on('event', handler). Example implementation from GroupMQ documentation: worker.on('ready', () => { console.log('Worker is ready'); }); worker.on('completed', (job) => { console.log('Job completed:', job.id); }); worker.on('failed', (job) => { console.log('Job failed:', job.id, job.failedReason); }); worker.on('error', (error) => { console.error('Worker error:', error); }); worker.on('closed', () => { console.log('Worker closed'); }); worker.on('graceful-timeout', (job) => { console.log('Job exceeded graceful timeout:', job.id); }); // Remove listeners: worker.off('completed', handler); worker.removeAllListeners(); The Worker class extends Node.js EventEmitter, allowing it to emit these lifecycle and job status events for monitoring and handling in queue processing systems.


🏁 Script executed:

#!/bin/bash
# Check the actual GroupMQ source code for Worker.emit calls
# GroupMQ is at catalog version 2.0.0-next.4, search for its source on GitHub
# Let's look at the local boot-workers more carefully to understand the context
cat -n apps/worker/src/boot-workers.ts | sed -n '1,50p'

Repository: Openpanel-dev/openpanel

Length of output: 2182


🏁 Script executed:

#!/bin/bash
# Check what markEventsActivity does and how often it's called
rg -n "markEventsActivity" -B 5 -A 5 apps/worker/src/

Repository: Openpanel-dev/openpanel

Length of output: 2485


🌐 Web query:

site:github.com/Openpanel-dev/groupmq "drained" event Worker emit

💡 Result:

No evidence found of a "drained" event emitted by a Worker in any job queue library, including BullMQ, groupmq, Bee-Queue, or others. Searches across official documentation and code repositories (primarily groupmq README and BullMQ-inspired libraries) show Worker events like 'ready', 'completed', 'failed', 'error', 'closed', and 'graceful-timeout', but no 'drained' event. The query may refer to a custom implementation, a different library, or a misremembered event name. In Bee-Queue, 'drained' exists but as a queue event when no jobs remain, not emitted by Worker.


🏁 Script executed:

#!/bin/bash
# Let's check the actual GroupMQ source from the commit referenced in web results
# Look for any emit calls in groupmq that mention "drained"
rg "emit.*drained" --type ts --type js 2>/dev/null || echo "No 'drained' emit found"

# Also check what events ARE emitted by the Worker class
rg "this\.emit|worker\.emit" apps/worker/src/boot-workers.ts

Repository: Openpanel-dev/openpanel

Length of output: 90


🌐 Web query:

site:github.com/Openpanel-dev/groupmq Worker.emit all events 2.0.0-next

💡 Result:

In the groupmq library (a Redis-backed per-group FIFO queue for Node.js/TypeScript), Worker instances extend TypedEventEmitter and emit the following events: - 'ready': Emitted when the worker is ready. - 'completed': Emitted when a job completes, with the Job object. - 'failed': Emitted when a job fails, with the Job object and failedReason. - 'error': Emitted on worker errors, with the Error. - 'closed': Emitted when the worker closes. - 'graceful-timeout': Emitted when a job exceeds graceful timeout, with the Job object. No version 2.0.0-next or Cloudflare Workers context found matching the query. Latest groupmq version is v1.1.0 (Nov 2025), with next tags like 1.1.1-next.1. Use worker.on(event, handler) to listen and worker.off or worker.removeAllListeners to remove.


Remove the worker.on('drained') listener — the event does not exist in groupmq 2.0.0-next.4.

The groupmq Worker class emits only: 'ready', 'completed', 'failed', 'error', 'closed', and 'graceful-timeout'. There is no 'drained' event. This means the heartbeat will only refresh when jobs complete; during idle periods (no jobs in queue), markEventsActivity() never fires, and the readiness probe will report 503 after 60 seconds.

Either implement a separate idle heartbeat mechanism (e.g., a periodic timer that calls markEventsActivity() at shorter intervals) or accept that the readiness probe will cycle to unhealthy during sustained idle periods.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/worker/src/boot-workers.ts` around lines 157 - 158, Remove the
non-existent drained listener (the worker.on('drained', markEventsActivity)
line) and instead implement a periodic idle heartbeat: set up a timer (e.g.,
setInterval) that calls markEventsActivity at a safe interval shorter than the
readiness timeout (e.g., 30s) so idle workers stay healthy; start this timer
when the Worker (groupmq Worker) is initialized and ensure you clear it on
terminal events such as 'closed', 'error' or 'graceful-timeout' to avoid leaks,
leaving the existing worker.on('completed', markEventsActivity) intact for job
completion updates.
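The periodic idle heartbeat suggested above could be sketched as follows. The 30-second interval is the comment's example value; markEventsActivity is the PR's existing heartbeat setter, and the terminal events named are those groupmq documents ('closed', 'error', 'graceful-timeout'):

```typescript
// Sketch: refresh the heartbeat while the worker is idle, and return a
// stop() so callers can clear the timer on terminal worker events.
function startIdleHeartbeat(
  markEventsActivity: () => void,
  intervalMs = 30_000,
): () => void {
  const timer = setInterval(markEventsActivity, intervalMs);
  // In boot-workers.ts, stop() would be wired to
  // worker.on('closed' | 'error' | 'graceful-timeout', ...) to avoid leaks.
  return function stop() {
    clearInterval(timer);
  };
}
```

Combined with the existing `worker.on('completed', markEventsActivity)`, the readiness probe then stays healthy through sustained idle periods.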

Comment thread apps/worker/src/index.ts
Comment on lines +131 to 152
app.get('/healthz/ready', (req, res) => {
if (isShuttingDown()) {
res.status(503).json({ ready: false, reason: 'shutting down' });
return;
}

const { enabled, lastActivityAt } = getEventsHeartbeat();
if (enabled) {
const idleMs = Date.now() - lastActivityAt;
if (idleMs > EVENTS_HEARTBEAT_STALE_MS) {
res.status(503).json({
ready: false,
reason: 'events consumer heartbeat stale',
idleMs,
thresholdMs: EVENTS_HEARTBEAT_STALE_MS,
});
return;
}
}

res.status(200).json({ ready: true });
});

⚠️ Potential issue | 🟡 Minor

Startup race: /healthz/ready returns 200 before workers are booted.

app.listen(PORT, …) at line 154 runs before bootWorkers() at line 159, and enableEventsHeartbeat() is only called inside bootWorkers. In the window between the HTTP server starting and bootWorkers() completing, enabled is false, so this endpoint reports ready even though no events consumer is running yet. If K8s scrapes readiness in that window it may route traffic / mark the pod ready prematurely.

Consider flipping an explicit "booted" flag after bootWorkers()/bootCron() resolve and returning 503 until then, or starting the HTTP server after boot.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/worker/src/index.ts` around lines 131 - 152, The readiness endpoint
/healthz/ready can return 200 before workers are booted because app.listen(PORT,
...) starts the HTTP server prior to bootWorkers()/bootCron() calling
enableEventsHeartbeat(); change the startup sequencing so readiness is false
until boot complete: either delay app.listen until after
bootWorkers()/bootCron() resolve or add an explicit booted flag (e.g.,
bootComplete = false -> true after bootWorkers()/bootCron() finish) and update
the /healthz/ready handler (which currently uses isShuttingDown() and
getEventsHeartbeat()/EVENTS_HEARTBEAT_STALE_MS) to return 503 when bootComplete
is false; ensure the flag is set only after enableEventsHeartbeat() is active so
the endpoint never reports ready before workers are running.
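The booted-flag approach could be sketched as a pure readiness decision, which keeps it testable. Names like bootComplete are illustrative; isShuttingDown and the heartbeat accessors are the PR's, modeled here as plain inputs:

```typescript
// Sketch: readiness gated on boot completion, shutdown state, and
// heartbeat staleness. bootComplete would flip to true only after
// bootWorkers()/bootCron() resolve and enableEventsHeartbeat() is active.
interface ReadinessInput {
  bootComplete: boolean;
  shuttingDown: boolean;
  heartbeat: { enabled: boolean; lastActivityAt: number };
  staleMs: number;
  now: number;
}

function readiness(input: ReadinessInput): { status: number; reason?: string } {
  if (!input.bootComplete) return { status: 503, reason: 'booting' };
  if (input.shuttingDown) return { status: 503, reason: 'shutting down' };
  if (input.heartbeat.enabled) {
    const idleMs = input.now - input.heartbeat.lastActivityAt;
    if (idleMs > input.staleMs) {
      return { status: 503, reason: 'events consumer heartbeat stale' };
    }
  }
  return { status: 200 };
}
```

The /healthz/ready handler would then just serialize this result, closing the window where K8s could mark the pod ready before workers exist.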

@lindesvard lindesvard merged commit 195b72c into main Apr 24, 2026
6 checks passed
