fix: release periodic lock when no active checkers exist #54
Baoyuantop wants to merge 1 commit into api7:master
Conversation
When all health checkers on a worker are stopped (`active=false`), the timer callback still acquires and renews the `PERIODIC_LOCK`, preventing other workers with active checkers from performing health checks. This causes health checks to become permanently stuck after a disable -> re-enable cycle:

1. Worker A holds the lock with active checkers
2. Health check is disabled: `checker:stop()` sets `active=false`
3. Worker A's timer keeps renewing the lock with no active checkers
4. Health check is re-enabled: Worker B creates a new checker
5. Worker B cannot acquire the lock -> health check never runs

Fix: After acquiring the periodic lock, check whether any active checkers exist. If none, release the lock (`shm:delete`) and return, allowing other workers to acquire the lock and run their active checkers.

Closes: apache/apisix#13235

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
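The failure sequence above can be illustrated with a small, hypothetical Python model (not the real lua-resty-healthcheck code): a plain dict stands in for the nginx shared dict, and `try_acquire` mimics the add-if-absent semantics of `shm:add`, where the current holder renews and everyone else fails.

```python
# Toy model of the stuck-lock scenario. shm is a stand-in for ngx.shared;
# KEY is a stand-in for the real PERIODIC_LOCK key name.
shm = {}
KEY = "periodic-lock"

def try_acquire(worker):
    holder = shm.get(KEY)
    if holder is None:
        shm[KEY] = worker      # shm:add succeeds only when the key is absent
        return True
    return holder == worker    # current holder renews; everyone else fails

assert try_acquire("A")        # worker A takes the lock with active checkers
assert not try_acquire("B")    # pre-fix: A keeps renewing, B is starved forever

# The fix: a holder with no active checkers deletes the key instead of
# renewing, so another worker's next add succeeds.
has_active_checker = False     # all of A's checkers were stopped
if not has_active_checker:
    del shm[KEY]               # models shm:delete(key)

assert try_acquire("B")        # worker B can now run its health checks
```

This is only a sketch of the ownership hand-off; the real code also backs off the timer interval after releasing, so the hand-off happens on a subsequent tick rather than immediately.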
📝 Walkthrough

The code adds conditional logic to the periodic active-check lock renewal process. It scans healthchecker instances to detect whether any active checks remain. If none are active, it deletes the lock, backs off the timer interval by 10x, and returns early. Otherwise, it maintains the normal renewal behavior.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
Pre-merge checks: ✅ 3 passed | ❌ 1 failed (1 warning)
🧹 Nitpick comments (1)
lib/resty/healthcheck.lua (1)
1737-1760: Logic correctly addresses the stuck-lock scenario; consider one small clarification.

The fix is sound:

- `hcs` is a weak-valued table populated via `table.insert(hcs, self)` after `start()` (line 1821), so any visible checker already has its `active` flags correctly set, avoiding a TOCTOU on construction.
- Deleting our own held key is safe — a competing worker's `shm:add` is atomic and will correctly take ownership on the next tick.
- The predicate (`healthy.active or unhealthy.active`) matches the same flags consulted later at lines 1798 and 1806, so it stays in sync if `stop()`/`start()` toggle them.

A couple of small things worth noting (not blockers):

- After `shm:delete(key)` and the 10× back-off, if a user immediately calls `start()` on a new/re-enabled checker on this worker, health checks won't resume until the next tick (up to ~1s). That matches the PR's intent, but it may be worth a short comment so future readers don't assume sub-100ms responsiveness on re-enable.
- The early return also skips the stale-target cleanup block (lines 1765–1796). That's fine in practice (cleanup is moot when nothing is being checked), but it's another reason the back-off should be bounded — which it already is.
📝 Optional clarifying comment

```diff
 if get_periodic_lock(shm, key) then
   -- Check if there are any active checkers before renewing the lock.
   -- When all checkers have been stopped (active=false), the lock holder
   -- must release the lock so that other workers with active checkers
   -- can acquire it.
+  -- Note: after release we back off to CHECK_INTERVAL * 10, so a
+  -- freshly re-enabled checker on this worker may take up to one
+  -- back-off interval before active checks resume.
   local has_active_checker = false
   for _, checker_obj in pairs(hcs) do
     if checker_obj.checks.active.healthy.active or
        checker_obj.checks.active.unhealthy.active then
       has_active_checker = true
       break
     end
   end
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@lib/resty/healthcheck.lua` around lines 1737 - 1760, Add a short clarifying comment inside the branch where we delete our lock and set active_check_timer.interval = CHECK_INTERVAL * 10 (the block that runs after get_periodic_lock(shm, key) and calls shm:delete(key)); state that deleting the key and applying the 10× back-off means newly started or re-enabled checkers on this worker may not resume until the next tick (up to ~1s), and note that this early return also skips the stale-target cleanup block so cleanup won't run while nothing is active. Reference the surrounding symbols get_periodic_lock, shm:delete, active_check_timer.interval and CHECK_INTERVAL, as well as the hcs list and start()/stop() behavior in the comment so future readers understand the trade-off and bounded back-off.
📒 Files selected for processing (1)
lib/resty/healthcheck.lua
Pull request overview
Fixes a multi-worker deadlock scenario where a worker with no active health checkers can keep renewing the shared periodic lock, preventing other workers with active checkers from running health checks.
Changes:

- After acquiring `PERIODIC_LOCK`, detect whether any checker in `hcs` still has active healthy/unhealthy checks.
- If none are active, release the lock from shm and back off the timer interval instead of renewing the lock.
```lua
local has_active_checker = false
for _, checker_obj in pairs(hcs) do
  if checker_obj.checks.active.healthy.active or
     checker_obj.checks.active.unhealthy.active then
    has_active_checker = true
    break
  end
end
if not has_active_checker then
  shm:delete(key)
  active_check_timer.interval = CHECK_INTERVAL * 10
  return
```
The early return when has_active_checker is false skips the periodic cleanup section below (purging targets with purge_time set by delayed_clear). This can leave delayed-clear state in shm indefinitely whenever active checks are disabled/stopped across all checkers on the lock-holding worker. Consider running the cleanup logic before releasing the periodic lock (or making the “no active checker” condition account for pending purge_time work) so delayed_clear still completes even when active checks are turned off.
```lua
if get_periodic_lock(shm, key) then
  -- Check if there are any active checkers before renewing the lock.
  -- When all checkers have been stopped (active=false), the lock holder
  -- must release the lock so that other workers with active checkers
  -- can acquire it.
  local has_active_checker = false
  for _, checker_obj in pairs(hcs) do
    if checker_obj.checks.active.healthy.active or
       checker_obj.checks.active.unhealthy.active then
      has_active_checker = true
      break
    end
  end
  if not has_active_checker then
    shm:delete(key)
    active_check_timer.interval = CHECK_INTERVAL * 10
    return
  end
  active_check_timer.interval = CHECK_INTERVAL
  renew_periodic_lock(shm, key)
```
This change fixes a multi-worker edge case around PERIODIC_LOCK ownership after checker:stop(), but there doesn’t appear to be a regression test covering the disable → re-enable flow described in the PR (one worker stops all checkers, another later creates/starts an active checker and must be able to acquire the periodic lock). Adding a 2-worker Test::Nginx case exercising that flow would help prevent this lock-stuck behavior from reappearing.
Problem

When all health checkers on a worker are stopped (`active=false`), the timer callback still acquires and renews the `PERIODIC_LOCK`, preventing other workers with active checkers from performing health checks.

This causes health checks to become permanently stuck after a disable → re-enable cycle:

1. Worker A holds the lock with active checkers
2. Health check is disabled: `checker:stop()` sets `active=false`
3. Worker A's timer keeps renewing the lock with no active checkers
4. Health check is re-enabled: Worker B creates a new checker
5. Worker B cannot acquire the lock → health check never runs

Root Cause

In the `expire` callback of `active_check_timer`, the lock is renewed unconditionally after acquisition, without checking whether the current worker has any active checkers. Meanwhile, `checker:stop()` only sets `active=false` but does not release the lock.

Fix

After acquiring the periodic lock, check whether there are any active checkers in the `hcs` table. If none exist, release the lock via `shm:delete(key)`, back off the timer interval to `CHECK_INTERVAL * 10`, and return, allowing other workers to acquire the lock.

Related

apache/apisix#13235
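As a sanity check, the "any active checker?" predicate described above can be modeled in Python. The nested field layout is hypothetical but mirrors the `checks.active.{healthy,unhealthy}.active` flags consulted by the Lua code.

```python
# Model of the predicate the patch adds: scan the hcs list and report
# whether any checker still has an active healthy or unhealthy probe.
def has_active_checker(hcs):
    for checker in hcs:
        active = checker["checks"]["active"]
        if active["healthy"]["active"] or active["unhealthy"]["active"]:
            return True
    return False

def make_checker(healthy, unhealthy):
    # Helper building a checker-shaped dict with the two probe flags.
    return {"checks": {"active": {"healthy": {"active": healthy},
                                  "unhealthy": {"active": unhealthy}}}}

stopped = [make_checker(False, False)]           # after checker:stop()
running = stopped + [make_checker(True, False)]  # one re-enabled checker

assert has_active_checker(stopped) is False  # lock should be released
assert has_active_checker(running) is True   # lock renewal is legitimate
```

Note that a single active flag on any checker keeps the lock, so the scan can break out early exactly as the Lua loop does.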