
fix: release periodic lock when no active checkers exist #54

Open
Baoyuantop wants to merge 1 commit into api7:master from Baoyuantop:fix/release-lock-when-no-active-checkers

Conversation


@Baoyuantop Baoyuantop commented Apr 17, 2026

Problem

When all health checkers on a worker are stopped (active=false), the timer callback still acquires and renews the PERIODIC_LOCK, preventing other workers with active checkers from performing health checks.

This causes health checks to become permanently stuck after a disable → re-enable cycle:

  1. Worker A holds the lock with active checkers — health check works ✓
  2. Health check is disabled → checker:stop() sets active=false
  3. Worker A's timer keeps running → acquires lock → renews it → but all checkers are inactive → does nothing
  4. Health check is re-enabled → Worker B creates new active checker
  5. Worker B's timer tries to acquire lock → fails (Worker A holds it) → backs off to CHECK_INTERVAL * 10
  6. Result: Worker A holds lock but does no work; Worker B has active checker but can't get lock
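
The sequence above can be made concrete with a small toy model of the pre-fix lock behavior. This is a sketch, not the library's code: the real lock lives in an ngx.shared dict with a TTL, while here a plain dict, the worker names, and the tick() function are all illustrative.

```python
# Toy model of the pre-fix PERIODIC_LOCK behavior. The shared dict,
# worker names, and tick() are illustrative, not the real library code.

shm = {}  # stands in for the ngx.shared dict holding PERIODIC_LOCK

def tick(worker, has_active_checker):
    """One timer callback with the buggy logic: the lock holder renews
    unconditionally, ignoring has_active_checker entirely."""
    holder = shm.get("lock")
    if holder is None or holder == worker:
        shm["lock"] = worker      # acquire or renew (resets the TTL)
        return "holds lock"
    return "backs off"            # CHECK_INTERVAL * 10

print(tick("A", has_active_checker=True))   # A acquires: "holds lock"
print(tick("A", has_active_checker=False))  # checkers stopped, still: "holds lock"
print(tick("B", has_active_checker=True))   # B is stuck: "backs off"
```

Because the renewal never consults has_active_checker, worker A keeps the lock alive forever and worker B's acquire can never succeed.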

Root Cause

In the expire callback of active_check_timer:

if get_periodic_lock(shm, key) then
  active_check_timer.interval = CHECK_INTERVAL
  renew_periodic_lock(shm, key)   -- unconditionally renews!
else
  active_check_timer.interval = CHECK_INTERVAL * 10
  return
end

The lock is renewed unconditionally after acquisition, without checking whether the current worker has any active checkers. Meanwhile, checker:stop() only sets active=false but does not release the lock.

Fix

After acquiring the periodic lock, check if there are any active checkers in the hcs table. If none exist, release the lock via shm:delete(key) and return, allowing other workers to acquire the lock.

if get_periodic_lock(shm, key) then
  local has_active_checker = false
  for _, checker_obj in pairs(hcs) do
    if checker_obj.checks.active.healthy.active or
       checker_obj.checks.active.unhealthy.active then
      has_active_checker = true
      break
    end
  end
  if not has_active_checker then
    shm:delete(key)
    active_check_timer.interval = CHECK_INTERVAL * 10
    return
  end
  active_check_timer.interval = CHECK_INTERVAL
  renew_periodic_lock(shm, key)
  ...

Related

Summary by CodeRabbit

  • Performance
    • Optimized periodic health check operations to skip unnecessary lock renewals when no active checks are enabled, reducing system resource consumption.

When all health checkers on a worker are stopped (active=false), the
timer callback still acquires and renews the PERIODIC_LOCK, preventing
other workers with active checkers from performing health checks.

This causes health checks to become permanently stuck after a
disable -> re-enable cycle:

1. Worker A holds the lock with active checkers
2. Health check is disabled: checker:stop() sets active=false
3. Worker A's timer keeps renewing the lock with no active checkers
4. Health check is re-enabled: Worker B creates new checker
5. Worker B cannot acquire the lock -> health check never runs

Fix: After acquiring the periodic lock, check if any active checkers
exist. If none, release the lock (shm:delete) and return, allowing
other workers to acquire the lock and run their active checkers.

Closes: apache/apisix#13235

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 17, 2026 03:45

coderabbitai bot commented Apr 17, 2026

📝 Walkthrough

The code adds conditional logic to the periodic active-check lock renewal process. It scans healthchecker instances to detect whether any active checks remain. If none are active, it deletes the lock, backs off the timer interval by 10x, and returns early. Otherwise, it maintains the normal renewal behavior.

Changes

Cohort / File(s): Lock Renewal Optimization — lib/resty/healthcheck.lua
Summary: Adds a conditional check during periodic lock renewal to detect inactive healthcheckers, delete the lock, and apply a 10× back-off to the timer interval when no active checks exist, otherwise proceeding with standard lock renewal.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

E2e Test Quality Review — ⚠️ Warning
Explanation: PR lacks E2E tests for the concurrency fix and has an unprotected race condition in the hcs table iteration.
Resolution: Add E2E tests with a multi-worker configuration and protect the hcs table iteration with synchronization or a safe-snapshot approach.
✅ Passed checks (3 passed)
Description Check — ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
Title Check — ✅ Passed: The title 'fix: release periodic lock when no active checkers exist' directly and clearly summarizes the main change in the pull request.
Security Check — ✅ Passed: The 17-line addition safely checks boolean flags on local healthchecker instances, properly releases shared-memory locks when no active checkers exist, and improves resource isolation without introducing security vulnerabilities across all seven vulnerability categories.




@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
lib/resty/healthcheck.lua (1)

1737-1760: Logic correctly addresses the stuck-lock scenario; consider one small clarification.

The fix is sound:

  • hcs is a weak-valued table populated via table.insert(hcs, self) after start() (line 1821), so any visible checker already has its active flags correctly set, avoiding a TOCTOU on construction.
  • Deleting our own held key is safe — a competing worker's shm:add is atomic and will correctly take ownership on the next tick.
  • The predicate (healthy.active or unhealthy.active) matches the same flags consulted later at lines 1798 and 1806, so it stays in sync if stop()/start() toggle them.

A couple of small things worth noting (not blockers):

  1. After shm:delete(key) and the 10× back-off, if a user immediately calls start() on a new/re-enabled checker on this worker, health checks won't resume until the next tick (up to ~1s). That matches the PR's intent, but it may be worth a short comment so future readers don't assume sub-100ms responsiveness on re-enable.
  2. The early return also skips the stale-target cleanup block (lines 1765–1796). That's fine in practice (cleanup is moot when nothing is being checked), but it's another reason the back-off should be bounded — which it already is.
📝 Optional clarifying comment
         if get_periodic_lock(shm, key) then
           -- Check if there are any active checkers before renewing the lock.
           -- When all checkers have been stopped (active=false), the lock holder
           -- must release the lock so that other workers with active checkers
           -- can acquire it.
+          -- Note: after release we back off to CHECK_INTERVAL * 10, so a
+          -- freshly re-enabled checker on this worker may take up to one
+          -- back-off interval before active checks resume.
           local has_active_checker = false
           for _, checker_obj in pairs(hcs) do
             if checker_obj.checks.active.healthy.active or
                checker_obj.checks.active.unhealthy.active then
               has_active_checker = true
               break
             end
           end


📥 Commits

Reviewing files that changed from the base of the PR and between 3a6ddf0 and 00051bd.

📒 Files selected for processing (1)
  • lib/resty/healthcheck.lua


Copilot AI left a comment


Pull request overview

Fixes a multi-worker deadlock scenario where a worker with no active health checkers can keep renewing the shared periodic lock, preventing other workers with active checkers from running health checks.

Changes:

  • After acquiring PERIODIC_LOCK, detect whether any checker in hcs still has active healthy/unhealthy checks.
  • If none are active, release the lock from shm and back off the timer interval instead of renewing the lock.


Comment thread lib/resty/healthcheck.lua
Comment on lines +1742 to +1753
local has_active_checker = false
for _, checker_obj in pairs(hcs) do
  if checker_obj.checks.active.healthy.active or
     checker_obj.checks.active.unhealthy.active then
    has_active_checker = true
    break
  end
end
if not has_active_checker then
  shm:delete(key)
  active_check_timer.interval = CHECK_INTERVAL * 10
  return

Copilot AI Apr 17, 2026


The early return when has_active_checker is false skips the periodic cleanup section below (purging targets with purge_time set by delayed_clear). This can leave delayed-clear state in shm indefinitely whenever active checks are disabled/stopped across all checkers on the lock-holding worker. Consider running the cleanup logic before releasing the periodic lock (or making the “no active checker” condition account for pending purge_time work) so delayed_clear still completes even when active checks are turned off.

Comment thread lib/resty/healthcheck.lua
Comment on lines 1737 to 1756
if get_periodic_lock(shm, key) then
  -- Check if there are any active checkers before renewing the lock.
  -- When all checkers have been stopped (active=false), the lock holder
  -- must release the lock so that other workers with active checkers
  -- can acquire it.
  local has_active_checker = false
  for _, checker_obj in pairs(hcs) do
    if checker_obj.checks.active.healthy.active or
       checker_obj.checks.active.unhealthy.active then
      has_active_checker = true
      break
    end
  end
  if not has_active_checker then
    shm:delete(key)
    active_check_timer.interval = CHECK_INTERVAL * 10
    return
  end
  active_check_timer.interval = CHECK_INTERVAL
  renew_periodic_lock(shm, key)

Copilot AI Apr 17, 2026


This change fixes a multi-worker edge case around PERIODIC_LOCK ownership after checker:stop(), but there doesn’t appear to be a regression test covering the disable → re-enable flow described in the PR (one worker stops all checkers, another later creates/starts an active checker and must be able to acquire the periodic lock). Adding a 2-worker Test::Nginx case exercising that flow would help prevent this lock-stuck behavior from reappearing.
