fix: release periodic lock when no active checkers exist #54
Baoyuantop wants to merge 1 commit into api7:master
Conversation
When all health checkers on a worker are stopped (`active=false`), the timer callback still acquires and renews the `PERIODIC_LOCK`, preventing other workers with active checkers from performing health checks. This causes health checks to become permanently stuck after a disable -> re-enable cycle:

1. Worker A holds the lock with active checkers
2. Health check is disabled: `checker:stop()` sets `active=false`
3. Worker A's timer keeps renewing the lock with no active checkers
4. Health check is re-enabled: Worker B creates a new checker
5. Worker B cannot acquire the lock -> health check never runs

Fix: After acquiring the periodic lock, check whether any active checkers exist. If none, release the lock (`shm:delete`) and return, allowing other workers to acquire the lock and run their active checkers.

Closes: apache/apisix#13235

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
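The failure sequence above can be illustrated with a small, hypothetical Python model (not the real lua-resty-healthcheck code): a plain dict stands in for the nginx shared dict, and `try_acquire` mimics the add-if-absent semantics of `shm:add`, where the current holder renews and everyone else fails.

```python
# Toy model of the stuck-lock scenario. shm is a stand-in for ngx.shared;
# KEY is a stand-in for the real PERIODIC_LOCK key name.
shm = {}
KEY = "periodic-lock"

def try_acquire(worker):
    holder = shm.get(KEY)
    if holder is None:
        shm[KEY] = worker      # shm:add succeeds only when the key is absent
        return True
    return holder == worker    # current holder renews; everyone else fails

assert try_acquire("A")        # worker A takes the lock with active checkers
assert not try_acquire("B")    # pre-fix: A keeps renewing, B is starved forever

# The fix: a holder with no active checkers deletes the key instead of
# renewing, so another worker's next add succeeds.
has_active_checker = False     # all of A's checkers were stopped
if not has_active_checker:
    del shm[KEY]               # models shm:delete(key)

assert try_acquire("B")        # worker B can now run its health checks
```

This is only a sketch of the ownership hand-off; the real code also backs off the timer interval after releasing, so the hand-off happens on a subsequent tick rather than immediately.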
📝 Walkthrough

The code adds conditional logic to the periodic active-check lock renewal process. It scans healthchecker instances to detect whether any active checks remain. If none are active, it deletes the lock, backs off the timer interval by 10x, and returns early. Otherwise, it maintains the normal renewal behavior.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
Pre-merge checks: ✅ 3 passed | ❌ 1 failed (1 warning)
🧹 Nitpick comments (1)
lib/resty/healthcheck.lua (1)
1737-1760: Logic correctly addresses the stuck-lock scenario; consider one small clarification.

The fix is sound:

- `hcs` is a weak-valued table populated via `table.insert(hcs, self)` after `start()` (line 1821), so any visible checker already has its `active` flags correctly set, avoiding a TOCTOU on construction.
- Deleting our own held key is safe — a competing worker's `shm:add` is atomic and will correctly take ownership on the next tick.
- The predicate (`healthy.active or unhealthy.active`) matches the same flags consulted later at lines 1798 and 1806, so it stays in sync if `stop()`/`start()` toggle them.

A couple of small things worth noting (not blockers):

- After `shm:delete(key)` and the 10× back-off, if a user immediately calls `start()` on a new/re-enabled checker on this worker, health checks won't resume until the next tick (up to ~1s). That matches the PR's intent, but it may be worth a short comment so future readers don't assume sub-100ms responsiveness on re-enable.
- The early return also skips the stale-target cleanup block (lines 1765–1796). That's fine in practice (cleanup is moot when nothing is being checked), but it's another reason the back-off should be bounded — which it already is.
📝 Optional clarifying comment

```diff
 if get_periodic_lock(shm, key) then
   -- Check if there are any active checkers before renewing the lock.
   -- When all checkers have been stopped (active=false), the lock holder
   -- must release the lock so that other workers with active checkers
   -- can acquire it.
+  -- Note: after release we back off to CHECK_INTERVAL * 10, so a
+  -- freshly re-enabled checker on this worker may take up to one
+  -- back-off interval before active checks resume.
   local has_active_checker = false
   for _, checker_obj in pairs(hcs) do
     if checker_obj.checks.active.healthy.active or
        checker_obj.checks.active.unhealthy.active then
       has_active_checker = true
       break
     end
   end
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@lib/resty/healthcheck.lua` around lines 1737 - 1760, Add a short clarifying comment inside the branch where we delete our lock and set active_check_timer.interval = CHECK_INTERVAL * 10 (the block that runs after get_periodic_lock(shm, key) and calls shm:delete(key)); state that deleting the key and applying the 10× back-off means newly started or re-enabled checkers on this worker may not resume until the next tick (up to ~1s), and note that this early return also skips the stale-target cleanup block so cleanup won't run while nothing is active. Reference the surrounding symbols get_periodic_lock, shm:delete, active_check_timer.interval and CHECK_INTERVAL, as well as the hcs list and start()/stop() behavior in the comment so future readers understand the trade-off and bounded back-off.
📒 Files selected for processing (1)
lib/resty/healthcheck.lua
Pull request overview
Fixes a multi-worker deadlock scenario where a worker with no active health checkers can keep renewing the shared periodic lock, preventing other workers with active checkers from running health checks.
Changes:

- After acquiring `PERIODIC_LOCK`, detect whether any checker in `hcs` still has active healthy/unhealthy checks.
- If none are active, release the lock from shm and back off the timer interval instead of renewing the lock.
```lua
local has_active_checker = false
for _, checker_obj in pairs(hcs) do
  if checker_obj.checks.active.healthy.active or
     checker_obj.checks.active.unhealthy.active then
    has_active_checker = true
    break
  end
end
if not has_active_checker then
  shm:delete(key)
  active_check_timer.interval = CHECK_INTERVAL * 10
  return
```
The early return when has_active_checker is false skips the periodic cleanup section below (purging targets with purge_time set by delayed_clear). This can leave delayed-clear state in shm indefinitely whenever active checks are disabled/stopped across all checkers on the lock-holding worker. Consider running the cleanup logic before releasing the periodic lock (or making the “no active checker” condition account for pending purge_time work) so delayed_clear still completes even when active checks are turned off.
```lua
if get_periodic_lock(shm, key) then
  -- Check if there are any active checkers before renewing the lock.
  -- When all checkers have been stopped (active=false), the lock holder
  -- must release the lock so that other workers with active checkers
  -- can acquire it.
  local has_active_checker = false
  for _, checker_obj in pairs(hcs) do
    if checker_obj.checks.active.healthy.active or
       checker_obj.checks.active.unhealthy.active then
      has_active_checker = true
      break
    end
  end
  if not has_active_checker then
    shm:delete(key)
    active_check_timer.interval = CHECK_INTERVAL * 10
    return
  end
  active_check_timer.interval = CHECK_INTERVAL
  renew_periodic_lock(shm, key)
```
This change fixes a multi-worker edge case around PERIODIC_LOCK ownership after checker:stop(), but there doesn’t appear to be a regression test covering the disable → re-enable flow described in the PR (one worker stops all checkers, another later creates/starts an active checker and must be able to acquire the periodic lock). Adding a 2-worker Test::Nginx case exercising that flow would help prevent this lock-stuck behavior from reappearing.
Problem

When all health checkers on a worker are stopped (`active=false`), the timer callback still acquires and renews the `PERIODIC_LOCK`, preventing other workers with active checkers from performing health checks.

This causes health checks to become permanently stuck after a disable → re-enable cycle:

1. Worker A holds the lock with active checkers
2. Health check is disabled: `checker:stop()` sets `active=false`
3. Worker A's timer keeps renewing the lock with no active checkers
4. Health check is re-enabled: Worker B creates a new checker
5. Worker B cannot acquire the lock → health check never runs

Root Cause

In the `expire` callback of `active_check_timer`, the lock is renewed unconditionally after acquisition, without checking whether the current worker has any active checkers. Meanwhile, `checker:stop()` only sets `active=false` but does not release the lock.

Fix

After acquiring the periodic lock, check whether there are any active checkers in the `hcs` table. If none exist, release the lock via `shm:delete(key)`, back off the timer interval to `CHECK_INTERVAL * 10`, and return, allowing other workers to acquire the lock.

Related

apache/apisix#13235
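As a sanity check, the "any active checker?" predicate described above can be modeled in Python. The nested field layout is hypothetical but mirrors the `checks.active.{healthy,unhealthy}.active` flags consulted by the Lua code.

```python
# Model of the predicate the patch adds: scan the hcs list and report
# whether any checker still has an active healthy or unhealthy probe.
def has_active_checker(hcs):
    for checker in hcs:
        active = checker["checks"]["active"]
        if active["healthy"]["active"] or active["unhealthy"]["active"]:
            return True
    return False

def make_checker(healthy, unhealthy):
    # Helper building a checker-shaped dict with the two probe flags.
    return {"checks": {"active": {"healthy": {"active": healthy},
                                  "unhealthy": {"active": unhealthy}}}}

stopped = [make_checker(False, False)]           # after checker:stop()
running = stopped + [make_checker(True, False)]  # one re-enabled checker

assert has_active_checker(stopped) is False  # lock should be released
assert has_active_checker(running) is True   # lock renewal is legitimate
```

Note that a single active flag on any checker keeps the lock, so the scan can break out early exactly as the Lua loop does.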