Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
17 changes: 17 additions & 0 deletions lib/resty/healthcheck.lua
Original file line number Diff line number Diff line change
Expand Up @@ -1735,6 +1735,23 @@ function _M.new(opts)
expire = function()

if get_periodic_lock(shm, key) then
-- Check if there are any active checkers before renewing the lock.
-- When all checkers have been stopped (active=false), the lock holder
-- must release the lock so that other workers with active checkers
-- can acquire it.
local has_active_checker = false
for _, checker_obj in pairs(hcs) do
if checker_obj.checks.active.healthy.active or
checker_obj.checks.active.unhealthy.active then
has_active_checker = true
break
end
end
if not has_active_checker then
shm:delete(key)
active_check_timer.interval = CHECK_INTERVAL * 10
return
Comment on lines +1742 to +1753
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The early return when has_active_checker is false skips the periodic cleanup section below (purging targets with purge_time set by delayed_clear). This can leave delayed-clear state in shm indefinitely whenever active checks are disabled/stopped across all checkers on the lock-holding worker. Consider running the cleanup logic before releasing the periodic lock (or making the “no active checker” condition account for pending purge_time work) so delayed_clear still completes even when active checks are turned off.

Copilot uses AI. Check for mistakes.
end
active_check_timer.interval = CHECK_INTERVAL
renew_periodic_lock(shm, key)
Comment on lines 1737 to 1756
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change fixes a multi-worker edge case around PERIODIC_LOCK ownership after checker:stop(), but there doesn’t appear to be a regression test covering the disable → re-enable flow described in the PR (one worker stops all checkers, another later creates/starts an active checker and must be able to acquire the periodic lock). Adding a 2-worker Test::Nginx case exercising that flow would help prevent this lock-stuck behavior from reappearing.

Copilot uses AI. Check for mistakes.
else
Expand Down
Loading