fix(shard): defer shard lock unlock so panics inside eviction do not strand writers#421
Open
SAY-5 wants to merge 1 commit intoallegro:mainfrom
Open
fix(shard): defer shard lock unlock so panics inside eviction do not strand writers#421SAY-5 wants to merge 1 commit intoallegro:mainfrom
SAY-5 wants to merge 1 commit intoallegro:mainfrom
Conversation
…strand writers cacheShard.set and cacheShard.append acquired s.lock with explicit s.lock.Unlock() calls along each return path. Any panic fired between the Lock() and the first Unlock() - for example the makeslice len-out-of-range the user hit inside readEntry/providedOnRemoveWithReason during an onEvict on a corrupted entry (allegro#401) - unwound through the deferred nothing and left the shard permanently write-locked. Every subsequent Set / Append / Delete on that shard's key space then blocked forever while the app kept running, producing a hang that only ever manifested as 'cache seems deadlocked' in production. Acquire the lock at the top of each function and release it with defer so that a panic inside the eviction or serialization path is still caught by any higher-level recover, but the shard is always unlocked as the goroutine unwinds. Removed the redundant manual Unlock calls along the existing return sites. Runtime hot path is functionally unchanged; the defer adds one mov/call per Set vs zero, negligible next to the map + queue work on either side. Closes allegro#401 Signed-off-by: SAY-5 <SAY-5@users.noreply.github.com>
janisz
reviewed
Apr 24, 2026
Collaborator
janisz
left a comment
There was a problem hiding this comment.
Unfortunately it does not solves the panic, it only allows to handle that gracefully.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Fixes #401.
cacheShard.setandcacheShard.appendacquireds.lockwith explicits.lock.Unlock()calls along each return path. Any panic fired between theLock()and the firstUnlock()- for example themakeslice: len out of rangethe user hit insidereadEntry/providedOnRemoveWithReasonduring anonEvicton a corrupted entry - unwound through the deferred nothing and left the shard permanently write-locked. Every subsequentSet/Append/Deleteon that shard's key space then blocked forever while the rest of the app kept running, producing the hang that manifests as 'cache seems deadlocked' in production.Fix
Acquire the lock at the top of each function and release it with
deferso that a panic inside the eviction / serialization path is still caught by any higher-level recover, but the shard is always unlocked as the goroutine unwinds. Removed the redundant manualUnlockcalls along the existing return sites.Runtime hot-path is functionally unchanged: the defer adds one mov/call per
Setcompared to zero, negligible next to the map + queue work on either side.Verification
Locally on macOS, go 1.26.2:
gofmt -s -l shard.go: cleango vet ./...: cleango test -race -count=1 -short -run 'TestCacheSet|TestCacheGet|TestCacheLen|TestCacheCapacity|TestCacheInitialCapacity|TestCacheDel' ./...: passNote on
TestCacheReset: this test fails on unmodified master as well (expected: int(1337) actual: int(725)) and looks pre-existing / flaky - definitely not a regression from this change. Happy to investigate separately if desired.Closes #401