
fix(shard): defer shard lock unlock so panics inside eviction do not strand writers #421

Open

SAY-5 wants to merge 1 commit into allegro:main from SAY-5:fix/shard-defer-unlock-401

Conversation


@SAY-5 SAY-5 commented Apr 21, 2026

What

Fixes #401.

cacheShard.set and cacheShard.append acquired s.lock and released it with explicit s.lock.Unlock() calls along each return path. Any panic raised between the Lock() and the first Unlock() - for example the makeslice: len out of range the user hit inside readEntry / providedOnRemoveWithReason during an onEvict on a corrupted entry - unwound the stack with no deferred unlock in place and left the shard permanently write-locked. Every subsequent Set / Append / Delete on that shard's key space then blocked forever while the rest of the app kept running, which is the hang reported as 'cache seems deadlocked' in production.
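
For illustration, a minimal standalone sketch of the failure mode (plain sync.RWMutex, not bigcache internals): a recovered panic between Lock() and Unlock() leaves the mutex held and strands the next writer.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.RWMutex

	// Simulate the buggy path: Lock() with no deferred Unlock(), then a
	// panic that a caller recovers from. The mutex stays locked forever.
	func() {
		defer func() { _ = recover() }()
		mu.Lock()
		panic("makeslice: len out of range")
	}()

	acquired := make(chan struct{})
	go func() {
		mu.Lock() // every later writer parks here
		defer mu.Unlock()
		close(acquired)
	}()

	select {
	case <-acquired:
		fmt.Println("lock acquired (never happens)")
	case <-time.After(time.Second):
		fmt.Println("writer stranded: the shard-style lock was never released")
	}
}
```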

Fix

Acquire the lock at the top of each function and release it with defer, so that a panic inside the eviction / serialization path can still be caught by any higher-level recover while the shard is always unlocked as the goroutine unwinds. The now-redundant manual Unlock calls along the existing return sites are removed.
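
A sketch of the resulting shape, with cacheShard reduced to just the lock and the body of set elided (the real struct and function are of course larger):

```go
package bigcache

import "sync"

type cacheShard struct {
	lock sync.RWMutex
	// ...remaining fields elided for the sketch
}

// After the fix: the deferred Unlock runs even while a panic from the
// eviction/serialization work unwinds the stack, so the shard is released
// before any higher-level recover fires. Previously each return path called
// s.lock.Unlock() explicitly, and a panic in between skipped all of them.
func (s *cacheShard) set(key string, hashedKey uint64, entry []byte) error {
	s.lock.Lock()
	defer s.lock.Unlock()

	// ...hash-collision check, queue push, eviction callbacks (may panic)...
	return nil
}
```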

The runtime hot path is functionally unchanged: the defer adds a single deferred call per Set where there previously was none, which is negligible next to the map and queue work on either side.
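
For anyone who wants to measure that claim, a hypothetical micro-benchmark (not part of this PR) comparing the two unlock styles:

```go
package bigcache

import (
	"sync"
	"testing"
)

// lockExplicit mirrors the old shape: unlock called directly on the path.
func lockExplicit(mu *sync.RWMutex) {
	mu.Lock()
	mu.Unlock()
}

// lockDeferred mirrors the new shape: unlock deferred at function entry.
func lockDeferred(mu *sync.RWMutex) {
	mu.Lock()
	defer mu.Unlock()
}

func BenchmarkExplicitUnlock(b *testing.B) {
	var mu sync.RWMutex
	for i := 0; i < b.N; i++ {
		lockExplicit(&mu)
	}
}

func BenchmarkDeferredUnlock(b *testing.B) {
	var mu sync.RWMutex
	for i := 0; i < b.N; i++ {
		lockDeferred(&mu)
	}
}
```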

Verification

Locally on macOS, with Go 1.26.2:

  • gofmt -s -l shard.go: clean
  • go vet ./...: clean
  • go test -race -count=1 -short -run 'TestCacheSet|TestCacheGet|TestCacheLen|TestCacheCapacity|TestCacheInitialCapacity|TestCacheDel' ./...: pass

Note on TestCacheReset: this test fails on unmodified master as well (expected: int(1337), actual: int(725)), so it appears to be a pre-existing flake rather than a regression from this change. Happy to investigate separately if desired.

Closes #401

fix(shard): defer shard lock unlock so panics inside eviction do not strand writers

cacheShard.set and cacheShard.append acquired s.lock with explicit
s.lock.Unlock() calls along each return path. Any panic raised between
the Lock() and the first Unlock() - for example the makeslice
len-out-of-range the user hit inside readEntry/providedOnRemoveWithReason
during an onEvict on a corrupted entry (allegro#401) - unwound the stack
with no deferred unlock in place and left the shard permanently
write-locked. Every subsequent Set / Append / Delete on that shard's
key space then blocked forever while the rest of the app kept running,
producing a hang that only ever manifested as 'cache seems deadlocked'
in production.

Acquire the lock at the top of each function and release it with defer,
so that a panic inside the eviction or serialization path can still be
caught by any higher-level recover while the shard is always unlocked
as the goroutine unwinds. The now-redundant manual Unlock calls along
the existing return sites are removed. The runtime hot path is
functionally unchanged; the defer adds a single deferred call per Set
where there was none, negligible next to the map + queue work on
either side.

Closes allegro#401

Signed-off-by: SAY-5 <SAY-5@users.noreply.github.com>
Collaborator

@janisz janisz left a comment


Unfortunately it does not solve the panic, it only allows handling it gracefully.



Development

Successfully merging this pull request may close these issues.

A panic occurred, causing the lock to remain engaged. panic: makeslice: len out of range (#401)
