Skip to content

tx: detect cycles in recursivelyCheckPageKeyOrderInternal#1195

Open
ivangsm wants to merge 1 commit intoetcd-io:mainfrom
ivangsm:tx/recursively-check-keyorder-cycles
Open

tx: detect cycles in recursivelyCheckPageKeyOrderInternal#1195
ivangsm wants to merge 1 commit intoetcd-io:mainfrom
ivangsm:tx/recursively-check-keyorder-cycles

Conversation

@ivangsm
Copy link
Copy Markdown
Contributor

@ivangsm ivangsm commented Apr 18, 2026

Summary

tx.Check() runs two recursive walkers from checkInvariantProperties:
forEachPage (fixed by #1194) and recursivelyCheckPageKeyOrder. The
second walker does not track visited pages, so a db with a branch page
that references one of its own ancestors still drives the goroutine stack
to overflow during a Check, even with #1194 applied.

This threads a visited map[common.Pgid]struct{} through the recursion.
On a revisit the function emits one page cycle detected at pgId:%d
error to the check channel and returns, so Check finishes and surfaces
any other diagnostics instead of crashing the process.

The accompanying whitebox test reproduces the corruption pattern from
#701 (copy a branch page over a leaf descendant to form a back-edge) and
asserts that the walker terminates with the expected error.

Partially addresses #581. Companion to #1194. Follows tjungblu's
follow-up ask in #701.

A third cycle-sensitive walker remains — the cursor-driven
b.ForEachBucket path inside recursivelyCheckBucket — and is being
tracked separately; it is not a mechanical patch and will get its own
issue/PR.

Test plan

  • go test -run TestTx_recursivelyCheckPageKeyOrder_CycleTerminates passes with the fix and stack-overflows without it.
  • TEST_FREELIST_TYPE=array BBOLT_VERIFY=all go test -short ./... passes.
  • TEST_FREELIST_TYPE=hashmap BBOLT_VERIFY=all go test -short ./... passes.
  • make fmt clean.

recursivelyCheckPageKeyOrderInternal walks branch children without tracking
pages it has already entered, so a corrupted db whose branch references one
of its own ancestors drives the goroutine stack to overflow even after
forEachPageInternal (etcd-io#1194) stops looping. Thread a visited set through the
recursion, emit a single "page cycle detected" error on revisit, and
return so the outer Check continues to surface other diagnostics.

Related to etcd-io#701 / etcd-io#581.

Signed-off-by: Iván Salazar <ivangio.salazar@gmail.com>
@k8s-ci-robot
Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ivangsm
Once this PR has been reviewed and has the lgtm label, please assign spzala for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Development

Successfully merging this pull request may close these issues.

2 participants