tx: detect cycles in forEachPageInternal by ivangsm · Pull Request #1194 · etcd-io/bbolt

ivangsm · 2026-04-18T07:51:06Z

Summary

tx.forEachPageInternal walks a branch page's children without tracking pages it has already entered, so a corrupted db whose branch references one of its own ancestors drives the goroutine stack to overflow. This follows up on the same class of bug that #1193 addresses in the surgeon XRay walker and that @tjungblu called out as a follow-up in #701.

This change threads a visited set through forEachPageInternal and bails on a revisit. fn still fires on the back-edge before we bail, so verifyPageReachable's "multiple references" diagnostic is emitted once per cycle — the walker just stops descending instead of recursing forever.

Related to #701 / #581. Two other recursive walkers in the check path (recursivelyCheckPageKeyOrderInternal and the cursor used by recursivelyCheckBucket) still loop on corrupted pages; they can be addressed as separate follow-ups.

Test plan

New whitebox test TestTx_forEachPage_CycleTerminates that uses surgeon.CopyPage to rewrite a leaf with its branch ancestor's content (so the leaf's child list references itself), then calls tx.forEachPage and asserts the walk is bounded.
Verified the test reproduces the bug on main (stack overflow after ~4M recursive frames).
Targeted tests pass with TEST_FREELIST_TYPE=array and TEST_FREELIST_TYPE=hashmap on ./ + ./internal/tests/... + ./internal/surgeon/... (Stats / Check / RecursivelyCheckPages / new cycle test).

forEachPageInternal walks a branch page's children without tracking pages it has already entered, so a corrupted db whose branch references one of its own ancestors drives the goroutine stack to overflow. Thread a visited set through the recursion and bail on a revisit; fn still fires on the back- edge so verifyPageReachable's "multiple references" diagnostic is emitted once per cycle. Related to etcd-io#701 / etcd-io#581. Signed-off-by: Iván Salazar <ivangio.salazar@gmail.com>

k8s-ci-robot · 2026-04-18T07:51:11Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ivangsm
Once this PR has been reviewed and has the lgtm label, please assign ahrtr for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

recursivelyCheckPageKeyOrderInternal walks branch children without tracking pages it has already entered, so a corrupted db whose branch references one of its own ancestors drives the goroutine stack to overflow even after forEachPageInternal (etcd-io#1194) stops looping. Thread a visited set through the recursion, emit a single "page cycle detected" error on revisit, and return so the outer Check continues to surface other diagnostics. Related to etcd-io#701 / etcd-io#581. Signed-off-by: Iván Salazar <ivangio.salazar@gmail.com>

tjungblu · 2026-04-20T06:05:20Z

+	// cannot drive this recursion to stack overflow. fn still runs above,
+	// so verifyPageReachable's "multiple references" diagnostic fires.
+	if _, ok := visited[pgid]; ok {
+		return


the question is what we're going to do here, reading from a corrupted db should definitely be surfaced.

I (personally) would rather side with a panic than changing the function interface to an error.

Good push — I audited the forEachPage callers and you're right that the soft-return is a blind spot for one of them:

tx_check.go:148 (Check): the fn is verifyPageReachable, which already emits page N: multiple references to the error channel on duplicate pgids (tx_check.go:163-165), and tx.check wraps its body in defer recover() that converts panics into panicked{} errors on the same channel (tx_check.go:39-43). Surfacing is fine here under either strategy.

bucket.go:621 (Bucket.Stats): the fn just accumulates counters and recursively re-enters sub-buckets via subStats.Add(b.openBucket(...).Stats()). On a revisit it silently inflates KeyN/LeafPageN/BranchPageN and double-enters sub-buckets. No recover on this path. With the current soft-return, bbolt stats on a cycle-corrupted db prints wrong numbers with no indication — exactly the silent behavior you're flagging.

A few ways forward, in increasing cost:

Panic on cycle (your suggestion). One line in forEachPageInternal. Check stays clean via the existing recover; Stats / bbolt stats crash loudly with the offending pgid. Simplest, matches your preference.

Panic + db.Logger().Errorf first. Same as (1), plus a log entry via the existing Logger interface so the cycle is recorded even if some caller recovers the panic. Marginal, but ~free.

Add Corrupted bool to BucketStats (additive, non-breaking). forEachPage would route the cycle through a sink that Stats reads, so the CLI could print a prominent warning alongside the (inflated) numbers instead of crashing. Strictly the most graceful UX, but it's more scope than the bug this PR is fixing — probably wants its own issue/PR.

My default would be (1), possibly (2). Happy to go straight to (3) in a follow-up if you'd rather keep this PR tight. Which direction do you want?

k8s-ci-robot added the size/M label Apr 18, 2026

ivangsm mentioned this pull request Apr 18, 2026

tx: detect cycles in recursivelyCheckPageKeyOrderInternal #1195

Open

4 tasks

tjungblu reviewed Apr 20, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

tx: detect cycles in forEachPageInternal#1194

tx: detect cycles in forEachPageInternal#1194
ivangsm wants to merge 1 commit intoetcd-io:mainfrom
ivangsm:tx/foreachpage-detect-cycles

ivangsm commented Apr 18, 2026

Uh oh!

k8s-ci-robot commented Apr 18, 2026

Uh oh!

tjungblu Apr 20, 2026

Uh oh!

ivangsm Apr 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

3 participants

Conversation

ivangsm commented Apr 18, 2026

Summary

Test plan

Uh oh!

k8s-ci-robot commented Apr 18, 2026

Uh oh!

tjungblu Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

ivangsm Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

3 participants