
prefetcher: builder-phase prefetch + streaming worker pool#2192

Open
lucca30 wants to merge 12 commits into develop from lmartins/lmartins/builder-prefetch-mode-switch

Conversation

Contributor

@lucca30 lucca30 commented Apr 20, 2026

Purpose

Today's prefetcher only runs in a speculative idle phase: before block-building starts, it scans the tx pool and warms state for what it guesses the block might contain. Whatever it misses — late-arriving p2p txs, txs unlocked by freed gas, anything the pool view didn't reflect at guess-time — pays the full cache-miss cost at commit time. On contract-heavy blocks this is the tail of the miss-rate distribution.

This PR adds three more prefetch phases, synchronized to the live block build, so the prefetcher stops guessing and starts following what the builder is actually about to do. All three target exactly the gap the idle phase leaves: near-certain txs the builder will commit, warmed with near-zero speculation.

Prefetch lifecycle (after this PR)

One long-lived worker pool per block; four sequential tx providers feed it. The pool is never torn down between phases — the mode switch is a provider swap, not a prefetcher restart.

| Phase | When | Source of txs | What it warms | Precision character |
|---|---|---|---|---|
| 1. Idle | Pre-build, while buildAndCommitBlock is still assembling its environment | Pool snapshot, scanned by price-nonce heap with a gas-capped budget | Whatever idle can fit before the builder starts | Speculative — bounded by pool staleness |
| 2. Builder: upfront plan scan | Moment builder mode begins, once per commitTransactions pass (priority + normal) | sendPlan clones the price-nonce heap; buildTxPlan walks it with remainingGas() as budget, emits every non-prefetched tx that fits | The bulk of builder-phase prefetch — the full predicted block contents based on declared gas limits, minus whatever idle already warmed | Near-deterministic — same heap the builder will consume, filtered by the same gas arithmetic |
| 3. Builder: per-tx forwarding | During commitTransactions, right before each apply | Each tx forwarded through builderPlanCh as the builder reaches it | Last-mile catch for anything the upfront scan missed (late pool arrivals, prefetch-status changes in flight) | Deterministic — one channel hop ahead of the EVM |
| 4. Builder: freed-gas overflow | Continuously during commit, as each tx completes | ltx.Gas − actualUsed delivered via builderGasFreedCh; overflow heap scanned when budget accumulates | Bonus txs the builder promotes into the tail once freed gas accrues — including txs the upfront scan had to Pop() as "too large" | Semi-deterministic — tracks the same heap the builder promotes from |

All three builder phases dedup against prefetchedTxHashes and a sentThisPhase local set, so no tx is re-executed. The three-edge dedup matrix (plan↔overflow within iter, plan→overflow cross-iter, overflow→plan cross-iter) is fully closed.

Handoff sequence (idle → builder)

Coordinated by a two-signal interrupt to avoid pool teardown:

  1. Builder sets builderStarted → coordinator sets evmAbort.Store(true). In-flight idle EVM execution aborts via the EVM interrupt; workers entering the loop see the flag and skip.
  2. Coordinator drains any buffered idle txs from txsCh non-blockingly.
  3. evmAbort.Store(false). Workers resume, now fed by the builder provider (upfront-plan + per-tx + freed-gas overflow).

No duplicate prefetches, no lost builder txs, no worker pool churn.

Precision impact

New headline metric: worker/prefetch/builder_added_percent — fraction of a block's txs the builder phase (upfront plan + per-tx + overflow combined) warmed on its own, beyond what idle had already done. Attributes each prefetch completion to the phase it fired under, so operators can see at a glance how much of a block's cache warming idle alone could not have reached.

Also split: worker/txApplyDuration/{prefetched,notPrefetched} — quantifies the cache-miss penalty this lifecycle is closing.

Together these two metrics answer: how many txs needed builder-phase help, and how much apply-time that help saved.

Implementation notes

  • Streaming worker pool. Workers block on range txsCh (buffer 4096, ≈33 KB); phase switches change producer only, not pool.
  • Block-import path unchanged. Prefetch(block, ...) keeps its original signature as a thin wrapper around PrefetchStream. Same topology, same parallelism, same PrefetchResult shape.
  • Per-tx isolation unchanged. Each worker still calls statedb.Copy() — no shared state, no consensus-path coupling.
  • Complexity. diffguard: no new hot-path functions exceed the 10-cognitive bar; all flagged functions are pre-existing.

Validation

  • go build ./... + golangci-lint run clean
  • go test -race ./core/... ./miner/... — 121s, all prefetch + slow-tx tests pass. Includes 13 new unit tests on the pure primitives (scanOverflow, forwardTxs, collectPlanBatch, buildTxPlan, no-duplicate-forward invariant across 10k iterations) and 3 stream-lifecycle integration tests.
  • diffguard --base origin/develop — no new complexity violations.
  • Local kurtosis-pos devnet, 5 scenarios (baseline / builder-heavy / handoff-stress / panic-fidelity / dup-probe). Cross-scenario invariants held on all of them:
    • Zero duplicate forwards across 369 blocks with ~500 tx/block density.
    • processed == fails + successful in every PrefetchStream exit (2180+ closures).
    • Panic recovery: synthetic panic every 500 tx × 525K txs → chain_prefetch_worker_panic meter = 1050, harness log count = 1050 (exact match). Node kept producing blocks throughout.
    • pprof on a live node: zero leaked prefetch goroutines between blocks (sync.Once shutdown verified).
    • Full lifecycle log sequence (STREAM-START → IDLE-EXIT → HANDOFF-BEGIN/DRAIN/RESET → BUILDER-EXIT → STREAM-END) intact per block across all scenarios.
  • Ultra-review pass: one actionable finding (prepareWork early-return bypassing defer interruptPrefetch.Store(true)) — fixed.
  • 5 standard review rounds addressed; final 3 findings closed the cross-iteration dedup matrix.

lucca30 added 3 commits April 19, 2026 11:55
Replace per-call Prefetch() invocations with a long-running PrefetchStream
that spans the block lifetime. A fixed worker pool pulls transactions from
a channel, eliminating pool-startup overhead across the ~500 builder-mode
and ~20 idle-mode calls previously issued per block.

Idle and builder phases share the same stream, acting as swappable tx
providers. Phase handoff uses a two-signal interrupt: hardKill for
permanent exit and evmAbort for soft, per-phase aborts that discard
in-flight idle work before the builder provider takes over.

Prefetch(block, ...) keeps its original signature as a thin wrapper so
blockchain.go behavior stays byte-identical.
…mers, builder contribution metric

Correlate tail latency with prefetch miss rate so we can tell whether the
slowest block txs are the ones we failed to prefetch or whether prefetch
quality is not the lever.

Three additions:

  * slow_tx_tracker 10-min log: each of the top-K slowest txs is now
    annotated inline with MGas/s and a prefetched/not-prefetched flag.
    MGas/s uses integer math on nanoseconds — per-tx durations are in
    the tens of microseconds and float seconds lose precision.
  * worker/txApplyDuration/{prefetched,notPrefetched}: side-by-side p50/
    p95/p99 of the two populations. The aggregate worker/txApplyDuration
    stays for dashboard compatibility.
  * worker/prefetch/builder_added_percent: share of block txs that the
    builder phase prefetched, isolating the payoff of the new builder-
    phase prefetch beyond the existing overall miss-rate metric.
Sizes the buffer to the idle provider's theoretical per-loop burst
(IdleGasBudget / 21k ≈ 4762 at default config) so the scan never
blocks on a full channel. Workers drain at ~200k tx/s so the channel
sits mostly empty in practice; the buffer exists to absorb the brief
scan-phase burst where idle dumps a heap's worth of txs.

Memory cost: 4096 * 8 B + channel header ≈ 33 KB, negligible.
Handoff drain is O(actual-occupancy), not capacity, so a larger
buffer doesn't slow phase transitions.
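The sizing arithmetic works out as below. `idleGasBudget = 100M` is an assumed value chosen to land near the quoted ~4762 burst (integer division gives 4761); the commit does not state the constant:

```go
package main

import "fmt"

func main() {
	// Assumed: IdleGasBudget is not quoted in the commit; 100M gas
	// reproduces the ~4762 figure via 100e6 / 21000.
	const idleGasBudget = 100_000_000
	const minTxGas = 21_000 // intrinsic gas of a plain transfer
	fmt.Println(idleGasBudget / minTxGas) // 4761 -- theoretical per-loop burst

	// A channel of 4096 pointers costs 4096 * 8 B of payload plus the
	// channel header: roughly 33 KB.
	const bufLen, ptrSize = 4096, 8
	fmt.Println(bufLen * ptrSize) // 32768 bytes
}
```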

@claude claude Bot left a comment


Claude Code Review

This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.


Contributor Author

lucca30 commented Apr 20, 2026

@claude review once


claude Bot commented Apr 20, 2026

Claude finished @lucca30's task in 3m 52s


Code Review: prefetcher builder-phase prefetch + streaming worker pool

  • Read and analyze core/state_prefetcher.go changes
  • Read and analyze miner/worker.go changes
  • Read and analyze miner/ordering.go changes
  • Read and analyze miner/slow_tx_tracker.go changes
  • Read and analyze test files
  • Check concurrency safety (atomics, channels, goroutine lifecycle)
  • Check for resource leaks (goroutines, channels)
  • Check for DoS vectors (unbounded allocations, missing limits)
  • Post comprehensive review

Overall Assessment

This is a well-structured PR that introduces a builder-phase prefetch path alongside the existing idle-phase prefetcher. The architecture is clean: a single long-lived worker pool shared across phases, with a soft-interrupt (evmAbort) for phase transitions and a hard-interrupt (hardKill) for permanent exit. The code is thoroughly tested, channel lifecycle management is careful, and the backward-compatible Prefetch() wrapper is a good call.

No critical or high-severity issues found. One medium finding on metric attribution accuracy, and several low/nit observations below.


Medium

1. Race window in builder-phase metric attribution (worker.go:2306-2317)

The onSuccess callback checks builderStarted.Load() to decide whether to attribute a successful prefetch to the builder phase:

if genParams.builderPrefetchedTxHashes != nil &&
    genParams.builderStarted != nil && genParams.builderStarted.Load() {
    genParams.builderPrefetchedTxHashes.Store(hash, struct{}{})
}

The comment claims "The handoff (evmAbort + drainTxChan + EVM interrupt) guarantees no idle-phase tx's onSuccess fires after builderStarted flips." This isn't strictly true. There's a window between builderStarted.Store(true) (set in buildAndCommitBlock at line 2188) and evmAbort.Store(true) (set later in runPrefetcher at line 2338 after runIdleTxProvider returns). During this window:

  1. A worker already past the evmAbort check at line 173 is executing an idle-phase tx
  2. The EVM finishes before evmAbort is set (so EVM interrupt doesn't fire)
  3. onSuccess runs, sees builderStarted=true, and records the tx in builderPrefetchedTxHashes

Impact: Metric-only — builder_added_percent could be inflated by 1-2 txs per block. No correctness or security issue. The aggregate miss-rate metric (prefetchedTxHashes) is unaffected since the tx was genuinely prefetched.

Suggestion: Either update the comment to acknowledge the small attribution inaccuracy, or gate builder attribution on a separate flag set after evmAbort.Store(false) in the handoff sequence. Given the metric-only impact, updating the comment is probably sufficient.



Low / Nit

2. runWorker skips txs under evmAbort — subtle channel consumption semantics (state_prefetcher.go:169-177)

func (s *streamCtx) runWorker() {
    for tx := range s.txsCh {
        if s.hardKill != nil && s.hardKill.Load() {
            return
        }
        if s.evmAbort != nil && s.evmAbort.Load() {
            continue
        }
        s.processTx(tx)
    }
}

When evmAbort=true, workers consume txs from the channel and discard them. This is correct for the handoff (idle txs need to be consumed so they don't block the channel), and drainTxChan handles the bulk drain. But it does mean a tx can be "lost" — consumed but never processed. This is fine because prefetch is best-effort and the same txs will be re-sent by the builder provider if still relevant. No change needed, just noting for reviewers.

3. Non-blocking sends can silently drop plan txs (worker.go:2005-2009)

for _, tx := range buildTxPlan(clone, gasLimit, prefetchedHashes) {
    select {
    case ch <- tx:
    default:
    }
}

Also in forwardTxs (line 2533). If the 4096-deep channel is full, plan txs are silently dropped. This is acceptable given the buffer sizing rationale in the PR description (4096 ≈ full block at 21k gas), and prefetch is explicitly best-effort. But if block gas limits increase significantly in the future, this could silently degrade plan coverage. Consider a debug-level log or metric for dropped sends to aid future diagnosis.
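A sketch of the suggested diagnostic; the `droppedPlanSends` counter is hypothetical, not a metric that exists in the PR:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

type Tx struct{}

// droppedPlanSends is a hypothetical counter: keep the non-blocking
// send so the builder is never stalled, but make silent drops visible.
var droppedPlanSends atomic.Uint64

func sendPlanTxs(ch chan<- *Tx, plan []*Tx) {
	for _, tx := range plan {
		select {
		case ch <- tx:
		default:
			droppedPlanSends.Add(1) // best-effort: count instead of blocking
		}
	}
}

func main() {
	ch := make(chan *Tx, 1) // room for one; the second send drops
	sendPlanTxs(ch, []*Tx{{}, {}})
	fmt.Println(droppedPlanSends.Load()) // 1
}
```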

4. buildOverflowHeap takes a fresh pool snapshot (worker.go:2545-2551)

The overflow heap is built by querying Pending() again, which may include txs that arrived after fillTransactions took its snapshot. This means the overflow view can diverge from the builder's actual view. Not a bug (prefetch is speculative), but worth a brief comment noting the intentional staleness tradeoff.

5. txIndex as ordering hint is approximate (state_prefetcher.go:181)

idx := int(s.txIndex.Add(1) - 1)

Since workers pull from the channel concurrently, txIndex doesn't reflect the actual nonce/execution order — it's the order in which workers claim txs. This is fine because SkipNonceChecks = true (line 231) and the index is only used for SetTxContext. Worth noting that this is a deliberate design choice.

6. Minor: clone() shares signer and baseFee references (ordering.go:198-213)

The clone() method shares signer and baseFee with the original. Both are effectively immutable after creation (signer is stateless, baseFee is a uint256 value), so this is safe. No change needed.


What looks good

  • Two-signal interrupt design (hardKill + evmAbort) is clean and well-documented. The evmAbort as both a soft worker-level skip and an EVM interrupt is elegant.
  • Channel lifecycle is carefully managed: planWg.Wait() before close(builderPlanCh), close(builderGasFreedCh) before close(builderPlanCh), defer interruptPrefetch.Store(true) covering all exit paths.
  • Prefetch() backward compatibility — wrapping the new streaming path via a buffered channel + close is minimal and correct.
  • Test coverage is thorough: TestPrefetchStream_HardKillExits, TestPrefetchStream_EvmAbortSkipsAndResumes, TestPrefetchStream_BlockEquivalence directly exercise the core concurrency contracts. TestBuilderTxProvider_FiltersAlreadyPrefetched and TestBuilderTxProvider_FreedGasFeedback validate the builder provider in isolation.
  • Goroutine leak test (TestPrefetchFromPool_BuilderModeSwitch) with runtime goroutine counting is a practical safeguard.
  • collectPlanBatch batching with a 2ms window is a good approach to amortize overhead while keeping latency low.
  • Panic recovery in the prefetch goroutine (line 2156) prevents a prefetcher bug from crashing the node.
  • Metrics design — splitting txApplyDuration by prefetch status while preserving the aggregate is backward-compatible. The builder_added_percent metric directly measures the feature's value proposition.


…mports fixes

Review feedback: the old onSuccess closure checked genParams.builderStarted.Load()
to decide whether to attribute a successful prefetch to the builder phase. That
check races with buildAndCommitBlock's builderStarted.Store(true), which fires
before runPrefetcher reaches the handoff — an idle-phase tx whose EVM execution
finishes in the gap between those two moments would be miscounted as builder.
Impact was metric-only but the old comment overstated the guarantee.

Route attribution through a dedicated inBuilderPhase *atomic.Bool that the
coordinator flips to true only after the handoff completes (evmAbort drain +
reset). Any onSuccess firing after that point is known to come from
post-handoff work, so builder_added_percent now reflects genuine builder-phase
contribution.

Also fix goimports formatting in core/state_prefetcher.go and miner/worker.go
flagged by CI lint.

claude Bot commented Apr 20, 2026

Code review

Found 1 issue. Checked for bugs and CLAUDE.md compliance.


Bug: Spin loop in streamIdleBatch when channel buffer is full

File: miner/worker.go lines 2429-2442

When txsCh is full, the default case fires but txs.Shift() still executes unconditionally. Since the tx was not sent, it is not added to localPrefetched and gas is not subtracted from gaspool. This means nextViableIdleTx keeps finding viable transactions (gas budget never decreases), and the loop churns through the entire heap doing O(N log N) heap operations (Peek + Shift) without sending anything.

The comment at line 2412 says "Dropped sends (buffer full) are not retried" — but the current behavior is worse than not retrying: it burns through every remaining transaction doing useless work.

Suggested fix: Add return in the default case so the function stops when the channel is full.
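A minimal model of the fixed control flow; types and names are illustrative (the real `streamIdleBatch` walks the lazy-tx heap):

```go
package main

import "fmt"

// streamIdleBatch here models the fixed loop: heap is a slice of tx gas
// limits standing in for the price-nonce heap, gasBudget for the idle
// gas pool. Names and types are illustrative, not the PR's.
func streamIdleBatch(ch chan<- int, heap []int, gasBudget *int) (sent, scanned int) {
	for _, gas := range heap {
		if *gasBudget < gas {
			break // budget exhausted
		}
		scanned++
		select {
		case ch <- gas:
			*gasBudget -= gas // charge only on a successful send
			sent++
		default:
			// Fixed behavior: bail out on a full channel instead of
			// shifting through the rest of the heap dropping txs.
			return sent, scanned
		}
	}
	return sent, scanned
}

func main() {
	ch := make(chan int, 2)
	budget := 100_000
	sent, scanned := streamIdleBatch(ch, []int{21_000, 21_000, 21_000, 21_000}, &budget)
	fmt.Println(sent, scanned) // 2 3: the third tx hits the full buffer and we return
}
```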

Review feedback: when the stream channel fills mid-batch, the default
branch in streamIdleBatch was dropping the tx but still calling
txs.Shift() and continuing to walk the heap. Since dropped sends don't
subtract from the gas budget or populate localPrefetched, the outer
loop's viability check (nextViableIdleTx) kept returning valid txs,
so every remaining entry burned Peek + Shift cycles only to drop.

Return on the default branch instead. The outer runIdleTxProvider loop
re-snapshots the pool every ~100ms, by which time workers have drained
the channel and there's capacity for new sends. Dropping one tx per
batch and re-entering the outer loop is strictly cheaper than walking
the rest of the heap dropping everything.

claude Bot commented Apr 20, 2026

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

… resolves

Review feedback on buildTxPlan / scanOverflow / collectPlanBatch:

* buildTxPlan decremented remaining before calling ltx.Resolve(), so when a
  LazyTransaction failed to resolve (tx evicted between heap listing and
  resolution) the budget was consumed for a tx that never entered the plan.
  Reorder: charge gas only after Resolve succeeds, matching scanOverflow.
  The prefetched-skip path keeps consuming gas intentionally — those txs
  are still bound for the block.
* scanOverflow and collectPlanBatch called prefetchedHashes.Load() without
  a nil guard, while the sibling buildTxPlan explicitly guarded. Production
  is safe because commitWork always initializes the map, but the API
  inconsistency is a real footgun for unit tests or future callers.
  Add the missing guards so all three functions accept a nil *sync.Map as
  "no hashes known, don't skip".

No functional change in the current call paths; tests unchanged.
Contributor Author

lucca30 commented Apr 20, 2026

@claude Review it again. Do a careful analysis of the block-building lifecycle and how prefetch interacts with block building — whether any introduced code may lead to a panic, dead code, or a performance regression. This PR aims to widen the prefetch window by also prefetching while the block is building.

Some mechanisms were introduced, like buildTxPlan, to better predict the most probable txs to be included so we can improve our prefetch precision.


claude Bot commented Apr 20, 2026

Claude finished @lucca30's task in 9m 23s


Deep Review: Builder-Phase Prefetch + Block Building Lifecycle

  • Read and trace core/state_prefetcher.go changes
  • Read and trace miner/worker.go changes
  • Read miner/ordering.go changes
  • Read miner/slow_tx_tracker.go + test changes
  • Read miner/worker_test.go (new test coverage)
  • Trace the full block building → prefetch lifecycle end-to-end
  • Analyze for panic vectors
  • Analyze for dead code
  • Analyze for performance concerns
  • Post comprehensive review

Note: The three findings from the prior reviews (buildTxPlan gas-before-resolve, nil guards, idle batch spin loop) have all been properly fixed in commits fdf704d and 9a91340. This review is a fresh pass.


Block Building ↔ Prefetch Lifecycle Trace

Traced the full flow to verify correctness of the two-phase design:

commitWork()
  ├─ genParams{prefetchedTxHashes: &sync.Map{}, builderStarted: new(atomic.Bool)}
  │
  ├─ go runPrefetcher(parent, throwaway, genParams, &interruptPrefetch)
  │    │
  │    ├─ Creates txsCh (4096 buf), evmAbort, inBuilderPhase
  │    ├─ go PrefetchStream(header, throwaway, ..., hardKill=&interruptPrefetch, evmAbort, txsCh, onSuccess)
  │    │    └─ Workers: for tx := range txsCh { if hardKill→return; if evmAbort→skip; processTx(tx) }
  │    │
  │    ├─ PHASE 1: runIdleTxProvider(txsCh, ...)
  │    │    └─ Loop ~100ms: Pending() → heap → streamIdleBatch → non-blocking sends to txsCh
  │    │       Exits when: builderStarted=true OR hardKill OR gas budget exhausted
  │    │
  │    ├─ HANDOFF (if builderStarted):
  │    │    evmAbort.Store(true)   ← interrupts in-flight EVM + workers skip
  │    │    drainTxChan(txsCh)     ← empties buffered idle txs
  │    │    evmAbort.Store(false)  ← workers resume
  │    │    inBuilderPhase.Store(true) ← metric attribution gate
  │    │
  │    ├─ PHASE 2: runBuilderTxProvider(txsCh, ...)
  │    │    ├─ buildOverflowHeap() — fresh pool snapshot for freed-gas overflow
  │    │    └─ Loop: collectPlanBatch(planCh, gasFreedCh, 2ms) → scanOverflow → forwardTxs(txsCh)
  │    │       Exits when: planCh closes OR hardKill
  │    │
  │    └─ SHUTDOWN: evmAbort.Store(true); close(txsCh); <-streamDone
  │
  └─ buildAndCommitBlock(interrupt, noempty, genParams, &interruptPrefetch)
       ├─ Creates builderPlanCh (4096), builderGasFreedCh (256)
       ├─ builderStarted.Store(true)  ← triggers idle→builder handoff
       ├─ defer interruptPrefetch.Store(true)  ← kills prefetcher on ALL exit paths
       │
       ├─ fillTransactions(interrupt, work, genParams)
       │    ├─ sendPlan: clone heap → go buildTxPlan → non-blocking sends to builderPlanCh
       │    └─ commitTransactions: for each tx {
       │         apply(tx) → on success: freed_gas = ltx.Gas - actualUsed → builderGasFreedCh
       │       }
       │
       ├─ planWg.Wait()            ← sendPlan goroutines done
       ├─ close(builderGasFreedCh) ← prefetcher sees gas stream end
       ├─ close(builderPlanCh)     ← prefetcher exits builder loop
       └─ commit(block)

Lifecycle verdict: The sequencing is sound. Channel creation precedes builderStarted flip. Channel closes follow the correct order (gasFreedCh before planCh). The planWg.Wait() ensures no sends to a closed channel. The defer interruptPrefetch.Store(true) covers all exit paths including panics.


Panic Analysis

| Source | Path | Protected? | Severity |
|---|---|---|---|
| runPrefetcher goroutine | commitWork line 2161 | Yes — defer/recover + prefetchPanicMeter | Safe |
| sendPlan goroutine | fillTransactions line 2009 | No — but only does non-blocking channel sends; buildTxPlan is arithmetic + heap walks. Very low panic surface. | Low risk |
| PrefetchStream worker goroutines | state_prefetcher.go line 125 | No — no recover. A panic in processTx → prefetchOneTx → ApplyMessage kills the process. | See finding below |
| Nil deref on prefetchedHashes | buildTxPlan, scanOverflow, collectPlanBatch | Yes — all three now have nil guards (fixed in fdf704d) | Safe |
| Send to closed channel | txsCh, builderPlanCh, builderGasFreedCh | Yes — sequencing guarantees no concurrent senders at close time | Safe |
| Freed gas underflow | ltx.Gas - actualUsed, commitTransactions line 1571 | Yes — actualUsed = gasPoolBefore - gasPool.Gas() = msg.GasLimit - gasRemaining ≤ ltx.Gas always (gas pool charges full limit, refunds remaining) | Safe |

Findings

1. 🔴 Worker goroutines in PrefetchStream lack panic recovery — node crash on any EVM edge case

File: core/state_prefetcher.go lines 124-128

for i := 0; i < workers; i++ {
    go func() {
        defer pool.Done()
        ctx.runWorker()  // no recover()
    }()
}

Each worker calls processTx → prefetchOneTx → statedb.Copy() → ApplyMessage → full EVM execution → IntermediateRoot. An unrecovered panic in any of these operations (corrupted state object, unexpected nil in trie path, precompile edge case) kills the entire node process.

The parent runPrefetcher goroutine has a recover (line 2161), but Go's recover only catches panics in the goroutine that calls it — it does not protect child goroutines. The worker goroutines are separate goroutines.

This is a pre-existing issue (old Prefetch code had the same pattern), but this PR expands the surface: workers now run longer (streaming vs. single-batch), process builder-phase txs in addition to idle-phase txs, and run IntermediateRoot(true) on every tx. More time in workers = more exposure to edge-case panics.

Impact: Node crash from a prefetcher bug. The prefetcher operates on throwaway state copies, so any panic is by definition non-critical to correctness — it should be recovered, not fatal.

Suggested fix: Add defer recover() to the worker goroutine, matching the pattern used in commitWork's prefetcher goroutine:

go func() {
    defer pool.Done()
    defer func() {
        if r := recover(); r != nil {
            log.Error("Prefetch worker panicked", "err", r)
        }
    }()
    ctx.runWorker()
}()



2. 🟡 inBuilderPhase attribution has a residual edge case (IntermediateRoot tail)

File: miner/worker.go lines 2311-2353

The inBuilderPhase flag (replacing the racier builderStarted check from the prior review's finding) is a significant improvement. However, a narrow edge remains:

  1. evmAbort.Store(true) — interrupts in-flight EVM execution
  2. A worker was past ApplyMessage (EVM done) but inside IntermediateRoot(true) (not interruptible by evmAbort)
  3. drainTxChan(txsCh) + evmAbort.Store(false) + inBuilderPhase.Store(true) all execute while that worker is still in IntermediateRoot
  4. Worker's processTx completes → onSuccess fires → inBuilderPhase.Load() returns true → idle tx miscounted as builder

Impact: Metric-only — builder_added_percent could be inflated by at most 1 tx per block. The overall miss-rate metric (prefetchedTxHashes) is unaffected. The handoff itself (draining + abort) typically takes microseconds, so IntermediateRoot would need to span that entire window. In practice this is near-impossible.

Suggestion: A comment acknowledging this residual edge case would prevent future readers from assuming the attribution is strictly exact. No code change needed.


3. 🟡 sendPlan goroutines have no panic recovery

File: miner/worker.go lines 2009-2017

genParams.planWg.Add(1)
go func() {
    defer genParams.planWg.Done()
    for _, tx := range buildTxPlan(clone, gasLimit, prefetchedHashes) {
        select {
        case ch <- tx:
        default:
        }
    }
}()

buildTxPlan walks the cloned heap calling ltx.Resolve() and doing arithmetic. While the panic surface is small (no EVM, no state), Resolve() calls ltx.Pool.Get(ltx.Hash) which touches pool internals. An unrecovered panic here would crash the node.

The planWg.Wait() in buildAndCommitBlock (line 2234) would also hang forever if the goroutine panics before planWg.Done() fires — but defer genParams.planWg.Done() ensures it fires even on panic, so the WaitGroup is safe. The process crash is the real concern.

Suggested fix: Add defer func() { if r := recover(); r != nil { ... } }() inside the goroutine. Given buildTxPlan is pure computation (no IO, no state mutation), this is lower priority than finding #1.


4. 🟢 Performance analysis — no regressions found on the critical path

Additions to commitTransactions hot loop (per tx):

  • gasPoolBefore := env.gasPool.Gas() — one uint64 read
  • Freed gas calculation + non-blocking channel send — one subtraction + one select (skipped entirely when builderGasFreedCh == nil)
  • sync.Map.Load() for prefetch status — one atomic map read
  • slowTxTracker.Add() — O(1) fast-path reject, O(log 10) on accept

Additions to fillTransactions:

  • sendPlan: clone() is O(N) pointer copies (not deep copies), done synchronously. The buildTxPlan scan runs in a goroutine — does not block the builder.

Overall: The hot-path overhead is ~3 uint64 ops + one non-blocking select per tx. The heap clone is the heaviest addition but is O(N) pointer copies for N≈pool size, which is sub-microsecond for typical pools. No allocations in the hot loop beyond what already existed.

buildOverflowHeap calls Pending() once per block — this is a full pool snapshot. It's additional work vs. the pre-PR baseline, but equivalent to one iteration of the idle provider's loop. Acceptable.


5. 🟢 Dead code analysis — none found

All new functions are reachable:

  • PrefetchStream ← runPrefetcher + the Prefetch wrapper
  • buildTxPlan, scanOverflow, collectPlanBatch ← fillTransactions, runBuilderTxProvider
  • clone() ← sendPlan
  • streamIdleBatch, nextViableIdleTx, waitUntilNextLoop ← runIdleTxProvider
  • forwardTxs, buildOverflowHeap, drainTxChan ← runPrefetcher / runBuilderTxProvider
  • formatSlowTxs, mgasPerSecond ← flushSlowTxWindow

One minor waste: In the runPrefetcher path, PrefetchStream returns a PrefetchResult that is discarded. The successfulTxs slice is populated under txsMutex per successful tx, but never read. The useful output goes via the onSuccess callback. This is negligible overhead (mutex is uncontended in practice — workers spend orders of magnitude more time in EVM execution than in the lock).


What looks good

  • inBuilderPhase replaces direct builderStarted check — addresses the metric attribution race from the prior review with a proper post-handoff flag.
  • streamIdleBatch returns on full channel (fixed in 9a91340) — eliminates the spin-loop risk.
  • buildTxPlan gas deduction after Resolve() (fixed in fdf704d) — gas budget is no longer consumed for evicted txs.
  • All three *sync.Map consumers have nil guards (fixed in fdf704d) — consistent API contract.
  • Channel lifecycle is rigorous: planWg.Wait()close(gasFreedCh)close(planCh) ordering prevents all send-to-closed-channel panics. The defer interruptPrefetch.Store(true) covers all exit paths.
  • Two-signal interrupt (hardKill + evmAbort) is clean. evmAbort serving dual duty as both worker-level skip and EVM interpreter interrupt is elegant and reduces synchronization complexity.
  • Freed gas arithmetic is underflow-safe: ltx.Gas - actualUsed where actualUsed = msg.GasLimit - gasRemaining ≤ msg.GasLimit = ltx.Gas.
  • Concurrent clone() + commitTransactions is safe: the cloned heap shares LazyTransaction pointers with the original, but Resolve() is read-only (no caching write). Heap operations only rearrange slices, not pointed-to structs.
  • Test coverage is thorough: TestPrefetchStream_HardKillExits, _EvmAbortSkipsAndResumes, _BlockEquivalence directly exercise the core concurrency contracts. The goroutine-leak test (TestPrefetchFromPool_BuilderModeSwitch) with runtime goroutine counting is a practical safeguard.

Summary

The design is sound and well-engineered. The prior review's findings have all been addressed. I found one actionable issue: PrefetchStream worker goroutines lack panic recovery (finding #1), which is pre-existing but worth fixing given the expanded worker lifetime in this PR. The metric attribution edge case (finding #2) is theoretical-only. No dead code, no performance regressions on the critical path, and the channel lifecycle management is careful throughout.

--- | Branch: lmartins/lmartins/builder-prefetch-mode-switch

@codecov

codecov Bot commented Apr 20, 2026

Codecov Report

❌ Patch coverage is 91.04991% with 52 lines in your changes missing coverage. Please review.
✅ Project coverage is 52.10%. Comparing base (3193054) to head (d5fd8f8).
⚠️ Report is 5 commits behind head on develop.

Files with missing lines Patch % Lines
core/state_prefetcher.go 80.45% 20 Missing and 6 partials ⚠️
miner/worker.go 93.76% 21 Missing and 5 partials ⚠️
Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #2192      +/-   ##
===========================================
+ Coverage    51.93%   52.10%   +0.17%     
===========================================
  Files          884      884              
  Lines       155494   155891     +397     
===========================================
+ Hits         80754    81227     +473     
+ Misses       69522    69456      -66     
+ Partials      5218     5208      -10     
Files with missing lines Coverage Δ
core/blockchain.go 62.20% <ø> (ø)
miner/ordering.go 94.54% <100.00%> (+0.86%) ⬆️
miner/slow_tx_tracker.go 73.61% <100.00%> (+16.71%) ⬆️
core/state_prefetcher.go 80.98% <80.45%> (-10.59%) ⬇️
miner/worker.go 75.87% <93.76%> (+4.01%) ⬆️

... and 22 files with indirect coverage changes



Review feedback: worker goroutines in PrefetchStream ran without their own
recover. Go's recover only catches panics in its own goroutine — the parent
runPrefetcher's recover does not protect the children. Any panic in
statedb.Copy / ApplyMessage / IntermediateRoot / precompile edge cases
would therefore kill the node even though prefetching runs on a throwaway
state copy and is best-effort by definition.

Add defer/recover per worker (new chain/prefetch/worker/panic meter) and
match it on the sendPlan goroutine in fillTransactions (reuses the existing
worker/prefetch/panic meter).

Also documents the residual attribution edge where a worker stuck in
IntermediateRoot (not interruptible by evmAbort) can still reach onSuccess
after inBuilderPhase flips. Metric-only impact, bounded by at most one tx
per block; added a comment instead of a runtime fix since the window is
tiny in practice and closing it would require a more invasive barrier.

@claude claude Bot left a comment


The block building lifecycle looks correct: channel setup happens before fillTransactions runs, planWg.Wait() before channel close prevents sends-after-close, and the defer on interruptPrefetch.Store(true) covers all exit paths. No panics, dead code, or correctness regressions found. The two new nit-level issues found by this review (gas limit overestimation in the second sendPlan call, and a metrics overcount for interrupted txs) are attached as inline comments — they affect only prefetch efficiency and metric accuracy, not block building correctness. The spin loop and nil-guard issues from prior reviews remain unaddressed and worth fixing before merge.

Extended reasoning...

Overview

This PR introduces a builder-phase prefetch path over ~600 lines across core/state_prefetcher.go, miner/worker.go, miner/ordering.go, core/blockchain.go, and tests. The review focused on the block building lifecycle, potential panics, dead code, and performance regressions as requested.

Block Building Lifecycle

The sequencing is correct: buildAndCommitBlock creates builderPlanCh and builderGasFreedCh before calling fillTransactions, ensuring the prefetcher goroutine always finds a valid channel. planWg.Wait() before channel close prevents any send-to-closed-channel race from the sendPlan goroutines. The defer interruptPrefetch.Store(true) covers all exit paths from runPrefetcher. No deadlock or goroutine leak paths were found.

New Issues Found (Nit-Level)

Two new nit-level bugs are attached as inline comments: (1) the second sendPlan call uses the full block gas limit rather than the remaining gas after priority txs, causing buildTxPlan to over-plan normal-tier txs that will be rejected — wasting prefetch worker capacity; (2) txIndex is incremented in processTx before prefetchOneTx checks the interrupt flag, so early-abort txs inflate blockPrefetchTxsValidMeter. Both are metrics/efficiency issues only; block building correctness is unaffected.

Security Risks

No security risks. The prefetch path operates on throwaway state copies and cannot affect committed block state. All channels are properly bounded (4096 buffer) with non-blocking sends, preventing DoS from a stuck consumer.

Level of Scrutiny

This is a production-critical path in the Bor block builder. The streaming worker pool, two-signal interrupt scheme, and idle→builder handoff are novel concurrency patterns that warrant human review before merge, especially given several issues identified across multiple review passes that remain unaddressed (spin loop in streamIdleBatch, nil guards in the *sync.Map consumers, gas ordering in buildTxPlan).

…cution

Round 4 of review fixes — four independent issues:

* sendPlan gasLimit stale for the second call. The closure captured
  env.header.GasLimit once, but the second invocation (normal-tier txs)
  runs after commitTransactions has already consumed gas from env.gasPool.
  Plan was over-sized, wasting prefetch capacity on txs that the builder
  would reject. Thread gasLimit as an explicit argument and pass
  env.gasPool.Gas() on the second call (with a nil guard for the first,
  where env.gasPool is still nil).

* prefetchOneTx early-interrupt return skipped fails.Add(1). txIndex is
  incremented unconditionally in processTx, so every interrupt-aborted tx
  was counted as a successful prefetch in blockPrefetchTxsValidMeter. The
  function docstring already promises fails is bumped on every (0,false)
  return; the interrupt path was the one branch that didn't.

* runPrefetcher shutdown (evmAbort + close(txsCh) + <-streamDone) was
  sequential, so a panic in runIdleTxProvider or runBuilderTxProvider
  unwound past it. The PrefetchStream goroutine and its N workers would
  then block forever on `range txsCh` (hardKill is only checked after a
  dequeue). Wrap the close in a sync.Once-guarded closure and defer it
  so the channel is released on every exit path, panics included.

* Builder phase could double-execute an in-flight plan tx. scanOverflow
  checked prefetchedHashes, which is only written by onSuccess after EVM
  completion — a tx forwarded in one batch but still mid-EVM when the next
  batch's overflow scan ran could be re-emitted and picked up by a second
  worker. Track locally-sent hashes in runBuilderTxProvider (single-
  threaded, plain map is fine) and have scanOverflow consult it alongside
  prefetchedHashes.
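The sync.Once-guarded, deferred close described in the third bullet can be sketched as follows — newShutdown and run are hypothetical stand-ins for the PR's runPrefetcher shutdown path, kept only to show the pattern:

```go
package main

import (
	"fmt"
	"sync"
)

// newShutdown returns an idempotent closer for txsCh. Deferring the
// returned func guarantees the channel is released on every exit path,
// panics included; sync.Once makes an explicit call plus the deferred
// call safe together (no double-close panic).
func newShutdown(txsCh chan int) func() {
	var once sync.Once
	return func() {
		once.Do(func() { close(txsCh) })
	}
}

func run(txsCh chan int) (err error) {
	shutdown := newShutdown(txsCh)
	defer shutdown() // runs last: channel released even though we panic below
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("provider panicked: %v", r)
		}
	}()
	panic("provider blew up mid-phase") // simulated tx-provider panic
}

func main() {
	txsCh := make(chan int, 4)
	fmt.Println(run(txsCh)) // provider panicked: provider blew up mid-phase
	_, open := <-txsCh
	fmt.Println(open) // false: workers ranging over txsCh unblock and exit
}
```

Without the deferred close, workers blocked in `range txsCh` would leak forever once the provider goroutine unwinds — exactly the failure the commit message describes.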
lucca30 added 2 commits April 22, 2026 16:58
- scanOverflow: break instead of Pop on gas-too-large so accumulated
  extendedBudget can cover price-leading accounts on later iterations;
  popping permanently evicted them from the long-lived overflow heap.
- runBuilderTxProvider: populate sentThisPhase with the current plan
  batch before scanOverflow so bonus collection can't re-emit a tx
  already returned by collectPlanBatch in the same window.
- PrefetchStream worker recover: increment ctx.fails so the
  {valid,invalid} meters stay consistent with txIndex when a panic
  aborts prefetchOneTx mid-flight.
Covers the helpers that prior review rounds kept surfacing bugs in:

- scanOverflow: zero budget, budget growth preserves popped accounts
  (regression test for the h.Pop() → break fix), in-flight skip
  without budget cost, prefetched skip without budget cost.
- forwardTxs: sent-hash recording, dropped-on-full channel must not
  record, nil sentThisPhase is safe.
- collectPlanBatch: closed planCh, timer-fires-on-empty, freed-gas
  accumulation, prefetched filter, closed gas channel nilled out.
- runBuilderTxProvider: no-duplicate-forward invariant across
  interleaved planCh + freed-gas signals — regression test for the
  within-iteration dedup fix (sentThisPhase pre-population).
Comment on lines +2388 to +2410
    defer shutdown()

    // Phase 1: idle tx provider — streams pool txs until builder flips or hardKill fires.
    w.runIdleTxProvider(txsCh, header, genParams, hardKill)

    // Phase 2: builder tx provider, if we actually switched modes.
    if genParams.builderStarted != nil && genParams.builderStarted.Load() && !hardKill.Load() {
        // Handoff: abort in-flight idle work and drain buffered idle txs so only
        // builder txs reach the pool from here on. Then clear abort and run builder.
        // Any in-flight idle EVM execution aborts via evmAbort; workers finish their
        // current tx quickly (IntermediateRoot is the only non-interruptible work)
        // and move on. Workers that pick up a drained-but-not-gone tx see evmAbort=true
        // and skip it.
        evmAbort.Store(true)
        drainTxChan(txsCh)
        evmAbort.Store(false)
        // Flip phase attribution only after the handoff is complete. From here
        // on, every successful prefetch is genuinely builder-phase work.
        inBuilderPhase.Store(true)

        w.runBuilderTxProvider(txsCh, header, genParams, hardKill)
    }
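For reference, the drain step in the handoff above relies on a non-blocking drain loop. A minimal sketch of the pattern — drainPending is illustrative, not the PR's exact drainTxChan:

```go
package main

import "fmt"

// drainPending empties whatever is currently buffered in ch without
// blocking: once the buffer is empty the default arm fires and we stop.
// Concurrent workers receiving from the same channel are safe — channel
// receives are atomic, so the drainer and a worker never get the same item.
func drainPending(ch <-chan int) (drained int) {
	for {
		select {
		case <-ch:
			drained++
		default:
			return drained
		}
	}
}

func main() {
	ch := make(chan int, 8)
	for i := 0; i < 5; i++ {
		ch <- i
	}
	fmt.Println(drainPending(ch)) // 5
	fmt.Println(len(ch))          // 0
}
```

This is why the evmAbort=true window matters: an item a worker grabs concurrently with the drain is not lost silently — the worker sees the abort flag and skips it.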



🔴 Phase 2 of runPrefetcher is gated by a one-shot genParams.builderStarted.Load() check (worker.go:2394) performed immediately after runIdleTxProvider returns. Because runIdleTxProvider's shouldExit() returns true on gas-pool exhaustion (worker.go:2451), and with the default PrefetchGasLimitPercent=100 streamIdleBatch can drain the full header gas budget in a single sub-ms iteration of non-blocking sends, the idle provider routinely exits while buildAndCommitBlock is still blocked inside prepareWork's Bor primary-producer wait (<-time.After(delay), typically 1-2s). In that ordering the Phase 2 guard reads false, runBuilderTxProvider is skipped entirely, and the subsequent plan / freed-gas sends silently drop — defeating the PR's headline builder-phase prefetch feature exactly under the busy-pool conditions it targets. Block correctness is unaffected (prefetch is best-effort), but worker/prefetch/builder_added_percent is driven to 0% on affected blocks. Fix: replace the one-shot check with a bounded block-select on builderStarted + hardKill, OR change runIdleTxProvider to stall (not exit) on gas exhaustion so the budget is a soft rate limit, not a termination signal.

Extended reasoning...

What the bug is

runPrefetcher runs two phases in sequence. Phase 1 (runIdleTxProvider, worker.go:2438) pushes speculative pool txs into txsCh. Phase 2 (runBuilderTxProvider, worker.go:2394) reads the builder's plan + freed-gas signals. Phase 2 is gated by a one-time non-blocking builderStarted.Load() check:

// worker.go:2391-2409
w.runIdleTxProvider(txsCh, header, genParams, hardKill)

if genParams.builderStarted != nil && genParams.builderStarted.Load() && !hardKill.Load() {
    // handoff + runBuilderTxProvider
}

The problem is the exit conditions of Phase 1 (shouldExit, worker.go:2448-2452):

shouldExit := func() bool {
    return interrupt.Load() ||
        (genParams.builderStarted != nil && genParams.builderStarted.Load()) ||
        totalGasPool.Gas() == 0
}

Any one of three independent conditions ends Phase 1 — including gas exhaustion with no builder-start signal at all.

The race

  • runPrefetcher calls makeHeader(genParams, false) at worker.go:2329 (waitOnPrepare=false) — it starts immediately.
  • buildAndCommitBlock calls prepareWork → makeHeader(genParams, true) → engine.Prepare. For Bor under Giugliano with successionNumber=0 (primary producer), Prepare does <-time.After(delay) at consensus/bor/bor.go:1156-1168, waiting for the block-time slot — typically 1-2 seconds. builderStarted.Store(true) only fires in buildAndCommitBlock after prepareWork returns.
  • With default PrefetchGasLimitPercent=100, totalGasPool is seeded with the full header.GasLimit (~30M gas on Polygon PoS). streamIdleBatch walks the heap doing non-blocking select-sends into a 4096-buffer channel, calling totalGasPool.SubGas(ltx.Gas) each time (worker.go:2498-2513). With a populated pool of ~21k-gas txs, a single loop iteration does ~1428 sends in milliseconds and zeroes the gas pool.
  • shouldExit() returns true on totalGasPool.Gas()==0; waitUntilNextLoop sees this immediately and returns; the outer for !shouldExit() loop exits.
  • runIdleTxProvider returns while builderStarted is still false.
  • The one-shot guard at worker.go:2394 reads false → Phase 2 is skipped entirely. shutdown() closes txsCh; workers drain and exit; <-streamDone releases; the prefetcher is gone.

What happens next (silent drops)

Shortly after, buildAndCommitBlock creates builderPlanCh (buf 4096) and builderGasFreedCh (buf 256), flips builderStarted=true, then runs fillTransactions. The sendPlan goroutines do non-blocking sends onto builderPlanCh; commitTransactions does non-blocking sends of freed gas onto builderGasFreedCh. No one is reading either channel. Sends accumulate into the 4096 buffer, then drop silently once full. Zero builder-phase prefetch coverage for the block.

Step-by-step proof

  1. commitWork launches the runPrefetcher goroutine, then synchronously calls buildAndCommitBlock → prepareWork.
  2. buildAndCommitBlock blocks in engine.Prepare at consensus/bor/bor.go:1166: <-time.After(delay) — ~1-2s for primary-producer block-time alignment.
  3. runPrefetcher goroutine: makeHeader(..., false) returns immediately (no wait). runIdleTxProvider starts with totalGasPool = 30_000_000.
  4. streamIdleBatch: loopGasLimit = min(30M, 30M) = 30M. Non-blocking send loop; for each successful send, totalGasPool.SubGas(21_000). After ~1428 iterations, totalGasPool.Gas() == 0. Elapsed: low ms.
  5. waitUntilNextLoop first iteration: shouldExit() returns true (totalGasPool.Gas()==0); returns. Outer for !shouldExit() false → runIdleTxProvider returns.
  6. Still in prepareWork's time.After(delay), so builderStarted is false.
  7. Line 2394: genParams.builderStarted.Load() == false → Phase 2 skipped.
  8. shutdown(): evmAbort.Store(true), close(txsCh), <-streamDone. runPrefetcher returns.
  9. Some time later, prepareWork returns in the main goroutine. buildAndCommitBlock creates builderPlanCh/builderGasFreedCh, sets builderStarted=true, runs fillTransactions.
  10. sendPlan goroutines push to builderPlanCh — no reader. commitTransactions' freed-gas sends push to builderGasFreedCh — no reader. Buffers fill; further sends hit the non-blocking select's default arm and drop.
  11. Block commits with zero builder-phase prefetch coverage. worker/prefetch/builder_added_percent = 0% for the block.

Why existing mitigations don't save it

The only thing that could save it is if workers drain txsCh so slowly during step 4 that streamIdleBatch fills the 4096 buffer and returns early before exhausting the budget, then waitUntilNextLoop sleeps 100ms checking every 10ms, during which builderStarted might flip. That is a load-dependent escape: it requires workers to drain slower than the idle send rate. On a busy pool with normal workers, the send loop outpaces worker drain comfortably, making the bug the common path, not the exception.

Impact

Block-building correctness: unaffected (prefetch is best-effort). But the PR's headline metric worker/prefetch/builder_added_percent is driven to 0% exactly in the scenarios the PR is designed for — primary producer (succession=0) post-Giugliano under a busy pool with default config. The builder-phase prefetch feature, which is the entire value proposition of this PR, is silently defeated whenever this ordering wins. The verifier observes that the devnet measurements reported in the PR description (3-5% steady-state) already reflect this: that number is the ceiling under the current design, not the floor.

How to fix

Any of:

  1. Phase 2 guard becomes a block-select with hardKill as escape, with a short grace deadline:
    select {
    case <-builderStartedCh: // turn builderStarted into a signalling channel or poll with timer
    case <-hardKillSignal:
        goto shutdown
    case <-time.After(blockBuildDeadline):
        goto shutdown
    }
  2. Drop gas exhaustion from shouldExit — treat the gas budget as a soft rate limit. When exhausted, stall runIdleTxProvider with a short sleep and re-check builderStarted/hardKill instead of returning.
  3. Shrink the default idle gas budget so exhaustion is unusual (e.g. 10-20% of header gas), matching the PR's own tight-cap experiments that show this effect inverting.

Comment on lines +2596 to +2613
    if extendedBudget > 0 {
        // Mark the plan batch as in-flight before the overflow scan so
        // scanOverflow won't re-emit the same tx within this iteration
        // (collectPlanBatch returns before forwardTxs records hashes).
        for _, tx := range batch {
            sentThisPhase[tx.Hash()] = struct{}{}
        }
        var bonus []*types.Transaction
        bonus, extendedBudget = scanOverflow(overflowHeap, extendedBudget, genParams.prefetchedTxHashes, sentThisPhase)
        batch = append(batch, bonus...)
    }

    block := types.NewBlock(header, &types.Body{Transactions: transactions}, nil, trie.NewStackTrie(nil))
    result := prefetcher.Prefetch(block, throwaway, w.vmConfig(), true, interruptPrefetch)
    forwardTxs(txsCh, batch, sentThisPhase)

    // Use the actual gas used from prefetch result and mark successful transactions
    if result != nil {
        totalGasPool.SubGas(result.TotalGasUsed)
        for _, txHash := range result.SuccessfulTxs {
            txsAlreadyPrefetched[txHash] = struct{}{}
            // Store in shared map for coverage metrics
            if genParams.prefetchedTxHashes != nil {
                genParams.prefetchedTxHashes.Store(txHash, struct{}{})
            }
            if builderDone || interrupt.Load() {
                return
            }
        }


🟡 In runBuilderTxProvider (miner/worker.go:2596-2608), the pre-populate of sentThisPhase at lines 2600-2602 and the remaining -= ltx.Gas; h.Shift() inside scanOverflow at lines 1948-1950 both mutate persistent state as if forwardTxs will succeed, but forwardTxs at lines 2619-2628 silently drops on a full txsCh buffer. Two consequences: (1) a dropped plan tx leaves a stale sentThisPhase entry that causes future scanOverflow iterations to h.Shift() the tx out of the overflow heap, making it permanently unreachable for overflow prefetch; (2) a dropped overflow (bonus) tx is doubly lost — shifted out of the one-shot overflowHeap, its gas subtracted from extendedBudget, yet never recorded anywhere. Impact is metric-only (reduces builder_added_percent); no block-building correctness, consensus, or security impact. Fix: only mutate sentThisPhase/extendedBudget/heap on a successful send, or use a transient per-iteration set for within-iteration dedup while refunding budget and deferring h.Shift() on drop.

Extended reasoning...

What the bug is

runBuilderTxProvider mutates three pieces of persistent state as if every forwardTxs call succeeds, but forwardTxs (worker.go:2619-2628) is non-blocking and silently drops when txsCh (buffer 4096) is full:

func forwardTxs(ch chan<- *types.Transaction, txs []*types.Transaction, sentThisPhase map[common.Hash]struct{}) {
    for _, tx := range txs {
        select {
        case ch <- tx:
            if sentThisPhase != nil {
                sentThisPhase[tx.Hash()] = struct{}{}
            }
        default:   // <-- silent drop
        }
    }
}

The upstream mutations that leak when a send is dropped are:

  1. Plan path (worker.go:2600-2602): sentThisPhase is pre-populated with every tx in batch before scanOverflow or forwardTxs runs. If a plan tx is later dropped by forwardTxs, the stale entry persists across iterations (sentThisPhase is declared outside the for loop at line 2587).
  2. Overflow path (worker.go:1948-1950): scanOverflow executes remaining -= ltx.Gas; bonus = append(...); h.Shift() before forwardTxs runs. Drop → tx is gone from overflowHeap, extendedBudget was decremented, tx is in no tracking set.

Why existing code does not prevent it

sentThisPhase is only supposed to track in-flight hashes (a semantic the code even states in comments at 2581-2586), but the pre-populate at 2600-2602 adds every batch member unconditionally — conflating scheduled with in-flight. On a txsCh-full drop, the stale entry becomes an incorrect cross-iteration exclusion. For the overflow path, overflowHeap is built once at line 2576 and never refilled, so h.Shift() removal is permanent.

Concrete step-by-step proof — stale sentThisPhase leak

  1. Iteration N: batch=[T1, T2] from collectPlanBatch; extendedBudget > 0.
  2. Pre-populate adds {T1, T2} to sentThisPhase.
  3. scanOverflow returns empty bonus (nothing more fits).
  4. forwardTxs([T1, T2]): T1 sent OK; channel buffer now full; T2 hits the default arm and is dropped. Only T1's hash is (re)recorded in sentThisPhase — T2 was already added by the pre-populate.
  5. Workers drain the channel.
  6. Iteration N+1: collectPlanBatch returns empty batch; extendedBudget still > 0 (or grows via gasFreedCh).
  7. scanOverflow walks overflowHeap; T2 is at (or near) the top since it remains in the pool.
  8. The sentThisPhase[T2] check at line 1932 hits → h.Shift() → T2 is removed from overflowHeap.
  9. T2 is never prefetched in this block cycle: not in prefetchedHashes (never executed), gone from overflowHeap, gone from planCh (already consumed).

Concrete step-by-step proof — overflow path permanent loss

  1. Iteration N: extendedBudget = 30000 from freed gas.
  2. scanOverflow encounters T3 at the head of overflowHeap with Gas = 21000.
  3. remaining -= 21000 (extendedBudget now 9000); bonus = [T3]; h.Shift() removes T3 from overflowHeap permanently.
  4. forwardTxs(batch ∪ [T3]): prior batch fills txsCh; T3 hits default and is dropped.
  5. T3 is not in sentThisPhase (the drop path doesn't record), not in prefetchedHashes (never executed), and not in overflowHeap (shifted out). extendedBudget was reduced by 21000 with zero yield.

Impact

Block-building correctness is entirely unaffected: prefetch is explicitly best-effort throwaway state and the committed block is determined by the real commitTransactions path, not the prefetcher. There is no consensus or security impact — no mutations to persistent state, no changes to validator/sprint/snapshot logic, no panic or crash surface. The observable impact is bounded degradation of the worker/prefetch/builder_added_percent metric in txsCh-saturation windows. Given the 4096-deep buffer plus N≈4·NumCPU/5 workers draining continuously, realistic trigger conditions are narrow (sustained bursts where provider outpaces drain) and per-block impact is at most a handful of unprefetched txs.

How to fix

Option A (simpler): populate sentThisPhase only inside forwardTxs on the successful-send branch, and use a transient per-iteration set passed to scanOverflow for within-iteration dedup. This makes sentThisPhase semantically match its comment.

Option B (more invasive): on a forwardTxs drop of a bonus tx, refund the budget (extendedBudget += tx.Gas) and track unsent bonus txs separately for retry. Since h.Shift() can't be undone cheaply, practical implementations will want Option A together with a lightweight retry queue for the overflow path.
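Option A can be sketched as follows — the sent-set write moves onto the successful-send arm, with simplified string hashes standing in for types.Transaction (forwardHashes is a hypothetical simplification, not the PR's forwardTxs):

```go
package main

import "fmt"

// forwardHashes is a simplified forwardTxs: non-blocking sends, with the
// sent-set updated only when the send actually lands. A full channel can
// then no longer leave stale "in-flight" entries behind — a dropped tx
// stays eligible for a later overflow scan or retry.
func forwardHashes(ch chan<- string, hashes []string, sent map[string]struct{}) (dropped int) {
	for _, h := range hashes {
		select {
		case ch <- h:
			if sent != nil {
				sent[h] = struct{}{} // record only on success
			}
		default:
			dropped++ // silent drop: deliberately NOT recorded
		}
	}
	return dropped
}

func main() {
	ch := make(chan string, 1) // tiny buffer to force a drop
	sent := map[string]struct{}{}
	dropped := forwardHashes(ch, []string{"T1", "T2"}, sent)
	fmt.Println(dropped) // 1: T2 didn't fit
	_, t2 := sent["T2"]
	fmt.Println(t2) // false: T2 stays eligible for a retry
}
```

Within-iteration dedup would then use a transient per-iteration set, keeping the persistent set's semantics honest ("actually sent", not "scheduled").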

Severity

All six verifier confirmations (three for bug_001, three for bug_002) independently converged on nit. This is a real semantic defect in the deduplication/accounting design that slightly undermines the PR's headline efficiency metric, but has no correctness, consensus, or security consequences. Prefetch operates on throwaway state and the block-building path is entirely independent.

Comment on lines 2754 to 2780
    // Report prefetch coverage percentage
    if len(env.txs) > 0 && genParams != nil && genParams.prefetchedTxHashes != nil {
        prefetchedCount := 0
        builderAddedCount := 0

        // Count how many block transactions were prefetched
        for _, tx := range env.txs {
            if _, ok := genParams.prefetchedTxHashes.Load(tx.Hash()); ok {
                prefetchedCount++
            }
            if genParams.builderPrefetchedTxHashes != nil {
                if _, ok := genParams.builderPrefetchedTxHashes.Load(tx.Hash()); ok {
                    builderAddedCount++
                }
            }
        }

        // Miss rate (0-100, higher = worse).
        missRate := int64((len(env.txs) - prefetchedCount) * 100 / len(env.txs))
        prefetchMissRateHistogram.Update(missRate)

        // Builder-added share (0-100): block txs the builder phase prefetched on
        // its own. Only emitted when the builder phase actually ran.
        if genParams.builderPrefetchedTxHashes != nil {
            builderAdded := int64(builderAddedCount * 100 / len(env.txs))
            prefetchBuilderAddedHistogram.Update(builderAdded)
        }
    }


🟡 The deferred metrics block in commit() (worker.go:2754-2780) reads genParams.builderPrefetchedTxHashes and prefetchedTxHashes while PrefetchStream workers may still be draining in-flight txs from txsCh and writing to those maps via onSuccess. close(builderPlanCh) in buildAndCommitBlock only signals runBuilderTxProvider to return; the <-streamDone barrier is awaited in the separate prefetcher goroutine, not the main goroutine that calls commit(). sync.Map makes this safe (no data race), but the reads happen before the last onSuccess writes land, so worker/prefetch/builder_added_percent — the PR's headline feature metric — is systematically under-counted by up to ~numWorkers txs per block under load.

Extended reasoning...

What the bug is

In buildAndCommitBlock (miner/worker.go), the sequence around block commit is:

  1. Line 2265: genParams.planWg.Wait() — waits for sendPlan goroutines.
  2. Line 2268-2270: close(genParams.builderGasFreedCh).
  3. Line 2273-2275: close(genParams.builderPlanCh) — causes runBuilderTxProvider's collectPlanBatch to return builderDone=true, and runBuilderTxProvider returns.
  4. Line 2306: w.commit(work.copy(), ..., genParams) — runs in the same (main) goroutine.

Meanwhile, the prefetcher goroutine launched at line 2191 (go func() { ... w.runPrefetcher(...) }) is separate. When runBuilderTxProvider returns inside it, runPrefetcher then executes shutdown() (close txsCh) and <-streamDone. Crucially, the main buildAndCommitBlock goroutine does not wait on any of that before calling commit().

The specific race

commit()'s deferred block at worker.go:2754-2780 iterates env.txs and calls .Load() on both genParams.prefetchedTxHashes and genParams.builderPrefetchedTxHashes. At that instant, PrefetchStream workers (launched inside runPrefetcher) can still be executing any tx that was dequeued from txsCh before the channel was closed. Each such worker runs ApplyMessage + IntermediateRoot(true) (~1-10ms of wall time) and then invokes onSuccess(hash, gasUsed), which calls .Store on both maps. These .Store calls can happen after the defer's .Load calls read the maps.

Why existing code does not prevent it

The close(builderPlanCh) handshake only terminates the provider side of the pipeline — it tells runBuilderTxProvider there are no more plan txs. It does not drain or wait on the worker side. The <-streamDone barrier that does wait for all workers is sequenced inside runPrefetcher (after shutdown()), which runs in the prefetcher goroutine — not the main goroutine. sync.Map prevents a data-race crash, but it does not give the reader a consistent snapshot: any Store completing after the reader's Load is simply missed.

Impact

Metric-only. Block-building correctness and consensus are entirely unaffected (prefetch is best-effort throughout). The impact is systematic under-counting of worker/prefetch/builder_added_percent — the PR description calls this the "headline metric for judging whether the new phase is pulling its weight." Late onSuccess writes are bounded per block by the number of workers (max(1, 4*NumCPU/5), ~6 on an 8-core machine), so the bias is small but systematic. FinalizeAndAssemble runs inside commit() before the defer fires and acts as a natural buffer, which is why the devnet 3-5% number isn't zero — but under saturated pool/worker load the bias tilts further downward.

How to fix

Any of: (a) await <-streamDone from buildAndCommitBlock before calling commit() (requires surfacing the barrier via genParams); (b) snapshot the maps at close(builderPlanCh) time into a slice, then compute the metric from the snapshot; (c) move the metric computation downstream of the prefetcher's exit.

Step-by-step proof

  1. runBuilderTxProvider loop reaches forwardTxs for batch [T1, T2]; both land in txsCh (buffered).
  2. Workers W1 and W2 dequeue T1 and T2 respectively and begin ApplyMessage + IntermediateRoot(true).
  3. buildAndCommitBlock reaches line 2274: close(builderPlanCh).
  4. collectPlanBatch's next select iteration reads planCh closed → returns builderDone=true. runBuilderTxProvider returns.
  5. Main goroutine continues from line 2275 into switch err → reaches line 2306: w.commit(work.copy(), ..., genParams).
  6. commit() runs FinalizeAndAssemble, submits to taskCh, returns. Total wall time: tens of ms, typically longer than W1/W2's remaining EVM work — but not always, and not deterministically.
  7. commit()'s defer fires. It calls .Load(T1.Hash) and .Load(T2.Hash) on builderPrefetchedTxHashes.
  8. If W1 or W2 had not yet reached its onSuccess call at the moment of the .Load (e.g. still inside IntermediateRoot), that tx is not counted — even though it was genuinely prefetched during the builder phase.
  9. prefetchBuilderAddedHistogram receives a value lower than the true builder contribution for this block.

@lucca30
Contributor Author

lucca30 commented Apr 22, 2026

@claude ultra-review this PR.

Context:

  • 8 commits on top of develop; feature is a streaming prefetcher with idle→builder phase handoff.
  • 5 rounds of standard @claude review already landed; all findings addressed through 827c017fb.
  • 13 unit tests cover the pure helpers (scanOverflow, forwardTxs, collectPlanBatch, buildTxPlan, runBuilderTxProvider no-dup invariant).
  • Kurtosis devnet run #2 empirically validated: BUILDER-DUP=0 across 369 blocks, panic meter == log count exactly (1050==1050), 0 leaked goroutines in pprof, STREAM-END arithmetic 2180/2180 ok, builder_added_percent p50 climbed 0→86 under tight idle cap.

Please focus only on angles the above does not already cover:

1. Concurrency correctness. Trace every interaction between hardKill, evmAbort, inBuilderPhase, sync.Once-shutdown, close(txsCh), and the worker pool. Look for:

  • lost-wakeup between evmAbort.Store(true) and drainTxChan
  • double-close of txsCh
  • a goroutine that can observe hardKill=false, evmAbort=true, closed txsCh in an order that stalls it
  • panic during handoff hold that skips sync.Once

2. Resource-accounting invariants, proven not just asserted:

  • (a) txIndex == valid + invalid on every PrefetchStream exit (normal / hardKill / evmAbort+close / panic / interrupt)
  • (b) No tx hash reaches txsCh twice in a single runBuilderTxProvider invocation
  • (c) builderPrefetchedTxHashes is only written while inBuilderPhase==true

3. Perf regression on the hot import path. Compare the new PrefetchStream wrapper (used by blockchain.go for block import) to the old per-call Prefetch(). Any path where the streaming version does strictly more work for the same input?

4. Failure modes:

  • builderPlanCh closed before runBuilderTxProvider starts
  • empty txPool at buildOverflowHeap time
  • stale pool snapshot (overflow heap built at T, consumed at T+100ms after pool churned)
  • concurrent hardKill + evmAbort set from different goroutines mid-phase

Scope exclusions:

  • Don't re-flag findings already fixed in 88c4e44..827c017fb.
  • Don't comment on test coverage (unit tests recent; kurtosis evidence above).
  • Don't suggest naming / style changes.
  • Flag only findings reachable in production.

@claude

claude Bot commented Apr 22, 2026

Claude finished @lucca30's task in 5m 26s


Ultra-Review: Streaming Prefetcher Concurrency & Correctness

  • Read core/state_prefetcher.go — worker pool, hardKill, evmAbort, sync.Once shutdown
  • Read miner/worker.go — runPrefetcher, handoff, runBuilderTxProvider, channel lifecycle
  • Read miner/ordering.go — clone() and heap interactions
  • Read core/blockchain.go — compare old Prefetch vs new PrefetchStream wrapper
  • Analysis 1: Concurrency correctness
  • Analysis 2: Resource-accounting invariants
  • Analysis 3: Perf regression on hot import path
  • Analysis 4: Failure modes
  • Post findings

All prior review findings (88c4e44..827c017) verified as addressed. This review covers only the four scoped areas. One actionable finding; remainder is clean.


1. Concurrency Correctness

Traced all interactions between hardKill, evmAbort, inBuilderPhase, sync.Once-shutdown, close(txsCh), and the worker pool.

Lost-wakeup between evmAbort.Store(true) and drainTxChan: Not possible. Both workers and drainTxChan consume from txsCh concurrently — Go channels are thread-safe. Every consumed item during evmAbort=true is correctly skipped by workers (line 188-189) or drained (line 2421-2428). No idle items are produced between drain and evmAbort.Store(false) because runIdleTxProvider has already returned. No builder items are produced yet because runBuilderTxProvider hasn't started. ✅

Double-close of txsCh: Prevented by sync.Once (line 2381-2387). The shutdown closure wraps close(txsCh) inside shutdownOnce.Do. The explicit shutdown() at line 2414 and the deferred shutdown() at line 2388 both call the same Once-guarded function. ✅

Goroutine stall ordering — hardKill=false, evmAbort=true, closed txsCh: This combination occurs during the handoff's shutdown sequence (line 2383-2386: evmAbort.Store(true); close(txsCh)). Workers in runWorker (line 184) are in for tx := range s.txsCh. When txsCh closes, range returns zero-value and terminates the loop — workers exit regardless of evmAbort state. No stall. ✅

Panic during handoff that skips sync.Once: If runIdleTxProvider or runBuilderTxProvider panics, the deferred shutdown() at line 2388 fires, closing txsCh via sync.Once. Workers drain, pool.Wait() returns, PrefetchStream completes, and the PrefetchStream goroutine self-terminates. The <-streamDone at line 2415 is not reached on the panic path, but this is acceptable — the PrefetchStream goroutine holds no persistent resources and self-resolves within microseconds. ✅

Concurrent hardKill + evmAbort from different goroutines: Both are atomic.Bool — no data race. Workers check hardKill first (line 185), which takes priority. If both are true, worker returns immediately. ✅

No concurrency issues found.


2. Resource-Accounting Invariants

(a) txIndex == valid + invalid on every PrefetchStream exit

Traced every path through processTx and runWorker:

| Path | txIndex incremented? | fails incremented? | Counted? |
| --- | --- | --- | --- |
| processTx → prefetchOneTx succeeds | Yes (line 196) | No | valid |
| processTx → prefetchOneTx interrupt | Yes (line 196) | Yes (line 233) | invalid |
| processTx → prefetchOneTx error (any) | Yes (line 196) | Yes (lines 239, 247, 258) | invalid |
| processTx → worker panic in prefetchOneTx | Yes (line 196) | Yes (recover at line 137) | invalid |
| runWorker, evmAbort=true → continue | No | No | not counted (correct) |
| runWorker, hardKill=true → return | No | No | not counted (correct) |

At stream exit (line 147-149):

processed := ctx.txIndex.Load()
blockPrefetchTxsValidMeter.Mark(processed - ctx.fails.Load())
blockPrefetchTxsInvalidMeter.Mark(ctx.fails.Load())

valid + invalid = (processed - fails) + fails = processed = txIndex. Invariant holds on all exit paths. ✅

(b) No tx hash reaches txsCh twice in a single runBuilderTxProvider invocation

Three dedup layers:

  1. collectPlanBatch (line 2665-2666): checks prefetchedHashes.Load(tx.Hash()) — skips txs already completed by idle/prior builder work.
  2. sentThisPhase pre-populated (lines 2600-2602) with current plan-batch hashes before scanOverflow runs — prevents within-iteration plan-vs-overflow duplicates (fixed in 62a72c0).
  3. forwardTxs (line 2624) records each successfully-sent tx in sentThisPhase — prevents cross-iteration re-emission.

The one gap: forwardTxs drops on a full txsCh buffer (default branch, line 2626-2627), but the pre-populate at lines 2600-2602 already added the hash to sentThisPhase unconditionally. A dropped plan tx still has a sentThisPhase entry, so it won't be re-emitted by scanOverflow in future iterations. However, a dropped bonus tx from scanOverflow was h.Shift()-ed out of the overflow heap (line 1950) and its hash was not pre-populated in sentThisPhase (only plan txs are pre-populated). If it was added to sentThisPhase by forwardTxs on the successful-send branch, it's tracked. If it was dropped, it's gone from the heap and not in sentThisPhase — it can't reappear. So no hash can reach txsCh twice. ✅

(c) builderPrefetchedTxHashes is only written while inBuilderPhase==true

onSuccess at line 2363: if inBuilderPhase.Load() && .... inBuilderPhase is set to true at line 2406, strictly after evmAbort.Store(false) at line 2403. The residual edge case (worker completing IntermediateRoot during the handoff, acknowledged in comments at lines 2351-2356) could misattribute at most 1 tx per block. Documented and acceptable. ✅


3. Perf Regression on Hot Import Path

The import path (blockchain.go:782) calls Prefetch(block, throwaway, vmCfg, false, followupInterrupt), which now wraps PrefetchStream via a pre-filled channel.

Strictly new work per tx on the import path vs the old errgroup code:

  • 1 channel send (non-blocking, buffer == len(txs)) + 1 channel receive per tx
  • 1 txIndex.Add(1) atomic increment (old code used loop variable i directly)
  • 2 nil-checks (evmAbort, onSuccess) that short-circuit immediately since both are nil

Total overhead: ~2-5 nanoseconds per tx. Negligible against statedb.Copy() + ApplyMessage + IntermediateRoot (~1-10ms each).

Structural improvement: Old code called statedb.Copy() in the dispatching goroutine before spawning work. New code calls statedb.Copy() inside the worker goroutine (prefetchOneTx line 244), allowing copies to happen concurrently — slightly lower latency under contention.

Worker pool size: Identical — max(1, 4*runtime.NumCPU()/5) in both old and new.

No performance regression on the import path.


4. Failure Modes

builderPlanCh closed before runBuilderTxProvider starts

builderPlanCh is created at line 2223 and closed at line 2274, both in buildAndCommitBlock (main goroutine). runBuilderTxProvider runs in the prefetcher goroutine. The ordering: builderStarted.Store(true) (line 2225) → runIdleTxProvider exits (sees builderStarted=true) → handoff → runBuilderTxProvider starts.

Can close(builderPlanCh) at line 2274 fire before runBuilderTxProvider starts? Yes — if fillTransactions + planWg.Wait() completes before the prefetcher goroutine reaches line 2408. In that case, collectPlanBatch reads from a closed planCh and gets builderDone=true immediately (line 2660). runBuilderTxProvider returns immediately. Any plan txs were buffered in the 4096-deep channel and consumed; any freed-gas signals were buffered in the 256-deep builderGasFreedCh. These are lost. This is acceptable: the entire block was already built; prefetching for an already-committed block is pointless. ✅

Empty txPool at buildOverflowHeap time

buildOverflowHeap (line 2635-2641) calls Pending() which returns an empty map. newTransactionsByPriceAndNonce with empty txs returns an empty heap. scanOverflow immediately breaks at line 2917 (ltx == nil). No bonus txs emitted. No crash, no wasted work. ✅

Stale pool snapshot (overflow heap built at T, consumed at T+100ms after pool churned)

buildOverflowHeap is called once at the start of runBuilderTxProvider (line 2576). The overflow heap is then scanned repeatedly over the 2ms batch windows. Pool churn means:

  • Txs in the overflow heap may have been evicted from the pool → ltx.Resolve() returns nil → h.Pop() at line 1944, skipped gracefully.
  • New txs arriving after the snapshot are invisible to the overflow scan → missed opportunity, but prefetch is best-effort. The plan channel covers the primary path (builder's own view).

No crash, no incorrect behavior. Stale snapshot is a deliberate design tradeoff acknowledged in the PR. ✅

Concurrent hardKill + evmAbort set from different goroutines mid-phase

Already covered in Analysis 1. Both are atomic.Bool; no ordering dependency. Workers prioritize hardKill. The shutdown() function sets evmAbort=true then close(txsCh), both inside sync.Once. If hardKill=true races with shutdown(), workers exit on whichever signal they see first. ✅


🟡 One Remaining Unfixed Finding

prepareWork early-return bypasses defer interruptPrefetch.Store(true) — zombie prefetcher on prepareWork failure

File: miner/worker.go lines 2209-2228

func (w *worker) buildAndCommitBlock(..., interruptPrefetch *atomic.Bool) {
    work, err := w.prepareWork(genParams, w.makeWitness)
    if err != nil {
        return  // ← line 2212: exits BEFORE the defer at line 2228 is registered
    }
    ...
    defer interruptPrefetch.Store(true)  // ← line 2228: too late

The prefetcher goroutine (spawned unconditionally at line 2191) is never signaled to stop. It runs runIdleTxProvider which loops until gas exhaustion (totalGasPool.Gas() == 0), burning through the full header gas budget (~100M gas at default PrefetchGasLimitPercent=100) across 4*NumCPU/5 workers doing throwaway EVM executions.

This was flagged in the previous review round but not fixed.

Why it's bounded, not a leak: runPrefetcher's deferred shutdown() (via sync.Once) ensures txsCh is closed once runIdleTxProvider returns, so the PrefetchStream goroutine and workers self-terminate. No goroutine leak — just wasted CPU.

Production trigger: prepareWork → engine.Prepare can fail via UnauthorizedSignerError, Heimdall snapshot/span lookup failures, or makeEnv → stateless.NewWitness() errors. On a Giugliano primary producer, Prepare includes the succession-delay time.After — but the runPrefetcher call to makeHeader(..., false) skips this wait and starts streaming immediately, so the zombie is active for the full duration of the wasted work.

Impact: ~100M gas of throwaway EVM work across ~6 workers per prepareWork failure. No correctness/consensus impact, but real CPU waste on an uncommon-but-realistic error path. If prepareWork fails repeatedly (e.g., prolonged Heimdall issues), zombie prefetchers from successive commitWork calls compete for CPU.

Fix: Move the defer before the err check:

func (w *worker) buildAndCommitBlock(..., interruptPrefetch *atomic.Bool) {
    defer interruptPrefetch.Store(true)  // ← moved up: covers ALL exit paths
    work, err := w.prepareWork(genParams, w.makeWitness)
    if err != nil {
        return
    }



Summary

| Area | Verdict |
| --- | --- |
| Lost-wakeup in handoff | Clean |
| Double-close of txsCh | Protected by sync.Once |
| Goroutine stall ordering | Clean — channel close unblocks all paths |
| Panic during handoff | sync.Once defer covers it; PrefetchStream goroutine self-terminates |
| txIndex == valid + invalid | Holds on all 6 exit paths (verified per-path) |
| No-dup in runBuilderTxProvider | Three dedup layers cover all reachable cases |
| builderPrefetchedTxHashes attribution | Correctly gated; 1-tx residual acknowledged |
| Import path perf | ~2-5ns overhead per tx; statedb.Copy() moved to worker (slight improvement) |
| builderPlanCh closed early | Graceful — collectPlanBatch sees closed channel, exits |
| Empty pool | Graceful — empty heap, no work |
| Stale snapshot | By design — Resolve() nil-check handles evicted txs |
| Concurrent kill+abort | Atomic stores, no ordering dependency, workers prioritize hardKill |
| prepareWork early-return | Unfixed — zombie prefetcher burns CPU until gas exhaustion |

The concurrency design is sound. The sync.Once shutdown, two-signal interrupt model, and channel lifecycle are correctly composed. The one actionable item is the defer placement in buildAndCommitBlock — a one-line move.
Branch: lmartins/lmartins/builder-prefetch-mode-switch

Move `defer interruptPrefetch.Store(true)` to the top of
buildAndCommitBlock so the prefetcher goroutine is signaled to stop
on every return path — including the early return when prepareWork
fails (UnauthorizedSignerError, Heimdall snapshot/span failures,
stateless.NewWitness errors). Previously the defer sat after the
err check, so on prepareWork failure runIdleTxProvider would loop
until gas exhaustion, burning ~100M gas of throwaway EVM work
across the worker pool before self-terminating.

No correctness/consensus impact; fixes a real CPU-waste path on
an uncommon-but-realistic error branch. Bounded by PrefetchStream's
sync.Once shutdown (no goroutine leak), but the wasted work was
material under prolonged Heimdall issues when consecutive
commitWork calls could pile up zombie prefetchers.
Comment thread miner/worker.go
Pass sentThisPhase into collectPlanBatch and skip txs already present,
mirroring the existing scanOverflow dedup pattern. Closes the third
edge of the builder-phase dedup matrix:

1. within-iter plan↔overflow — covered by pre-populate sentThisPhase
   before scanOverflow (commit 62a72c0)
2. plan→overflow cross-iter — covered by scanOverflow's sentThisPhase
   read (commit 88c4e44)
3. overflow→plan cross-iter — this commit

Scenario: scanOverflow emits tx T in iteration N (h.Shift() past T,
sentThisPhase[T]=true). Worker W1 begins multi-ms EVM on T. In
iteration N+1, collectPlanBatch reads a buffered copy of T from
planCh; prefetchedHashes is still empty because onSuccess hasn't
fired, so T slips through and gets forwarded a second time.

Impact is wasted worker capacity only (throwaway state, no
consensus/correctness effect), but the race opens wider exactly on
contract-heavy workloads where builder-phase prefetch matters most.
Kurtosis run #2 scenario E didn't catch this because value-transfer
EVM finishes in ~10µs, well inside the 2ms batch window — so
prefetchedHashes raced fast enough to plug the gap.

Added TestCollectPlanBatch_SkipsInflight as the regression test.
