Skip to content

interop: address strong-consistency review follow-ups (#19561)#20194

Open
wwared wants to merge 7 commits intodevelopfrom
gs/interop-consistency
Open

interop: address strong-consistency review follow-ups (#19561)#20194
wwared wants to merge 7 commits intodevelopfrom
gs/interop-consistency

Conversation

@wwared
Copy link
Copy Markdown
Contributor

@wwared wwared commented Apr 20, 2026

Summary

Closes #19561 — the follow-up issue that captured the non-trivial review feedback on #19505 (interop: add strong consistency guarantees). Each item is its own commit so they can be reviewed and reverted independently.

# Commit What changes
1 interop: replace nil-check with no-op l1Checker in tests byNumberConsistencyChecker becomes the l1ConsistencyChecker interface; concrete prod impl stays as l1ByNumberChecker. Test harness injects a noopL1Checker{}. The if i.l1Checker != nil guard in observeRound is removed — production now requires a real checker.
2 interop: serialize Decision as a string in the WAL Decision round-trips as "wait" / "advance" / "invalidate" / "rewind" via MarshalJSON / UnmarshalJSON. Marshal rejects unknown ints; Unmarshal rejects unknown strings and non-strings. WAL is now resilient to enum re-ordering or new variants.
3 interop: compare deserialized VerifiedResult on idempotent replay VerifiedDB.Commit decodes the existing stored value and compares with reflect.DeepEqual instead of bytes.Equal on JSON output. Removes a Go-version-dependent failure mode where a re-marshal on a newer Go could turn an idempotent replay into ErrAlreadyCommitted. New test writes indented JSON straight into the bucket to exercise the byte-divergent / struct-equal path.
4 interop: add rewind-replay-after-failure test Existing TestPendingTransition_RecoverRewindPreservedOnFailure only covered the first attempt. New TestPendingTransition_RewindReplaysAfterFailure drives the failure → fix → replay loop end-to-end via progressAndRecord and asserts the WAL is cleared and RewindEngine was called on both attempts.
5 interop: add progressAndRecord integration tests TestInterop_FullCycle and most TestProgressInterop* tests bypass the WAL through progressInteropCompat / applyResultCompat shims. Added two new tests that drive the real progressAndRecord path: a multi-cycle advance (asserts WAL cleared after every cycle, logsDB sealed through the latest ts) and an L1-inconsistency → rewind (drives a flipped inconsistentL1Checker{} and asserts the full WAL → apply → clear cycle, with RewindEngine called exactly once). Existing compat-shim tests untouched.
6 interop: pass frontierView through call chain frontierView was a struct field on Interop only non-nil during verify(). Moves it into the call signature: verifyFn and cycleVerifyFn now take view *frontierVerificationView, and verifyInteropMessages, verifyExecutingMessage, verifyCycleMessages accept it explicitly. Lifetime is now visible in the type signatures.
7 interop: hide reorg ops behind InteropChain interface Drops RewindEngine / InvalidateBlock from ChainContainer. New InteropChain interface (in chain_container.go, fronted by a banner-style WARNING) embeds ChainContainer and adds the two reorg-triggering methods. Supernode.chains is map[ChainID]cc.InteropChain; supernode builds a narrower ChainContainer map for heartbeat / supernode / superroot activities so they cannot call the reorg methods at compile time. Removes the four TODO(#19561) "footgun" comments and the dead RewindEngine / InvalidateBlock stubs from the superroot / supernode test mocks.

WAL compatibility note (item 2)

The string-based Decision encoding is not backward-compatible with on-disk WAL entries written by the previous integer encoding — a node restarting with this PR over a pre-existing pending transition will fail to unmarshal it. The intended upgrade path is to ensure no pending transition exists before upgrading (let the node drain its WAL first) or to reset the devnet entirely. This mostly matters for the internal devnet; there is no production deployment yet.

Item 8 (legacy decodeDenyRecords)

Already resolved by #19839 (commit 6cdfde0): the concatenated-32-byte-hash decode path was removed and DenyRecord always carries a DecisionTimestamp. No commit was needed in this PR. After this PR lands the issue can be closed.

Test plan

  • mise exec -- go test ./op-supernode/... (all packages green)
  • mise exec -- ./linter/bin/op-golangci-lint run ./op-supernode/... (0 issues)
  • mise exec -- go build ./... (whole-repo build clean)
  • CI green on this branch

wwared added 7 commits April 20, 2026 21:54
Follow-up from #19505 review: l1Checker was only ever nil in tests, and
the guard in observeRound masked what should be a production invariant.

- Convert byNumberConsistencyChecker from a concrete struct into an
  l1ConsistencyChecker interface. The production implementation keeps
  its by-number semantics under the l1ByNumberChecker name.
- Add noopL1Checker in checker_test.go for tests that do not exercise
  L1 consistency.
- Inject noopL1Checker from the test harness (and the two bare New calls)
  and drop the `if i.l1Checker != nil` guard in observeRound.
Follow-up from #19505 review: Decision was encoded as its bare iota
integer, so reordering variants or inserting a new one would silently
reinterpret any pending WAL entry written by an older binary.

- Add Decision.String, MarshalJSON, and UnmarshalJSON so the enum round
  trips as "wait" / "advance" / "invalidate" / "rewind".
- Reject unknown integers on Marshal and unknown / non-string values on
  Unmarshal so format drift fails loudly instead of silently.
- Add round-trip and rejection tests. Full-stack PendingTransition WAL
  coverage is already exercised by TestVerifiedDB_PendingTransition.
…9561)

Follow-up from #19505 review: VerifiedDB.Commit used bytes.Equal on the
re-marshaled VerifiedResult to detect a crash-recovery replay. Go's
encoding/json is not byte-stable across versions, so a Go upgrade between
the original Commit and the replay could turn a legitimate replay into a
hard ErrAlreadyCommitted.

- Unmarshal the stored bytes and compare the resulting VerifiedResult with
  reflect.DeepEqual on the idempotent-replay path.
- Defer the Marshal until after the sequential / replay checks so we do
  not pay marshal cost on the idempotent path.
- Add TestVerifiedDB_Commit_IdempotentReplay: same-struct, mismatched-
  struct (still errors), and byte-divergent-but-struct-equal (writes
  indented JSON directly to the bucket to simulate the cross-version
  scenario).
Follow-up from #19505 review: TestPendingTransition_RecoverRewindPreservedOnFailure
only covers the first attempt (failure path preserves the WAL). There was no
test for the actual replay once the failing chain is fixed.

Add TestPendingTransition_RewindReplaysAfterFailure:
1. Two chains; chain B's RewindEngine errors on the first attempt.
2. Build + persist a DecisionRewind and call applyPendingTransition —
   assert the exact wrapped error and that the WAL is preserved.
3. Clear chain B's injected failure.
4. Call progressAndRecord so the recovery path picks up the WAL entry
   and drives it to completion.
5. Assert the WAL is cleared and both chains' RewindEngine was invoked
   on both attempts.
Follow-up from #19505 review: TestInterop_FullCycle and most
TestProgressInterop* tests drive the interop loop through
progressInteropCompat / applyResultCompat, which bypass the real WAL
write → apply → clear path. The core correctness property of the
PendingTransition flow was therefore not covered by tests.

- Add inconsistentL1Checker helper in checker_test.go to drive
  observeRound into a rewind decision.
- TestInterop_ProgressAndRecord_MultiAdvance runs three end-to-end
  advance cycles through progressAndRecord and asserts the WAL is
  cleared after every cycle and the logsDB is sealed through to the
  latest timestamp.
- TestInterop_ProgressAndRecord_L1InconsistencyTriggersRewind advances
  twice, swaps the l1Checker to report an inconsistency, and asserts
  the next progressAndRecord drives a full rewind: verifiedDB trimmed,
  WAL cleared, and RewindEngine called exactly once with the target
  timestamp.

Existing compat-shim tests are left unchanged; this is additive coverage.
Follow-up from #19505 review: the frontierView was stored as an
Interop struct field that was only non-nil during verify(), which
made the implicit lifetime invisible in type signatures.

- Extend verifyFn and cycleVerifyFn to accept *frontierVerificationView
  so the view is explicit at every call site.
- Thread it through verifyInteropMessages, verifyExecutingMessage, and
  verifyCycleMessages; stop reading i.frontierView.
- Drop the frontierView struct field and its set/clear dance from verify().
- Update test stubs and the direct algo_test.go call to match.
Follow-up from #19505 review: RewindEngine and InvalidateBlock were
exposed on the broad ChainContainer interface that every activity
holds, with four "this is a footgun" TODOs warning callers off. Move
them behind an interop-owned interface so misuse is caught at compile
time instead of by review.

- Drop RewindEngine and InvalidateBlock from the ChainContainer
  interface in chain_container.go.
- Add a new InteropChain interface in the same file: it embeds
  ChainContainer and adds the two reorg-triggering methods, with a
  banner-style WARNING that only the interop activity may hold it.
- simpleChainContainer satisfies both; NewChainContainer now returns
  InteropChain so supernode can store the wider view.
- Supernode.chains is map[ChainID]cc.InteropChain; supernode builds a
  narrower ChainContainer map for heartbeat / supernode / superroot
  activities so they cannot call the reorg methods at compile time.
- Interop activity stores chains as map[ChainID]cc.InteropChain.
  Test chain maps in interop_test.go and algo_test.go retyped to match.
- Drop dead RewindEngine / InvalidateBlock stubs from the mockCC
  helpers in superroot_test.go and supernode_test.go (no longer
  required by the narrower ChainContainer they implement).
- Remove the four TODO(#19561) "footgun" comments.
@wwared wwared requested a review from a team as a code owner April 20, 2026 22:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

interop: follow-up items from strong-consistency review

1 participant