interop: address strong-consistency review follow-ups (#19561)#20194
Open
interop: address strong-consistency review follow-ups (#19561)#20194
Conversation
Follow-up from #19505 review: l1Checker was only ever nil in tests, and the guard in observeRound masked what should be a production invariant. - Convert byNumberConsistencyChecker from a concrete struct into an l1ConsistencyChecker interface. The production implementation keeps its by-number semantics under the l1ByNumberChecker name. - Add noopL1Checker in checker_test.go for tests that do not exercise L1 consistency. - Inject noopL1Checker from the test harness (and the two bare New calls) and drop the `if i.l1Checker != nil` guard in observeRound.
Follow-up from #19505 review: Decision was encoded as its bare iota integer, so reordering variants or inserting a new one would silently reinterpret any pending WAL entry written by an older binary. - Add Decision.String, MarshalJSON, and UnmarshalJSON so the enum round trips as "wait" / "advance" / "invalidate" / "rewind". - Reject unknown integers on Marshal and unknown / non-string values on Unmarshal so format drift fails loudly instead of silently. - Add round-trip and rejection tests. Full-stack PendingTransition WAL coverage is already exercised by TestVerifiedDB_PendingTransition.
…9561) Follow-up from #19505 review: VerifiedDB.Commit used bytes.Equal on the re-marshaled VerifiedResult to detect a crash-recovery replay. Go's encoding/json is not byte-stable across versions, so a Go upgrade between the original Commit and the replay could turn a legitimate replay into a hard ErrAlreadyCommitted. - Unmarshal the stored bytes and compare the resulting VerifiedResult with reflect.DeepEqual on the idempotent-replay path. - Defer the Marshal until after the sequential / replay checks so we do not pay marshal cost on the idempotent path. - Add TestVerifiedDB_Commit_IdempotentReplay: same-struct, mismatched- struct (still errors), and byte-divergent-but-struct-equal (writes indented JSON directly to the bucket to simulate the cross-version scenario).
Follow-up from #19505 review: TestPendingTransition_RecoverRewindPreservedOnFailure only covers the first attempt (failure path preserves the WAL). There was no test for the actual replay once the failing chain is fixed. Add TestPendingTransition_RewindReplaysAfterFailure: 1. Two chains; chain B's RewindEngine errors on the first attempt. 2. Build + persist a DecisionRewind and call applyPendingTransition — assert the exact wrapped error and that the WAL is preserved. 3. Clear chain B's injected failure. 4. Call progressAndRecord so the recovery path picks up the WAL entry and drives it to completion. 5. Assert the WAL is cleared and both chains' RewindEngine was invoked on both attempts.
Follow-up from #19505 review: TestInterop_FullCycle and most TestProgressInterop* tests drive the interop loop through progressInteropCompat / applyResultCompat, which bypass the real WAL write → apply → clear path. The core correctness property of the PendingTransition flow was therefore not covered by tests. - Add inconsistentL1Checker helper in checker_test.go to drive observeRound into a rewind decision. - TestInterop_ProgressAndRecord_MultiAdvance runs three end-to-end advance cycles through progressAndRecord and asserts the WAL is cleared after every cycle and the logsDB is sealed through to the latest timestamp. - TestInterop_ProgressAndRecord_L1InconsistencyTriggersRewind advances twice, swaps the l1Checker to report an inconsistency, and asserts the next progressAndRecord drives a full rewind: verifiedDB trimmed, WAL cleared, and RewindEngine called exactly once with the target timestamp. Existing compat-shim tests are left unchanged; this is additive coverage.
Follow-up from #19505 review: the frontierView was stored as an Interop struct field that was only non-nil during verify(), which made the implicit lifetime invisible in type signatures. - Extend verifyFn and cycleVerifyFn to accept *frontierVerificationView so the view is explicit at every call site. - Thread it through verifyInteropMessages, verifyExecutingMessage, and verifyCycleMessages; stop reading i.frontierView. - Drop the frontierView struct field and its set/clear dance from verify(). - Update test stubs and the direct algo_test.go call to match.
Follow-up from #19505 review: RewindEngine and InvalidateBlock were exposed on the broad ChainContainer interface that every activity holds, with four "this is a footgun" TODOs warning callers off. Move them behind an interop-owned interface so misuse is caught at compile time instead of by review. - Drop RewindEngine and InvalidateBlock from the ChainContainer interface in chain_container.go. - Add a new InteropChain interface in the same file: it embeds ChainContainer and adds the two reorg-triggering methods, with a banner-style WARNING that only the interop activity may hold it. - simpleChainContainer satisfies both; NewChainContainer now returns InteropChain so supernode can store the wider view. - Supernode.chains is map[ChainID]cc.InteropChain; supernode builds a narrower ChainContainer map for heartbeat / supernode / superroot activities so they cannot call the reorg methods at compile time. - Interop activity stores chains as map[ChainID]cc.InteropChain. Test chain maps in interop_test.go and algo_test.go retyped to match. - Drop dead RewindEngine / InvalidateBlock stubs from the mockCC helpers in superroot_test.go and supernode_test.go (no longer required by the narrower ChainContainer they implement). - Remove the four TODO(#19561) "footgun" comments.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes #19561 — the follow-up issue that captured the non-trivial review feedback on #19505 (interop: add strong consistency guarantees). Each item is its own commit so they can be reviewed and reverted independently.
interop: replace nil-check with no-op l1Checker in testsbyNumberConsistencyCheckerbecomes thel1ConsistencyCheckerinterface; concrete prod impl stays asl1ByNumberChecker. Test harness injects anoopL1Checker{}. Theif i.l1Checker != nilguard inobserveRoundis removed — production now requires a real checker.interop: serialize Decision as a string in the WALDecisionround-trips as"wait" / "advance" / "invalidate" / "rewind"viaMarshalJSON/UnmarshalJSON. Marshal rejects unknown ints; Unmarshal rejects unknown strings and non-strings. WAL is now resilient to enum re-ordering or new variants.interop: compare deserialized VerifiedResult on idempotent replayVerifiedDB.Commitdecodes the existing stored value and compares withreflect.DeepEqualinstead ofbytes.Equalon JSON output. Removes a Go-version-dependent failure mode where a re-marshal on a newer Go could turn an idempotent replay intoErrAlreadyCommitted. New test writes indented JSON straight into the bucket to exercise the byte-divergent / struct-equal path.interop: add rewind-replay-after-failure testTestPendingTransition_RecoverRewindPreservedOnFailureonly covered the first attempt. NewTestPendingTransition_RewindReplaysAfterFailuredrives the failure → fix → replay loop end-to-end viaprogressAndRecordand asserts the WAL is cleared andRewindEnginewas called on both attempts.interop: add progressAndRecord integration testsTestInterop_FullCycleand mostTestProgressInterop*tests bypass the WAL throughprogressInteropCompat/applyResultCompatshims. Added two new tests that drive the realprogressAndRecordpath: a multi-cycle advance (asserts WAL cleared after every cycle, logsDB sealed through the latest ts) and an L1-inconsistency → rewind (drives a flippedinconsistentL1Checker{}and asserts the full WAL → apply → clear cycle, withRewindEnginecalled exactly once). Existing compat-shim tests untouched.interop: pass frontierView through call chainfrontierViewwas a struct field onInteroponly non-nil duringverify(). Moves it into the call signature:verifyFnandcycleVerifyFnnow takeview *frontierVerificationView, andverifyInteropMessages,verifyExecutingMessage,verifyCycleMessagesaccept it explicitly. Lifetime is now visible in the type signatures.interop: hide reorg ops behind InteropChain interfaceRewindEngine/InvalidateBlockfromChainContainer. NewInteropChaininterface (in chain_container.go, fronted by a banner-style WARNING) embedsChainContainerand adds the two reorg-triggering methods.Supernode.chainsismap[ChainID]cc.InteropChain; supernode builds a narrowerChainContainermap for heartbeat / supernode / superroot activities so they cannot call the reorg methods at compile time. Removes the fourTODO(#19561)"footgun" comments and the deadRewindEngine/InvalidateBlockstubs from the superroot / supernode test mocks.WAL compatibility note (item 2)
The string-based
Decisionencoding is not backward-compatible with on-disk WAL entries written by the previous integer encoding — a node restarting with this PR over a pre-existing pending transition will fail to unmarshal it. The intended upgrade path is to ensure no pending transition exists before upgrading (let the node drain its WAL first) or to reset the devnet entirely. This mostly matters for the internal devnet; there is no production deployment yet.Item 8 (legacy
decodeDenyRecords)Already resolved by #19839 (commit 6cdfde0): the concatenated-32-byte-hash decode path was removed and
DenyRecordalways carries aDecisionTimestamp. No commit was needed in this PR. After this PR lands the issue can be closed.Test plan
mise exec -- go test ./op-supernode/...(all packages green)mise exec -- ./linter/bin/op-golangci-lint run ./op-supernode/...(0 issues)mise exec -- go build ./...(whole-repo build clean)