fix(op-node): probe engine finalized head on EL-sync startup #20139

Draft
sebastianst wants to merge 1 commit into develop from
seb/fix/fcu-nudge-during-el-sync

sebastianst commented Apr 17, 2026

Summary

Fixes the stall described in #18468: when op-node is configured with --syncmode=execution-layer, NewEngineController unconditionally initializes syncStatus = syncStatusWillStartEL. After a restart against an already-synced engine, op-node sits in that state forever: SyncStep backs off on isEngineInitialELSyncing(), and the only place that transitions out of syncStatusWillStartEL is insertUnsafePayload — which requires a fresh unsafe payload to arrive. If the sequencer is only gossipping blocks the engine already has, op-node and reth deadlock (reth logs no consensus updates received for a while; op-node logs unsafe=0 safe=0 el_syncing=true indefinitely).

Approach

  • Add MaybeSkipELSyncIfEngineAlreadySynced on EngineController: while syncStatus == syncStatusWillStartEL, probe the engine's finalized head. If it's a non-genesis block (and SupportsPostFinalizationELSync is not set), transition directly to syncStatusFinishedEL and emit ResetEngineRequestEvent so op-node's in-memory heads are populated via FindL2Heads. The emit happens after the mutex is released, because ResetEngineRequestEvent's handler re-acquires the same lock via OnEvent.
  • Call it at the top of SyncStep before the EL-sync backoff. It's a no-op once syncStatus has transitioned, so it runs at most a few times on startup.
  • Factor the finalized-head check (previously inline in insertUnsafePayload's syncStatusWillStartEL branch) into checkEngineAlreadySynced, used by both call sites.
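The lock-then-emit ordering described above can be illustrated with a minimal, self-contained sketch. This is not the code from this PR: the struct, its fields, `newSketchController`, and the string-hash comparison against genesis are all hypothetical stand-ins, chosen only to show why the reset is emitted after the mutex is released and why repeated calls become no-ops.

```go
package main

import (
	"fmt"
	"sync"
)

type syncStatus int

const (
	syncStatusWillStartEL syncStatus = iota
	syncStatusFinishedEL
)

// engineController is a hypothetical, stripped-down stand-in for the real
// EngineController; only the fields needed for this sketch are modeled.
type engineController struct {
	mu                             sync.Mutex
	status                         syncStatus
	genesisHash                    string
	supportsPostFinalizationELSync bool
	finalizedHash                  func() (string, error) // stand-in for querying the engine's finalized head
	emitReset                      func()                 // stand-in for emitting ResetEngineRequestEvent
	resets                         int                    // counts handled reset events (demo only)
}

// newSketchController wires up a controller that starts in syncStatusWillStartEL.
func newSketchController(genesisHash, finalizedHash string, postFinalization bool) *engineController {
	c := &engineController{
		status:                         syncStatusWillStartEL,
		genesisHash:                    genesisHash,
		supportsPostFinalizationELSync: postFinalization,
		finalizedHash:                  func() (string, error) { return finalizedHash, nil },
	}
	c.emitReset = c.onReset
	return c
}

// onReset mimics an event handler that takes the same lock as the probe.
func (e *engineController) onReset() {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.resets++
}

// checkEngineAlreadySynced reports whether the finalized head is past genesis.
// The caller must hold e.mu.
func (e *engineController) checkEngineAlreadySynced() (bool, error) {
	if e.supportsPostFinalizationELSync {
		return false, nil
	}
	hash, err := e.finalizedHash()
	if err != nil {
		return false, err
	}
	return hash != e.genesisHash, nil
}

// maybeSkipELSync transitions to syncStatusFinishedEL if the engine is already
// synced, and reports whether it did so.
func (e *engineController) maybeSkipELSync() (bool, error) {
	e.mu.Lock()
	if e.status != syncStatusWillStartEL {
		e.mu.Unlock()
		return false, nil // already transitioned: no-op
	}
	synced, err := e.checkEngineAlreadySynced()
	if err != nil || !synced {
		e.mu.Unlock()
		return false, err
	}
	e.status = syncStatusFinishedEL
	e.mu.Unlock()
	// Emit only after unlocking: the handler re-acquires e.mu, and
	// sync.Mutex is not reentrant, so emitting under the lock would deadlock.
	e.emitReset()
	return true, nil
}

func main() {
	c := newSketchController("0xgenesis", "0xfinalized", false)
	skipped, err := c.maybeSkipELSync()
	fmt.Println(skipped, err, c.resets)
	skipped, err = c.maybeSkipELSync() // second call finds the status already changed
	fmt.Println(skipped, err, c.resets)
}
```

In this sketch, calling `emitReset` before `e.mu.Unlock()` would hang in `onReset`, which is the hazard the approach avoids by emitting outside the critical section.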

Scope notes

  • The original concern that motivated this investigation — FCU-reduction from feat(op-node): batch safe-head FCU calls to one per derived L1 block #19638 affecting syncing engines — reduces to this startup bug once you trace it. In CLSync mode, syncStatus never transitions into any EL-syncing state, so isEngineInitialELSyncing() is always false there and the FCU-reduction changes don't alter the gossip→FCU path. The real bug is the startup-stall in ELSync mode, which this PR addresses.
  • Not in scope here: engines that internally decide to snap-sync (e.g. op-reth with a blank DB) while op-node runs in CLSync mode. op-node has no direct signal for that today. A follow-up would need op-node to inspect engine FCU responses (SYNCING vs VALID) to track engine-initiated sync state.
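As a rough idea of what the follow-up mentioned above might look like, here is a hypothetical sketch of tracking engine-initiated sync from FCU response statuses. Nothing here exists in op-node today: `engineSyncTracker` and `observeFCU` are invented names, and the only grounded part is that Engine API forkchoice-updated responses carry a payload status such as `VALID`, `SYNCING`, or `INVALID`.

```go
package main

import "fmt"

// payloadStatus mirrors the status strings returned in Engine API FCU responses.
type payloadStatus string

const (
	statusValid   payloadStatus = "VALID"
	statusSyncing payloadStatus = "SYNCING"
	statusInvalid payloadStatus = "INVALID"
)

// engineSyncTracker is a hypothetical helper that remembers whether the engine
// last reported itself as syncing.
type engineSyncTracker struct {
	engineSyncing bool
}

// observeFCU updates the tracked state from one FCU response status and
// reports whether the state changed.
func (tr *engineSyncTracker) observeFCU(s payloadStatus) bool {
	prev := tr.engineSyncing
	switch s {
	case statusSyncing:
		tr.engineSyncing = true
	case statusValid:
		tr.engineSyncing = false
	default:
		// INVALID and unknown statuses say nothing about sync progress,
		// so the tracked state is left unchanged.
	}
	return prev != tr.engineSyncing
}

func main() {
	var tracker engineSyncTracker
	for _, s := range []payloadStatus{statusValid, statusSyncing, statusSyncing, statusInvalid, statusValid} {
		changed := tracker.observeFCU(s)
		fmt.Printf("status=%s engineSyncing=%v changed=%v\n", s, tracker.engineSyncing, changed)
	}
}
```

How such a signal would feed back into syncStatus in CLSync mode is an open design question that this sketch does not attempt to answer.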

Test plan

  • go build ./op-node/... — clean
  • go vet ./op-node/... — clean
  • go test ./op-node/rollup/engine/... — pass (new TestMaybeSkipELSyncIfEngineAlreadySynced with four sub-tests)
  • go test ./op-node/rollup/{driver,derive,attributes,sequencing,finality}/... — pass
  • op-e2e/actions/sync/... — requires forge-artifacts build; deferred to CI

Fixes #18468

🤖 Generated with Claude Code

When op-node is configured with --syncmode=execution-layer, NewEngineController
unconditionally initializes syncStatus to syncStatusWillStartEL. After a
restart against an already-synced engine, op-node stalls there indefinitely:
SyncStep backs off on isEngineInitialELSyncing(), and the only place that
transitions out of syncStatusWillStartEL is insertUnsafePayload — which
requires a fresh unsafe payload that may not arrive (the sequencer gossips
blocks the engine already has).

Add MaybeSkipELSyncIfEngineAlreadySynced: a startup guard that queries the
engine's finalized head while in syncStatusWillStartEL. If the engine is
already synced (non-genesis finalized head, and
SupportsPostFinalizationELSync is not set), transition directly to
syncStatusFinishedEL and emit ResetEngineRequestEvent so op-node's
in-memory heads are populated via FindL2Heads. The emit happens after the
mutex is released to avoid re-entering OnEvent under the lock.

Call it at the top of SyncStep before the EL-sync backoff — it's a no-op
once syncStatus has transitioned, so it runs at most a few times on
startup.

The finalized-head check that drives both startup probing and the
existing transition inside insertUnsafePayload is factored out into
checkEngineAlreadySynced.

Fixes #18468

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
sebastianst force-pushed the seb/fix/fcu-nudge-during-el-sync branch from e451773 to 90669a8 Compare April 17, 2026 13:37
sebastianst changed the title from "fix(op-node): nudge engine with FCU on gossip during initial EL sync" to "fix(op-node): probe engine finalized head on EL-sync startup" Apr 17, 2026

Development

Successfully merging this pull request may close these issues.

op-node stuck in EL sync mode after restart when reth already has synced data
