Skip to content

refactor(cardano): shard, reorder, and merge the EWRAP boundary pipeline#978

Draft
scarmuega wants to merge 12 commits intomainfrom
feat/shard-ewrap-work-units
Draft

refactor(cardano): shard, reorder, and merge the EWRAP boundary pipeline#978
scarmuega wants to merge 12 commits intomainfrom
feat/shard-ewrap-work-units

Conversation

@scarmuega
Copy link
Copy Markdown
Member

Summary

Restructures the EWRAP epoch-boundary pipeline across six commits:

  • Shard EWRAP into prepare/shard/finalize work units (948d472e, pre-existing) — initial split of the monolithic EWRAP into three phases.
  • Rename EwrapPrepareEwrap, seed EpochState.end in ESTART (9627c3ae) — drop the misleading "Prepare" name; ESTART's EpochTransition now opens the end slot.
  • Rename EwrapShardAccountShard (ae260792) — name the per-account work unit by what it processes, not its position in the pipeline.
  • Reorder: AccountShard runs before Ewrap (eaf8e08c) — settle per-account effects first, then run the global Ewrap visitors. EpochEndInit becomes a PATCH delta; EpochEndAccumulate accepts ewrap_progress = None as the natural shard-0 starting state.
  • Merge EwrapFinalize into Ewrap; drop EpochEndInit (a1e0d5cf) — collapse the redundant finalize phase into Ewrap; the wrap-up visitor now assembles the full EndStats (prepare-time fields + shard accumulators) and emits a single EpochWrapUp. One fewer state-machine state, one fewer delta type, one fewer commit cycle. Boundary close is now atomic in a single state-writer commit.
  • Seed EpochState.end in Genesis (5122e075) — restores the invariant that end = Some(...) whenever an AccountShard runs (the first boundary's shards otherwise hit a fresh genesis state with end = None).

Final pipeline: Estart → Roll … → Rupd → Roll … → AccountShard ×N → Ewrap (where Ewrap now both runs the global visitors and closes the boundary).

Test plan

  • cargo check -p dolos-cardano clean
  • cargo test -p dolos-cardano — all 98 unit tests pass (work-buffer state-machine, epoch delta proptests, etc.)
  • cargo build --tests — full workspace compiles
  • cargo test --test memory passes
  • cargo test --test epoch_pots --release — gold-standard end-to-end against DBSync ground truth (requires populated Mithril test instance)
  • Manual bootstrap from Mithril to verify the genesis seeding fix in 5122e075 resolved the original panic

🤖 Generated with Claude Code

scarmuega and others added 6 commits April 24, 2026 17:40
Partitions the epoch-boundary EWRAP work unit — previously one monolithic
work unit that materialised O(active_accounts) in memory (rewards map,
deltas, logs, applied_rewards) — into three phase-specific work units that
each commit independently:

* EwrapPrepare: global classification (pools/dreps/proposals), MIRs,
  enactment + refund visitors for non-account entities, emits EpochEndInit
  seeding EpochState.end with the prepare-time globals and zeroed reward
  accumulators.
* EwrapShard(i): range-scoped (first-byte prefix bucket) load of pending
  rewards + accounts, runs rewards + drops visitors per account, emits
  EpochEndAccumulate with the shard's reward contribution.
* EwrapFinalize: reads the accumulated EpochState.end, emits EpochWrapUp
  (which transitions rolling/pparams snapshots and clears ewrap_progress).

Cross-shard handoff piggy-backs on EpochState rather than a new entity:
ewrap_progress: Option<u32> is the durable cursor and EpochState.end
accumulates across shards via the new deltas.

WorkBuffer gains EwrapShardingBoundary{shard_index, total_shards} and
EwrapFinaliseBoundary states; pop_work now takes ewrap_total_shards from
CardanoConfig (default 16). EpochEndAccumulate has an idempotency guard
keyed on ewrap_progress so shard re-execution after a crash is safe.

Detection-only crash recovery at initialize time logs a warning when
ewrap_progress is set; full block-rehydration resume is flagged as TODO.

Memory tests in tests/memory.rs verify both fjall and redb3 honour
range-scoped iter_entities with O(1) heap — the load-bearing property for
the shard design.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…n ESTART

Decouple two responsibilities that were tangled in EwrapPrepareWorkUnit:
the global epoch-boundary entity processing (now plain `Ewrap`) and the
structural opening of the `EpochState.end` slot (now done by ESTART's
`EpochTransition`). Ewrap's `EpochEndInit` delta keeps its overwrite
semantics; it now writes into a default-seeded slot rather than from
None.

Also adds `prev_end` / `prev_ewrap_progress` undo fields to
`EpochTransition` (serialized, like the other prev_* fields) so a
rollback after restart correctly restores them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The per-account leg of the epoch close was named after its position in the
EWRAP pipeline; AccountShard names what it actually does — apply rewards
and pool/drep delegator drops over a key-range slice of the account
namespace.

Also renames the related symbols (BoundaryWork::load_shard /
commit_shard → load_account_shard / commit_account_shard,
WorkBuffer::EwrapShardingBoundary → AccountShardingBoundary,
InternalWorkUnit::EwrapShard → AccountShard). The user-facing
`ewrap_total_shards` config field is intentionally preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…line

The epoch-boundary sequence is now AccountShard ×N → Ewrap → EwrapFinalize
(was Ewrap → AccountShard ×N → EwrapFinalize). Per-account work settles
first; the global Ewrap phase then patches the prepare-time fields onto an
EpochState.end that already has its reward accumulators populated.

State machine: WorkBuffer::pop_work transitions reordered, and
on_ewrap_boundary now takes ewrap_total_shards so the restart-at-boundary
entry can construct AccountShardingBoundary directly. The
total_shards == 0 defensive branch now skips to EwrapBoundary (global
phase) instead of EwrapFinaliseBoundary.

Delta semantics:
- EpochEndInit::apply is now a PATCH — writes only the prepare-time
  fields (pool counts, epoch_incentives, MIR amounts, proposal refunds)
  and leaves the accumulator fields alone. ewrap_progress is no longer
  touched by this delta. Dropped the unused prev_ewrap_progress field.
- EpochEndAccumulate::apply treats ewrap_progress = None as the natural
  starting state for shard 0 (unwrap_or(0) as the expected cursor).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…delta

The boundary close is now a single Ewrap work unit: it runs the global
visitors AND emits EpochWrapUp carrying the assembled final EndStats
(prepare-time fields combined with the AccountShard-populated accumulator
fields). The wrap-up visitor now constructs the final stats locally
instead of routing them through a separate EpochEndInit delta.

Side-benefits: one fewer state-machine state, one fewer delta type, one
fewer commit cycle. Atomicity also improves — the boundary close is now
a single state-writer commit, so a crash between Ewrap and EwrapFinalize
is no longer possible.

Test fixture in tests/epoch_pots/main.rs restructured to match the
post-reorder pipeline: accumulator reset gates on AccountShard
shard_index == 0; rewards CSV is dumped on the Ewrap arm.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After the boundary-pipeline reorder (AccountShard runs before Ewrap),
the first epoch's AccountShard hits `EpochEndAccumulate::apply` with
`entity.end == None` because Genesis bootstrapped the EpochState before
ESTART's `EpochTransition` had a chance to seed the slot. Seed
`end = Some(EndStats::default())` directly in Genesis to match the
invariant ESTART maintains for every subsequent epoch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 25, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 28803e99-3c54-496c-aeff-35815db3bd08

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/shard-ewrap-work-units

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

scarmuega and others added 6 commits April 25, 2026 11:51
Pulls the AccountShard work unit out of `ewrap/` into a peer module
`ashard/` (matching the layout of `estart/`, `rupd/`, `roll/`,
`genesis/`). The shared `BoundaryWork` / `BoundaryVisitor` infrastructure
and the drops visitor (used by both phases) stay in `ewrap/`; `ashard/`
imports them.

Moves: `rewards.rs`, `shard.rs`, `AccountShardWorkUnit` (from
`work_unit.rs`), and the `load_*` / `commit_*` impl blocks. Visibility
on shared `BoundaryWork` helpers (`new_empty`, `load_pool_data`,
`load_drep_data`, `stream_and_apply_namespace`) widened from private to
`pub(crate)`. The `ending_state` field also widened to `pub(crate)` so
peer modules can mutate it (e.g. `wrapup.flush` already does this).

Method/identifier renames to match the new module path:
- `BoundaryWork::load_account_shard` → `load_ashard`
- `BoundaryWork::commit_account_shard` → `commit_ashard`
- `WorkUnit::name()` returns `"ashard"`

Type name `AccountShardWorkUnit` is preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mments

Sweeps the docstrings/comments touched in this PR for references to
phases, work units, and deltas that no longer exist after the rename /
reorder / merge / split sequence:

- Restore the in-place explanation for the "rewards before drops"
  HACK in `ashard/loading.rs` (the dangling "see comment on the
  pre-shard path" pointed to a comment that was deleted when the
  prepare phase was removed).
- Drop "prepare phase" / "finalize phase" wording from `BoundaryWork`
  field docstrings, `commit_ewrap` comments, and `loading.rs` section
  dividers — neither phase exists; there's only Ewrap (global + close)
  and AccountShard (per-account).
- Update the ESTART `EpochTransition` description in `work_units.md`
  so it reflects the post-merge data flow: AccountShards populate the
  accumulators directly, then Ewrap reads them back and emits
  `EpochWrapUp` with the final `EndStats` (no `EpochEndInit` patch step
  anymore).
- Rename `compute_prepare_deltas` → `compute_ewrap_deltas`. The
  "prepare" name was a leftover from the `EwrapPrepare` work unit; the
  method is now the only Ewrap-phase compute helper.
- Tighten `load_pending_rewards_range` docstring; flag that the `None`
  branch is currently unused.

No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The shard-related identifiers and comments were named after the legacy
EWRAP pipeline that bundled the global epoch-boundary work and the
per-account shards together. With AccountShard now a distinct work unit
in its own module, those names are misleading. Rename to use the
`ashard` prefix consistently with the module path:

- `CardanoConfig::ewrap_total_shards` → `ashard_total`
- `CardanoConfig::DEFAULT_EWRAP_TOTAL_SHARDS` → `DEFAULT_ASHARD_TOTAL`
- `EpochState::ewrap_progress` → `ashard_progress`
- `prev_ewrap_progress` → `prev_ashard_progress` on `EpochEndAccumulate`,
  `EpochWrapUp`, and `EpochTransition`
- `WorkBuffer::receive_block` / `on_ewrap_boundary` / `pop_work`
  parameter `ewrap_total_shards` → `ashard_total`
- Error messages in `ashard/shard.rs` updated to match.

Also fixes comment / doc misattributions where "EWRAP" was used for
work that's now in `AccountShard`:
- `PendingRewardState` / `DequeueReward` are consumed by `AccountShard`,
  not Ewrap.
- `PendingMirState` / `DequeueMir` are consumed by Ewrap (clarified).
- `AppliedReward` and the `applied_rewards` field are populated during
  AccountShard, not Ewrap.
- RUPD's docstring now says rewards are consumed by `AccountShard`.
- Crash-recovery wording in `lib.rs` says "mid-boundary" instead of
  "mid-EWRAP" since the cursor specifically tracks AccountShard
  progress.

BREAKING CONFIG CHANGE: existing `dolos.toml` files that explicitly set
`ewrap_total_shards` need to rename the key to `ashard_total`. Users
relying on the default (omitted) are unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Aligns the type and variant names with the module path convention:

- struct `AccountShardWorkUnit` → `AShardWorkUnit`
- enum variant `CardanoWorkUnit::AccountShard` → `AShard`
- enum variant `InternalWorkUnit::AccountShard` → `AShard`
- WorkBuffer state `AccountShardingBoundary` → `AShardingBoundary`
- module re-export and all callers updated to match
- prose / docstrings / log messages also use `AShard` consistently

The module path is `crate::ashard`, so the type now reads as
`ashard::AShardWorkUnit`. No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per review feedback: the user-facing config name should be self-explanatory
in `dolos.toml`. Renames everywhere for consistency:

- `CardanoConfig::ashard_total` field → `account_shards`
- `CardanoConfig::ashard_total()` accessor → `account_shards()`
- `CardanoConfig::DEFAULT_ASHARD_TOTAL` → `DEFAULT_ACCOUNT_SHARDS`
- WorkBuffer parameters and error messages updated to match.

BREAKING CONFIG CHANGE: existing `dolos.toml` files that explicitly set
this option (under any prior name from this PR) need to use
`account_shards`. Users relying on the default are unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Guards against a config change to `account_shards` corrupting an
in-flight boundary. Previously, if dolos crashed mid-boundary and the
operator changed `account_shards` between crash and restart, the resume
would re-partition the account key space with the new count, mismatching
the cursor's already-committed shards.

Fix: snapshot the boundary's shard count into state at the first
`EpochEndAccumulate` apply. The persisted total is authoritative for the
duration of the in-flight boundary; the new config value only takes
effect on the next boundary.

Changes:
- New `AShardProgress { committed, total }` struct stored at
  `EpochState.ashard_progress: Option<AShardProgress>` (was
  `Option<u32>`).
- `EpochEndAccumulate` carries `total_shards`. Its apply validates the
  delta's `total_shards` matches any previously persisted total and
  surfaces an error if they diverge (would only happen if a work unit
  was constructed with a stale config view).
- `EpochWrapUp` and `EpochTransition` undo fields adapted to the new
  type.
- `AShardWorkUnit::load` / `commit_state` read the persisted total when
  present and fall back to `config.account_shards()` for fresh
  boundaries.
- `CardanoLogic` caches `effective_account_shards` (= persisted total
  if a boundary is in flight, else config). Refreshed at every
  `pop_work` call so `receive_block` (which has no state access) can
  use the up-to-date value when constructing
  `WorkBuffer::AShardingBoundary`.
- Crash-recovery wording updated to surface a clear warning when the
  persisted total disagrees with current config.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant