refactor(cardano): shard, reorder, and merge the EWRAP boundary pipeline#978
Draft
Partitions the epoch-boundary EWRAP work unit — previously one monolithic
work unit that materialised O(active_accounts) in memory (rewards map,
deltas, logs, applied_rewards) — into three phase-specific work units that
each commit independently:
* EwrapPrepare: global classification (pools/dreps/proposals), MIRs,
enactment + refund visitors for non-account entities, emits EpochEndInit
seeding EpochState.end with the prepare-time globals and zeroed reward
accumulators.
* EwrapShard(i): range-scoped (first-byte prefix bucket) load of pending
rewards + accounts, runs rewards + drops visitors per account, emits
EpochEndAccumulate with the shard's reward contribution.
* EwrapFinalize: reads the accumulated EpochState.end, emits EpochWrapUp
(which transitions rolling/pparams snapshots and clears ewrap_progress).
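The "first-byte prefix bucket" partitioning used by the shard phase can be illustrated with a minimal sketch. The helper names `shard_first_byte_range` and `shard_for_key` are hypothetical and not taken from this PR; the sketch only assumes that the 256 possible first-byte values are split into contiguous buckets, one per shard.

```rust
use std::ops::Range;

/// Hypothetical helper (not from the PR): the half-open range of first-byte
/// values owned by `shard_index`. The bounds are `u16` so that 256 can be
/// expressed as an exclusive upper limit.
fn shard_first_byte_range(shard_index: u32, total_shards: u32) -> Range<u16> {
    assert!((1..=256).contains(&total_shards), "total_shards must be in 1..=256");
    assert!(shard_index < total_shards, "shard_index out of range");
    let start = (shard_index * 256 / total_shards) as u16;
    let end = ((shard_index + 1) * 256 / total_shards) as u16;
    start..end
}

/// Hypothetical helper: which shard owns a key, judged by its first byte.
/// Returns `None` for an empty key.
fn shard_for_key(key: &[u8], total_shards: u32) -> Option<u32> {
    let first = u16::from(*key.first()?);
    (0..total_shards).find(|&i| shard_first_byte_range(i, total_shards).contains(&first))
}
```

With the default of 16 shards, each bucket in this sketch covers 16 consecutive first-byte values, so every key lands in exactly one shard and each shard can load its slice with a single range scan.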
Cross-shard handoff piggy-backs on EpochState rather than a new entity:
ewrap_progress: Option<u32> is the durable cursor and EpochState.end
accumulates across shards via the new deltas.
WorkBuffer gains EwrapShardingBoundary{shard_index, total_shards} and
EwrapFinaliseBoundary states; pop_work now takes ewrap_total_shards from
CardanoConfig (default 16). EpochEndAccumulate has an idempotency guard
keyed on ewrap_progress so shard re-execution after a crash is safe.
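A minimal sketch of how an idempotency guard keyed on the cursor might look. The types below are simplified stand-ins, the `effective_rewards` field is an assumed name, and skipping silently on a cursor mismatch is an assumption of this sketch rather than the PR's confirmed behaviour.

```rust
/// Simplified stand-ins for the PR's types; field names are assumptions.
#[derive(Debug, Default, Clone, PartialEq)]
struct EndStats {
    effective_rewards: u64,
}

#[derive(Debug, Default)]
struct EpochState {
    end: Option<EndStats>,
    ewrap_progress: Option<u32>,
}

struct EpochEndAccumulate {
    shard_index: u32,
    effective_rewards: u64,
}

impl EpochEndAccumulate {
    /// Returns `true` if the contribution was applied and `false` if it was
    /// skipped because the cursor shows this shard is not the next one due.
    fn apply(&self, state: &mut EpochState) -> bool {
        // The cursor holds the index of the next shard expected to commit.
        let expected = state.ewrap_progress.unwrap_or(0);
        if self.shard_index != expected {
            return false; // replayed (or out-of-order) shard: leave state untouched
        }
        let end = state.end.get_or_insert_with(EndStats::default);
        end.effective_rewards += self.effective_rewards;
        state.ewrap_progress = Some(self.shard_index + 1);
        true
    }
}
```

Because the cursor advances in the same apply that adds the shard's contribution, replaying a shard after a crash cannot double-count its rewards in this sketch.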
Detection-only crash recovery at initialize time logs a warning when
ewrap_progress is set; full block-rehydration resume is flagged as TODO.
Memory tests in tests/memory.rs verify both fjall and redb3 honour
range-scoped iter_entities with O(1) heap — the load-bearing property for
the shard design.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…n ESTART
Decouple two responsibilities that were tangled in EwrapPrepareWorkUnit: the global epoch-boundary entity processing (now plain `Ewrap`) and the structural opening of the `EpochState.end` slot (now done by ESTART's `EpochTransition`). Ewrap's `EpochEndInit` delta keeps its overwrite semantics; it now writes into a default-seeded slot rather than from None.
Also adds `prev_end` / `prev_ewrap_progress` undo fields to `EpochTransition` (serialized, like the other prev_* fields) so a rollback after restart correctly restores them.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The per-account leg of the epoch close was named after its position in the EWRAP pipeline; AccountShard names what it actually does: apply rewards and pool/drep delegator drops over a key-range slice of the account namespace.
Also renames the related symbols (BoundaryWork::load_shard / commit_shard → load_account_shard / commit_account_shard, WorkBuffer::EwrapShardingBoundary → AccountShardingBoundary, InternalWorkUnit::EwrapShard → AccountShard). The user-facing `ewrap_total_shards` config field is intentionally preserved.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…line
The epoch-boundary sequence is now AccountShard ×N → Ewrap → EwrapFinalize (was Ewrap → AccountShard ×N → EwrapFinalize). Per-account work settles first; the global Ewrap phase then patches the prepare-time fields onto an EpochState.end that already has its reward accumulators populated.
State machine: WorkBuffer::pop_work transitions reordered, and on_ewrap_boundary now takes ewrap_total_shards so the restart-at-boundary entry can construct AccountShardingBoundary directly. The total_shards == 0 defensive branch now skips to EwrapBoundary (global phase) instead of EwrapFinaliseBoundary.
Delta semantics:
- EpochEndInit::apply is now a PATCH: it writes only the prepare-time fields (pool counts, epoch_incentives, MIR amounts, proposal refunds) and leaves the accumulator fields alone. ewrap_progress is no longer touched by this delta. Dropped the unused prev_ewrap_progress field.
- EpochEndAccumulate::apply treats ewrap_progress = None as the natural starting state for shard 0 (unwrap_or(0) as the expected cursor).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
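The reordered boundary sequence can be sketched as a small state machine. The `Boundary` enum and the `next_boundary` function below are illustrative stand-ins, not the actual `WorkBuffer` states or `pop_work` logic; only the ordering and the `total_shards == 0` fallback are taken from the commit message above.

```rust
/// Illustrative stand-in for the boundary portion of `WorkBuffer`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Boundary {
    AccountSharding { shard_index: u32, total_shards: u32 },
    Ewrap,
    EwrapFinalise,
    Done,
}

/// Entry point at the epoch boundary. With zero shards configured, the
/// defensive branch skips straight to the global Ewrap phase.
fn on_ewrap_boundary(total_shards: u32) -> Boundary {
    if total_shards == 0 {
        Boundary::Ewrap
    } else {
        Boundary::AccountSharding { shard_index: 0, total_shards }
    }
}

/// Hypothetical transition function: per-account shards first, then the
/// global phase, then finalize.
fn next_boundary(state: Boundary) -> Boundary {
    match state {
        Boundary::AccountSharding { shard_index, total_shards }
            if shard_index + 1 < total_shards =>
        {
            Boundary::AccountSharding { shard_index: shard_index + 1, total_shards }
        }
        Boundary::AccountSharding { .. } => Boundary::Ewrap,
        Boundary::Ewrap => Boundary::EwrapFinalise,
        Boundary::EwrapFinalise | Boundary::Done => Boundary::Done,
    }
}
```

Carrying `total_shards` inside the sharding state is what lets a restart-at-boundary entry construct the first shard state directly, as described above.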
…delta
The boundary close is now a single Ewrap work unit: it runs the global visitors AND emits EpochWrapUp carrying the assembled final EndStats (prepare-time fields combined with the AccountShard-populated accumulator fields). The wrap-up visitor now constructs the final stats locally instead of routing them through a separate EpochEndInit delta.
Side-benefits: one fewer state-machine state, one fewer delta type, one fewer commit cycle. Atomicity also improves: the boundary close is now a single state-writer commit, so a crash between Ewrap and EwrapFinalize is no longer possible.
Test fixture in tests/epoch_pots/main.rs restructured to match the post-reorder pipeline: accumulator reset gates on AccountShard shard_index == 0; rewards CSV is dumped on the Ewrap arm.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
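As a rough illustration of "constructs the final stats locally", the sketch below combines global (prepare-time) values with already-accumulated shard values. All field names, the `GlobalStats` struct, and `assemble_end_stats` are assumptions made for this example and are not the PR's actual definitions.

```rust
/// Simplified stand-in for `EndStats`; field names are assumptions.
#[derive(Debug, Clone, Default, PartialEq)]
struct EndStats {
    // Prepare-time (global) fields, computed by the Ewrap visitors.
    pool_count: u64,
    proposal_refunds: u64,
    // Accumulator fields, populated shard by shard before Ewrap runs.
    effective_rewards: u64,
    unspendable_rewards: u64,
}

/// Hypothetical container for the values computed in the global phase.
struct GlobalStats {
    pool_count: u64,
    proposal_refunds: u64,
}

/// Builds the final stats: global fields come from the Ewrap phase, and
/// everything else is carried over from the accumulated state.
fn assemble_end_stats(accumulated: &EndStats, globals: &GlobalStats) -> EndStats {
    EndStats {
        pool_count: globals.pool_count,
        proposal_refunds: globals.proposal_refunds,
        ..accumulated.clone()
    }
}
```

In this sketch the accumulator fields are never recomputed; they pass through unchanged, which mirrors the idea that the per-account work has already settled by the time the global phase runs.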
After the boundary-pipeline reorder (AccountShard runs before Ewrap), the first epoch's AccountShard hits `EpochEndAccumulate::apply` with `entity.end == None` because Genesis bootstrapped the EpochState before ESTART's `EpochTransition` had a chance to seed the slot. Seed `end = Some(EndStats::default())` directly in Genesis to match the invariant ESTART maintains for every subsequent epoch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pulls the AccountShard work unit out of `ewrap/` into a peer module `ashard/` (matching the layout of `estart/`, `rupd/`, `roll/`, `genesis/`). The shared `BoundaryWork` / `BoundaryVisitor` infrastructure and the drops visitor (used by both phases) stay in `ewrap/`; `ashard/` imports them.
Moves: `rewards.rs`, `shard.rs`, `AccountShardWorkUnit` (from `work_unit.rs`), and the `load_*` / `commit_*` impl blocks.
Visibility on shared `BoundaryWork` helpers (`new_empty`, `load_pool_data`, `load_drep_data`, `stream_and_apply_namespace`) widened from private to `pub(crate)`. The `ending_state` field also widened to `pub(crate)` so peer modules can mutate it (e.g. `wrapup.flush` already does this).
Method/identifier renames to match the new module path:
- `BoundaryWork::load_account_shard` → `load_ashard`
- `BoundaryWork::commit_account_shard` → `commit_ashard`
- `WorkUnit::name()` returns `"ashard"`
Type name `AccountShardWorkUnit` is preserved.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mments
Sweeps the docstrings/comments touched in this PR for references to phases, work units, and deltas that no longer exist after the rename / reorder / merge / split sequence:
- Restore the in-place explanation for the "rewards before drops" HACK in `ashard/loading.rs` (the dangling "see comment on the pre-shard path" pointed to a comment that was deleted when the prepare phase was removed).
- Drop "prepare phase" / "finalize phase" wording from `BoundaryWork` field docstrings, `commit_ewrap` comments, and `loading.rs` section dividers: neither phase exists; there's only Ewrap (global + close) and AccountShard (per-account).
- Update the ESTART `EpochTransition` description in `work_units.md` so it reflects the post-merge data flow: AccountShards populate the accumulators directly, then Ewrap reads them back and emits `EpochWrapUp` with the final `EndStats` (no `EpochEndInit` patch step anymore).
- Rename `compute_prepare_deltas` → `compute_ewrap_deltas`. The "prepare" name was a leftover from the `EwrapPrepare` work unit; the method is now the only Ewrap-phase compute helper.
- Tighten `load_pending_rewards_range` docstring; flag that the `None` branch is currently unused.
No behavior change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The shard-related identifiers and comments were named after the legacy EWRAP pipeline that bundled the global epoch-boundary work and the per-account shards together. With AccountShard now a distinct work unit in its own module, those names are misleading. Rename to use the `ashard` prefix consistently with the module path:
- `CardanoConfig::ewrap_total_shards` → `ashard_total`
- `CardanoConfig::DEFAULT_EWRAP_TOTAL_SHARDS` → `DEFAULT_ASHARD_TOTAL`
- `EpochState::ewrap_progress` → `ashard_progress`
- `prev_ewrap_progress` → `prev_ashard_progress` on `EpochEndAccumulate`, `EpochWrapUp`, and `EpochTransition`
- `WorkBuffer::receive_block` / `on_ewrap_boundary` / `pop_work` parameter `ewrap_total_shards` → `ashard_total`
- Error messages in `ashard/shard.rs` updated to match.
Also fixes comment / doc misattributions where "EWRAP" was used for work that's now in `AccountShard`:
- `PendingRewardState` / `DequeueReward` are consumed by `AccountShard`, not Ewrap.
- `PendingMirState` / `DequeueMir` are consumed by Ewrap (clarified).
- `AppliedReward` and the `applied_rewards` field are populated during AccountShard, not Ewrap.
- RUPD's docstring now says rewards are consumed by `AccountShard`.
- Crash-recovery wording in `lib.rs` says "mid-boundary" instead of "mid-EWRAP" since the cursor specifically tracks AccountShard progress.
BREAKING CONFIG CHANGE: existing `dolos.toml` files that explicitly set `ewrap_total_shards` need to rename the key to `ashard_total`. Users relying on the default (omitted) are unaffected.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Aligns the type and variant names with the module path convention:
- struct `AccountShardWorkUnit` → `AShardWorkUnit`
- enum variant `CardanoWorkUnit::AccountShard` → `AShard`
- enum variant `InternalWorkUnit::AccountShard` → `AShard`
- WorkBuffer state `AccountShardingBoundary` → `AShardingBoundary`
- module re-export and all callers updated to match
- prose / docstrings / log messages also use `AShard` consistently
The module path is `crate::ashard`, so the type now reads as `ashard::AShardWorkUnit`. No behavior change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per review feedback: the user-facing config name should be self-explanatory in `dolos.toml`. Renames everywhere for consistency:
- `CardanoConfig::ashard_total` field → `account_shards`
- `CardanoConfig::ashard_total()` accessor → `account_shards()`
- `CardanoConfig::DEFAULT_ASHARD_TOTAL` → `DEFAULT_ACCOUNT_SHARDS`
- WorkBuffer parameters and error messages updated to match.
BREAKING CONFIG CHANGE: existing `dolos.toml` files that explicitly set this option (under any prior name from this PR) need to use `account_shards`. Users relying on the default are unaffected.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
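A minimal sketch of how the renamed field, accessor, and default constant could fit together. Representing the field as `Option<u32>` is an assumption made here to model "omitted in `dolos.toml`"; the value 16 is the default stated in the first commit message of this PR.

```rust
/// Simplified stand-in for `CardanoConfig`; the `Option<u32>` representation
/// is an assumption of this sketch.
#[derive(Debug, Default)]
struct CardanoConfig {
    account_shards: Option<u32>,
}

impl CardanoConfig {
    /// Default shard count, as stated earlier in this PR.
    const DEFAULT_ACCOUNT_SHARDS: u32 = 16;

    /// Returns the configured shard count, or the default when the key is
    /// omitted from the config file.
    fn account_shards(&self) -> u32 {
        self.account_shards.unwrap_or(Self::DEFAULT_ACCOUNT_SHARDS)
    }
}
```

Under this shape, users who never set the key keep getting the default, which is consistent with the note that only explicitly configured files are affected by the rename.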
Guards against a config change to `account_shards` corrupting an
in-flight boundary. Previously, if dolos crashed mid-boundary and the
operator changed `account_shards` between crash and restart, the resume
would re-partition the account key space with the new count, mismatching
the cursor's already-committed shards.
Fix: snapshot the boundary's shard count into state at the first
`EpochEndAccumulate` apply. The persisted total is authoritative for the
duration of the in-flight boundary; the new config value only takes
effect on the next boundary.
Changes:
- New `AShardProgress { committed, total }` struct stored at
`EpochState.ashard_progress: Option<AShardProgress>` (was
`Option<u32>`).
- `EpochEndAccumulate` carries `total_shards`. Its apply validates the
delta's `total_shards` matches any previously persisted total and
surfaces an error if they diverge (would only happen if a work unit
was constructed with a stale config view).
- `EpochWrapUp` and `EpochTransition` undo fields adapted to the new
type.
- `AShardWorkUnit::load` / `commit_state` read the persisted total when
present and fall back to `config.account_shards()` for fresh
boundaries.
- `CardanoLogic` caches `effective_account_shards` (= persisted total
if a boundary is in flight, else config). Refreshed at every
`pop_work` call so `receive_block` (which has no state access) can
use the up-to-date value when constructing
`WorkBuffer::AShardingBoundary`.
- Crash-recovery wording updated to surface a clear warning when the
persisted total disagrees with current config.
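The changes listed above can be sketched as follows. `AShardProgress { committed, total }` mirrors the struct named in this commit; the free functions, the `TotalMismatch` error type, and the choice to skip silently on a cursor mismatch are assumptions of the sketch, not the PR's actual code.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct AShardProgress {
    committed: u32,
    total: u32,
}

/// Hypothetical error for a delta built with a stale config view.
#[derive(Debug, PartialEq, Eq)]
struct TotalMismatch {
    persisted: u32,
    delta: u32,
}

/// The persisted total wins while a boundary is in flight; otherwise the
/// config value is used.
fn effective_account_shards(progress: Option<AShardProgress>, config_total: u32) -> u32 {
    progress.map_or(config_total, |p| p.total)
}

/// Sketch of the accumulate step: validate the total, apply the idempotency
/// guard, then advance the cursor and snapshot the total.
fn apply_accumulate(
    progress: &mut Option<AShardProgress>,
    shard_index: u32,
    total_shards: u32,
) -> Result<bool, TotalMismatch> {
    if let Some(p) = *progress {
        if p.total != total_shards {
            return Err(TotalMismatch { persisted: p.total, delta: total_shards });
        }
    }
    let expected = (*progress).map_or(0, |p| p.committed);
    if shard_index != expected {
        return Ok(false); // shard is not the next one due: skip without changes
    }
    *progress = Some(AShardProgress { committed: shard_index + 1, total: total_shards });
    Ok(true)
}
```

The first successful apply is what persists `total`, so a config change made between a crash and a restart is only picked up once the in-flight boundary has completed and the progress field has been cleared.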
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Restructures the EWRAP epoch-boundary pipeline across six commits:
- (948d472e, pre-existing): initial split of the monolithic EWRAP into three phases.
- `EwrapPrepare` → `Ewrap`, seed `EpochState.end` in ESTART (9627c3ae): drop the misleading "Prepare" name; ESTART's `EpochTransition` now opens the `end` slot.
- `EwrapShard` → `AccountShard` (ae260792): name the per-account work unit by what it processes, not its position in the pipeline.
- `AccountShard` runs before `Ewrap` (eaf8e08c): settle per-account effects first, then run the global Ewrap visitors. `EpochEndInit` becomes a PATCH delta; `EpochEndAccumulate` accepts `ewrap_progress = None` as the natural shard-0 starting state.
- Merge `EwrapFinalize` into `Ewrap`; drop `EpochEndInit` (a1e0d5cf): collapse the redundant finalize phase into Ewrap; the wrap-up visitor now assembles the full `EndStats` (prepare-time fields + shard accumulators) and emits a single `EpochWrapUp`. One fewer state-machine state, one fewer delta type, one fewer commit cycle. Boundary close is now atomic in a single state-writer commit.
- Seed `EpochState.end` in Genesis (5122e075): restores the invariant that `end = Some(...)` whenever an `AccountShard` runs (the first boundary's shards otherwise hit a fresh genesis state with `end = None`).
Final pipeline:
`Estart → Roll … → Rupd → Roll … → AccountShard ×N → Ewrap` (where Ewrap now both runs the global visitors and closes the boundary).
Test plan
- `cargo check -p dolos-cardano` clean
- `cargo test -p dolos-cardano`: all 98 unit tests pass (work-buffer state-machine, epoch delta proptests, etc.)
- `cargo build --tests`: full workspace compiles
- `cargo test --test memory` passes
- `cargo test --test epoch_pots --release`: gold-standard end-to-end against DBSync ground truth (requires populated Mithril test instance)
- 5122e075 resolved the original panic
🤖 Generated with Claude Code