[DO NOT MERGE] Release v0.47.4 #3267

Draft
Dentosal wants to merge 7 commits into release/v0.47.3 from release/v0.47.4

@Dentosal Dentosal commented Apr 17, 2026

Version 0.47.4

Changed

  • 3138: Migrate CI from BuildJet to WarpBuild runners, update GitHub Actions to latest versions, and use pre-built binaries for cargo-nextest and cargo-audit.
  • 3203: Add lease port for PoA adapter to allow multiple producers to be live but only one leader.
  • 3225: PoA quorum and HA failover fixes: Redis leader lease adapter improvements, write_block.lua HEIGHT_EXISTS check, sub-quorum block repair, Prometheus metrics, and chaos test harness.

Fixed

  • 3124: Use Debian Bookworm as the runtime base image for Docker builds. This is the same base image as the Rust builder images. Keeping the images in sync helps prevent runtime dependency mismatches.
  • 3264: Rollback stale preconfirmations in the mempool when the canonical block at that height omits the preconfirmed transactions, restoring spent inputs and removing dependent transactions.

Dentosal and others added 2 commits April 17, 2026 15:23
Closes #3098.

When a block producer sends preconfirmation updates, sentry nodes
optimistically treat the included transactions as committed, removing
them from the mempool and marking their inputs as spent. If the producer
crashes and re-produces a block at the same height without those
transactions, the mempool is left in a stale state: inputs stay marked
as spent and outputs linger in `extracted_outputs`, preventing
re-submission of rolled-back transactions and causing dependents to
reference non-existent UTXOs.

This PR makes preconfirmed transactions tentative until the canonical
block at their height is imported. On import, preconfirmed txs present
in the block are confirmed and their tracking is cleared; those absent
are rolled back by restoring inputs, purging dependents, and emitting
`SqueezedOut` notifications. It also adds integration tests:
re-insertion after rollback, dependent eviction, normal confirmation,
and stale-height cleanup.
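
The confirm-or-roll-back step described above can be sketched as follows. This is a minimal illustration, not the code from this PR: `TentativeState`, `Preconfirmed`, `preconfirm`, and `on_block_imported` are hypothetical names, the id types are plain integers, and purging dependents and emitting `SqueezedOut` are reduced to returning the rolled-back ids.

```rust
use std::collections::{BTreeMap, HashSet};

type TxId = u32; // stand-in for the real transaction id type (assumption)
type UtxoId = u32; // stand-in for the real UTXO id type (assumption)

/// A preconfirmed transaction together with the inputs it tentatively spent.
struct Preconfirmed {
    tx_id: TxId,
    inputs: Vec<UtxoId>,
}

/// Hypothetical, simplified view of the mempool state involved.
#[derive(Default)]
struct TentativeState {
    /// Preconfirmed transactions awaiting the canonical block, keyed by height.
    by_height: BTreeMap<u32, Vec<Preconfirmed>>,
    /// Inputs currently treated as spent.
    spent_inputs: HashSet<UtxoId>,
}

impl TentativeState {
    /// Record a preconfirmation: inputs are marked spent, but only tentatively.
    fn preconfirm(&mut self, height: u32, tx_id: TxId, inputs: Vec<UtxoId>) {
        self.spent_inputs.extend(inputs.iter().copied());
        self.by_height
            .entry(height)
            .or_default()
            .push(Preconfirmed { tx_id, inputs });
    }

    /// Handle import of the canonical block at `height`.
    /// Returns the ids of rolled-back transactions, i.e. the ones that
    /// would receive a `SqueezedOut` notification.
    fn on_block_imported(&mut self, height: u32, block_txs: &HashSet<TxId>) -> Vec<TxId> {
        let mut rolled_back = Vec::new();
        // Tracking for this height is cleared whether or not the txs landed.
        for tx in self.by_height.remove(&height).unwrap_or_default() {
            if block_txs.contains(&tx.tx_id) {
                continue; // confirmed: inputs stay spent
            }
            // Absent from the canonical block: restore its inputs.
            for input in &tx.inputs {
                self.spent_inputs.remove(input);
            }
            rolled_back.push(tx.tx_id);
        }
        rolled_back
    }
}
```

With two transactions preconfirmed at height 5 and a canonical block that contains only the first, `on_block_imported` returns the second id and frees its inputs for re-submission.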

- [x] Breaking changes are clearly marked as such in the PR description
and changelog
- [x] New behavior is reflected in tests
- [x] [The specification](https://github.com/FuelLabs/fuel-specs/)
matches the implemented behavior (link update PR if changes are needed)

- [ ] I have reviewed the code myself
- [x] I have created follow-up issues caused by this PR and linked them
here
@Dentosal Dentosal self-assigned this Apr 17, 2026
@Dentosal Dentosal added the `pr release` label (used to trigger the GitHub action to update versions) Apr 17, 2026
cursor Bot commented Apr 17, 2026

PR Summary

Medium Risk
Touches PoA consensus reconciliation/sync state and txpool preconfirmation handling; regressions could impact block production or mempool correctness, though changes are targeted and heavily covered by new tests.

Overview
Bumps workspace/package versions to 0.47.4, updates CHANGELOG.md, and refreshes .changes entries; also expands cargo-audit ignore list for newly-tracked advisories.

Fixes PoA HA edge cases by (1) grouping reconciliation “votes” by block_id (not (epoch, block_id)) so identical blocks across nodes reach quorum even with mismatched epochs, (2) publishing blocks to Redis nodes in parallel, and (3) optimizing write_block.lua to stop scanning once older heights are reached.

Prevents a PoA leader deadlock after reconciliation import by introducing a shared reconciliation_watermark so SyncTask treats reconciliation-imported network blocks as safe and doesn’t transition Synced → NotSynced; adds regression tests for both PoA scenarios.

Adds txpool preconfirmation rollback: tracks tentative preconfirmations by block height, records “tentative” spent inputs, ignores late preconfirmations below the canonical tip, and on block import rolls back preconfirmed tx state (including dependent coin- and contract-based transactions) when the canonical block omits them, with a new dedicated test suite.

Reviewed by Cursor Bugbot for commit 6241f5c.

@Dentosal Dentosal changed the base branch from release/v0.47.3 to master April 17, 2026 12:36
@xgreenx xgreenx changed the base branch from master to release/v0.47.3 April 17, 2026 14:15
MitchTurner and others added 4 commits April 17, 2026 15:33
…same-block dea… (#3271)

…dlock (#3269)

## Summary

- Fixes a PoA reconciliation deadlock observed on devnet 2026-04-17
where the same block ended up on all 6 Redis nodes with three different
epochs, causing permanent livelock
- `unreconciled_blocks` now groups votes by `block_id` only, tracking
max epoch as a tiebreaker. Identical blocks written during re-promotion
storms count toward quorum.
- Added a regression test that reproduces the exact production error
string

## The bug

During re-promotion storms (two pods racing for leadership), the same
block can be written to different Redis nodes with different epochs. The
old vote grouping `(epoch, block_id)` fragmented these identical blocks
into separate vote groups:

```
Node state (same block_id, different epoch stamps):
  1a-0, 1a-1, 1b-1: epoch 268 → vote group A, count=3
  1b-0:            epoch 269 → vote group B, count=1
  1c-0, 1c-1:      epoch 270 → vote group C, count=2  ← max-epoch winner

Required quorum: 4.  Winner count: 2 → repair attempted.
Repair writes the winner to all 6 nodes → HEIGHT_EXISTS on every node
(each has SOME entry at that height) → Written=0 → total=2 < quorum.
Permanent livelock.
```

## The fix

Group by `block_id` alone; track max epoch per block_id as the
tiebreaker when block_ids genuinely differ:

```rust
// Before
HashMap::<(u64, BlockId), (usize, SealedBlock)>
vote_key = (*epoch, block.entity.id())
winner = max_by_key(epoch)

// After
HashMap::<BlockId, (u64, usize, SealedBlock)>  // (max_epoch, count, block)
vote_key = block.entity.id()
winner = max_by_key(max_epoch)
```

**Behavior change:**
- Same block with multiple epochs → single vote group → counts as a
single block on N nodes → reconciles directly without repair (this fixes
the deadlock)
- Genuinely different blocks at same height → picks higher-epoch block →
same behavior as before
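
To make the difference concrete, here is a small self-contained sketch of both groupings. It is an illustration under assumptions, not the PR's code: `BlockId` is a string stand-in, `Tally` and both function names are hypothetical, and the sealed block payload is omitted.

```rust
use std::collections::HashMap;

type BlockId = &'static str; // stand-in for the real block id type (assumption)

/// Tally for one block id: highest epoch seen and number of nodes holding it.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Tally {
    max_epoch: u64,
    count: usize,
}

/// New grouping: one vote group per block id, with max epoch as tiebreaker.
/// Each entry is the `(epoch, block_id)` pair read from one Redis node.
/// If two different block ids share the same max epoch, which one wins is
/// unspecified in this sketch.
fn winner_by_block_id(entries: &[(u64, BlockId)]) -> Option<(BlockId, Tally)> {
    let mut votes: HashMap<BlockId, Tally> = HashMap::new();
    for &(epoch, id) in entries {
        let tally = votes.entry(id).or_insert(Tally { max_epoch: epoch, count: 0 });
        tally.max_epoch = tally.max_epoch.max(epoch);
        tally.count += 1;
    }
    votes.into_iter().max_by_key(|(_, tally)| tally.max_epoch)
}

/// Old grouping, for comparison: identical blocks with different epochs
/// end up in separate vote groups.
fn winner_by_epoch_and_block_id(
    entries: &[(u64, BlockId)],
) -> Option<((u64, BlockId), usize)> {
    let mut votes: HashMap<(u64, BlockId), usize> = HashMap::new();
    for &entry in entries {
        *votes.entry(entry).or_insert(0) += 1;
    }
    votes.into_iter().max_by_key(|((epoch, _), _)| *epoch)
}
```

Fed the six entries from the incident above (three at epoch 268, one at 269, two at 270, all with the same block id), the old grouping yields a winner with count 2, below the quorum of 4, while the new grouping yields a single group with count 6.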

## Test plan

- [x] New test
`leader_state__when_same_block_has_different_epochs_across_nodes_then_reconciles_without_repair`
reproduces the exact production error without the fix (`"Backlog
unresolved at height 1: repair failed to reach quorum"`) and passes with
it
- [x] All 9 existing `leader_state__*` tests still pass
- [ ] Deploy to devnet and verify the stuck authority recovers

---------

Co-authored-by: Brandon Kite <brandonkite92@gmail.com>
…port (#3261) (#3274)

cherry-pick #3261

## Summary

- Fixes a deadlock in the PoA service that caused a 30-minute block
production outage on testnet (April 9, 2026)
- After a FENCING_ERROR, reconciliation imports a block via
`execute_and_commit` which marks it as `Source::Network`. The SyncTask
sees this and transitions from `Synced` → `NotSynced`. On the next
iteration, `ensure_synced()` blocks forever — the leader can't produce
while blocked, and the SyncTask needs a locally-produced block to
recover. Classic deadlock.
- Fix: add a reconciliation watermark (`Arc<AtomicU32>`) shared between
`MainTask` and `SyncTask`. Before importing reconciliation blocks,
`MainTask` sets the watermark to the max height. `SyncTask` treats
blocks at heights ≤ the watermark as locally produced, staying `Synced`.

## Details

**Root cause chain:**
1. `importer.rs:584-585` — `execute_and_commit` always uses
`ImportResult::new_from_network()`
2. `sync.rs:186-203` — SyncTask transitions `Synced → NotSynced` on
non-local block with height > current
3. `service.rs:501-521` — `ensure_synced()` blocks on
`sync_state.changed()` when `NotSynced`
4. Deadlock: leader blocked in `ensure_synced()`, SyncTask waiting for
locally-produced block that can never arrive

**Why a watermark:** A bool flag has a race condition — the SyncTask may
not poll the broadcast channel until after the flag is cleared. The
watermark encodes a permanent fact ("all blocks up to height N were
reconciled") that never needs clearing.
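
A minimal sketch of the watermark idea, assuming a much simpler state machine than the real `SyncTask`: the enum names mirror the description, but `new_watermark`, `mark_reconciled`, and `next_state` are hypothetical helpers, and only the `Synced → NotSynced` transition is modelled.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;

#[derive(Debug, Clone, Copy, PartialEq)]
enum SyncState {
    Synced,
    NotSynced,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Source {
    Local,
    Network,
}

/// Create the watermark shared by the producing task and the sync task.
fn new_watermark() -> Arc<AtomicU32> {
    Arc::new(AtomicU32::new(0))
}

/// Reconciliation side: raise the watermark before importing blocks up to
/// `max_height`. `fetch_max` never lowers the value, so nothing needs clearing.
fn mark_reconciled(watermark: &AtomicU32, max_height: u32) {
    watermark.fetch_max(max_height, Ordering::SeqCst);
}

/// Sync side: decide the next state after observing an imported block.
fn next_state(
    state: SyncState,
    watermark: &AtomicU32,
    current_height: u32,
    block_height: u32,
    source: Source,
) -> SyncState {
    // Blocks at or below the watermark are treated as locally produced.
    let reconciled = block_height <= watermark.load(Ordering::SeqCst);
    match (state, source) {
        (SyncState::Synced, Source::Network)
            if block_height > current_height && !reconciled =>
        {
            SyncState::NotSynced
        }
        _ => state,
    }
}
```

Because the watermark only ever grows, it does not matter whether the sync side observes the block before or after reconciliation finishes, which is the race a clearable bool flag would be exposed to.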

**Files changed (all within `fuel-core-poa`):**
- `sync.rs` — Add `reconciliation_watermark` field, check it in block
handler
- `service.rs` — Create shared watermark, set via `fetch_max` during
reconciliation
- `service_test.rs` — Add deadlock reproduction test

## Test plan

- [x]
`sync_task__network_block_at_reconciliation_height_causes_not_synced_without_watermark`
— confirms bug mechanism (network block → NotSynced)
- [x] `sync_task__network_block_within_watermark_stays_synced` —
verifies watermark prevents NotSynced; blocks above watermark still
trigger it
- [x] `main_task__reconciliation_import_does_not_deadlock_leader` — full
service-level deadlock reproduction (fails without fix, passes with)
- [x] All 51 existing `fuel-core-poa` tests pass

---------


Co-authored-by: Brandon Kite <brandonkite92@gmail.com>
Co-authored-by: Green Baneling <XgreenX9999@gmail.com>
Co-authored-by: Hannes Karppila <2204863+Dentosal@users.noreply.github.com>
## Linked Issues/PRs
<!-- List of related issues/PRs -->

Cherrypick #3272

## Description
<!-- List of detailed changes -->

## Checklist
- [ ] Breaking changes are clearly marked as such in the PR description
and changelog
- [ ] New behavior is reflected in tests
- [ ] [The specification](https://github.com/FuelLabs/fuel-specs/)
matches the implemented behavior (link update PR if changes are needed)

### Before requesting review
- [ ] I have reviewed the code myself
- [ ] I have created follow-up issues caused by this PR and linked them
here

### After merging, notify other teams

[Add or remove entries as needed]

- [ ] [Rust SDK](https://github.com/FuelLabs/fuels-rs/)
- [ ] [Sway compiler](https://github.com/FuelLabs/sway/)
- [ ] [Platform
documentation](https://github.com/FuelLabs/devrel-requests/issues/new?assignees=&labels=new+request&projects=&template=NEW-REQUEST.yml&title=%5BRequest%5D%3A+)
(for out-of-organization contributors, the person merging the PR will do
this)
- [ ] Someone else?

@MitchTurner MitchTurner added the `no changelog` label (skip the CI check of the changelog modification) Apr 17, 2026
@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

if entry_height ~= nil and entry_height < posted_height then
    stop_scan = true
    break
end
Lua early-stop may miss HEIGHT_EXISTS on out-of-order streams

Medium Severity

The early-stop optimization halts the XREVRANGE scan when it finds an entry_height < posted_height. Since XREVRANGE iterates by stream ID (timestamp order), not by height value, this assumes heights are strictly monotonically increasing in the stream. If a partial write from a failed leader left an orphan at a higher height and then a new leader (with a valid epoch) successfully wrote at a lower height, the stream could contain out-of-order heights. In that case, the scan would stop early and miss an existing entry at posted_height, bypassing the HEIGHT_EXISTS safety check that prevents forks.

Member

This is a fair assumption, since write_block.lua enforces that any new height is greater than the heights already in the stream.
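
The two positions in this thread can be compared with a small model. This is a sketch under assumptions, not a port of write_block.lua: the stream is reduced to a slice of heights in insertion order, the newest-first walk stands in for XREVRANGE, entries without a height are ignored, and both function names are hypothetical.

```rust
/// Walk the stream newest-first and report whether an entry at
/// `posted_height` exists, stopping at the first entry with a lower height.
fn height_exists_early_stop(stream_heights: &[u32], posted_height: u32) -> bool {
    for &height in stream_heights.iter().rev() {
        if height == posted_height {
            return true;
        }
        if height < posted_height {
            break; // early stop: assumes all older entries are lower too
        }
    }
    false
}

/// Reference behavior: inspect every entry regardless of order.
fn height_exists_full_scan(stream_heights: &[u32], posted_height: u32) -> bool {
    stream_heights.contains(&posted_height)
}
```

On a stream whose heights only increase, such as `[1, 2, 3, 4]`, both functions agree for every posted height. They diverge only on an out-of-order stream such as `[1, 2, 4, 3]` with a posted height of 4, so the optimization is safe exactly as long as the monotonic-height invariant holds.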

Labels

  • no changelog: Skip the CI check of the changelog modification
  • pr release: Used to trigger the github action to update versions

3 participants