
ct/frontend: reserve producer_queue ticket before stage_write #30429

Draft

oleiman wants to merge 1 commit into dev from ct/noticket/high-volume-of-ooosn

Conversation

@oleiman (Member) commented May 9, 2026

The per-producer ordering ticket was reserved inside the .then_wrapped continuation that fires after stage_write resolves. stage_write is asynchronous (prepare_write has multiple yield points: serialize_batches and the data-plane memory/request semaphores), so its futures can resolve in a different order than the replicate() calls that produced them. The ticket order is what rm_stm ultimately sees as the batch order; reordering there breaks idempotent-producer sequence checks.

Specifically, a later batch with first_seq>0 from a producer that is unknown to that partition's rm_stm takes the kafka-compatible skip_sequence_checks path (rm_stm.cc:1156), advancing last_known_sequence past the earlier batch. The earlier batch is then permanently rejected with OUT_OF_ORDER_SEQUENCE_NUMBER, and the kafka client retries it indefinitely while the broker keeps advancing on later batches. This surfaced as a steady-state OOOSN floor at workload start in cloud-mode benchmark runs.

Reserve the ticket synchronously before stage_write and capture it into the continuation. producer_queue::reserve is single-shard and synchronous, so call order determines ticket order unconditionally. Concurrent uploads are unchanged; the ticket only gates the rm_stm replicate step inside do_upload_and_replicate.
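
Below is a minimal, self-contained model of the ordering argument above. It is not the Redpanda code: producer_queue here is a hypothetical stand-in with a synchronous reserve(), and std::async stands in for stage_write's asynchrony. It only illustrates that reserving before the asynchronous step pins ticket order to call order even when staging completes out of order.

```cpp
#include <chrono>
#include <cstdint>
#include <future>
#include <iostream>
#include <thread>

// Hypothetical stand-in for the per-partition producer queue: reserve() is
// synchronous, so ticket order is exactly the order in which it is called.
struct producer_queue {
    uint64_t next = 0;
    uint64_t reserve() { return next++; }
};

int main() {
    producer_queue q;

    auto submit = [&](const char* name, std::chrono::milliseconds stage_delay) {
        // Fixed shape: reserve *before* the asynchronous staging step and
        // carry the ticket into the continuation.
        uint64_t ticket = q.reserve();
        return std::async(std::launch::async, [=] {
            // Stand-in for stage_write's yield points (serialization, semaphores).
            std::this_thread::sleep_for(stage_delay);
            // In the buggy shape the reservation happened at this point instead,
            // so whichever staging step finished first got the lower ticket.
            std::cout << name << " replicates with ticket " << ticket << "\n";
        });
    };

    auto b1 = submit("B1", std::chrono::milliseconds(50)); // staging is slow
    auto b2 = submit("B2", std::chrono::milliseconds(1));  // staging is fast
    b1.wait();
    b2.wait();
    // B1 always holds ticket 0 and B2 ticket 1, matching replicate() call
    // order, even though B2's staging completed first.
}
```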

The regression test mocks stage_write so that the second call's future resolves before the first's, and asserts that both batches still succeed. Without the fix it reproduces the production OOOSN: rm_stm logs "Accepting batch from unknown producer that likely got evicted" for the reordered second batch and then rejects the first with cluster::errc::sequence_out_of_order.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v26.1.x
  • v25.3.x
  • v25.2.x

Release Notes

Bug Fixes

  • Fixes a bug in the cloud topics write path that could cause idempotency errors after a cold start.

@oleiman oleiman self-assigned this May 9, 2026
@oleiman oleiman requested a review from Copilot May 9, 2026 02:09
@oleiman (Member, Author) commented May 9, 2026

/ci-repeat 1

@oleiman oleiman added the claude-review label May 9, 2026
@claude (Bot) commented May 9, 2026

Review summary

The fix is correct, narrowly scoped, and well-explained. Reserving the producer_queue ticket synchronously before stage_write is exactly what the producer queue's contract requires: ticket order has to match replicate() call order, not the order in which asynchronous stage_write futures resolve. The detailed PR description (skip-sequence-checks path → last_known_sequence advances past the earlier batch → permanent OOOSN on the prior batch) matches what I see in rm_stm and do_upload_and_replicate.

Correctness checks I made

  • Lifetime / leaks on early failure paths (a toy sketch of this release behavior follows this list). producer_ticket's destructor calls release(), which both pops itself from the per-pid linked list in the producer_state_map and resolves the next ticket's promise if it was the head. So:
    • If stage_write resolves with a failed future, the lambda sets the exception and returns; the captured ticket and ctp_stm destruct cleanly, releasing any successor in the queue.
    • If anything between reserve(...) and then_wrapped(...) throws (e.g. batch.header(), chunked_vector::push_back), the local ticket destructs and the queue invariant holds.
    • The _map/_prev/_next chain in ticket_impl::release correctly stitches predecessors and successors when a ticket is released out of order, so a failed earlier ticket does not strand a later one.
  • Concurrency. producer_queue is single-shard and partitions are shard-affined, so all frontend::replicate calls for the same partition land on the same shard. Synchronous reserve is sufficient to capture call order.
  • Move semantics. The captures ticket = std::move(ticket) and ctp_stm = std::move(ctp_stm_api) plus the mutable qualifier and the moves into do_upload_and_replicate(...) parameters are all consistent — no double-moves, no unnecessary copies.
  • No similar bug elsewhere. producer_queue().reserve(...) is only called at this one site (frontend.cc:953); the other replicate(chunked_vector<...>) overload doesn't go through the producer queue.
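
For the lifetime point above, here is a toy, self-contained sketch of that release invariant. toy_gate and its ticket are hypothetical stand-ins, not the real producer_queue/producer_ticket (which use an intrusive per-producer list rather than promise chaining); it only illustrates that destroying a ticket on any exit path wakes its successor, so an early failure cannot strand later requests.

```cpp
#include <future>
#include <iostream>
#include <utility>

class toy_gate {
public:
    class ticket {
    public:
        ticket(std::shared_future<void> turn, std::promise<void> next)
          : _turn(std::move(turn)), _next(std::move(next)) {}
        ticket(ticket&&) = default;
        ~ticket() {
            // release(): wake the successor whether or not we were redeemed.
            // (set_value on a moved-from promise throws; ignore it.)
            try { _next.set_value(); } catch (...) {}
        }
        void redeem() { _turn.get(); } // wait for our turn, in reserve() order
    private:
        std::shared_future<void> _turn;
        std::promise<void> _next;
    };

    ticket reserve() {
        std::promise<void> next;
        // This ticket's turn is the previous ticket's "next" future.
        auto turn = std::exchange(_tail, next.get_future().share());
        return ticket(std::move(turn), std::move(next));
    }

private:
    // Turn of the next ticket to be handed out; the very first one is ready.
    std::shared_future<void> _tail = [] {
        std::promise<void> p; p.set_value(); return p.get_future().share();
    }();
};

int main() {
    toy_gate gate;
    auto t1 = gate.reserve();
    auto t2 = gate.reserve();
    {
        // Simulate t1's request failing before it ever replicates.
        auto dropped = std::move(t1); // destroyed here -> releases t2's turn
    }
    t2.redeem(); // does not hang: t1's destructor already signalled its turn
    std::cout << "t2 proceeded after t1 was dropped\n";
}
```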

Behavior delta worth noting

In the old code, a synchronous failure in stage_write would never have touched the producer queue. In the new code, the ticket is reserved first and then released by the lambda destructor on failure. Concurrent in-flight requests for the same producer therefore briefly serialize behind the failed call's ticket. This is correct and almost certainly fine in practice (single-shard list ops), but worth noting in case it changes failure-mode latency under heavy contention.

Test

The regression test correctly exercises the bug — forcing B2's stage_write to resolve before B1's, with same-pid sequential first_seq so a reorder trips skip_sequence_checks → OOOSN. The inversion is tight and the assertions pin down which side of the bug is which.

I left a small note inline about request_enqueued.get() not actually witnessing that B2 has reached ticket.redeem(); the test still works because the replicate_finished.get() calls converge it, but the surrounding comment overstates the synchronization guarantee. Two other inline nits: the rm_stm.cc:1156 line reference will rot, and there's a pre-existing _ctp_stm_api member that could replace the make_ctp_stm_api(_partition) call you're now hoisting out of the continuation.

Backport

v26.1.x is checked in the PR template, which matches the user-visible bug fix in the release notes. No code or tests outside cloud_topics/frontend change, so the backport should be mechanical.

LGTM modulo the small comment tweaks above.

Copilot AI (Contributor) left a comment

Pull request overview

Fixes a cloud-topics produce-path ordering bug where per-producer ordering tickets were reserved only after data_plane_api::stage_write() completed, allowing out-of-order stage_write resolution to reorder tickets and break idempotent producer sequencing in rm_stm.

Changes:

  • Reserve the producer_queue ordering ticket synchronously before initiating stage_write, and move the ticket into the async continuation.
  • Add a regression test that forces stage_write futures to resolve out of order and asserts both idempotent batches still succeed.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Reviewed files:

  • src/v/cloud_topics/frontend/frontend.cc: reserves the per-producer ordering ticket before stage_write and threads it through the continuation to preserve replicate call order.
  • src/v/cloud_topics/frontend/tests/frontend_test.cc: adds regression coverage for out-of-order stage_write resolution while maintaining per-producer replicate ordering.

Comment on lines +436 to +447

EXPECT_CALL(*_data_plane, stage_write(_))
.WillOnce(Return(ByMove(b1_stage.get_future())))
.WillOnce(Return(ss::as_ready_future(stage_result{})));

// Both batches use the same cluster_epoch so neither fence step is
// rejected on epoch grounds. The order of `execute_write` calls is the
// same in both buggy and fixed code (B2 first because its stage_write
// resolves first; B1 second).
EXPECT_CALL(*_data_plane, execute_write(_, _, _, _))
.WillOnce(Return(make_extent_fut(model::offset(0), cluster_epoch(1))))
.WillOnce(Return(make_extent_fut(model::offset(1), cluster_epoch(1))));
@@ -942,6 +942,15 @@ raft::replicate_stages frontend::replicate(
}

auto ctp_stm_api = make_ctp_stm_api(_partition);

Pre-existing nit (unblocking): frontend already holds _ctp_stm_api as a member (initialized in the ctor) that wraps the same underlying ctp_stm. While you're reshuffling this, you could replace auto ctp_stm_api = make_ctp_stm_api(_partition); with auto ctp_stm_api = _ctp_stm_api; (or auto ctp_stm = _ctp_stm_api; captured into the lambda). It avoids an extra lw_shared_ptr alloc per replicate and the surprising fact that make_ctp_stm_api throws (despite the ctor being noexcept).

Happy to defer this to a follow-up — the producer-queue ordering fix is the load-bearing change here.

// `skip_sequence_checks` path (rm_stm.cc:1156) because the producer is
// unknown to that partition. `last_known_sequence` advances to B2's
// last_seq, and the subsequent B1 (first_seq=0) is rejected with
// OUT_OF_ORDER_SEQUENCE_NUMBER (cluster::errc::sequence_out_of_order).

Nit: line references like rm_stm.cc:1156 will rot fast — consider naming the function (e.g. "the skip_sequence_checks branch in rm_stm::check_seq") instead. Same for the duplicate reference at line 405.

// waiting for B1's ticket release; that's still consistent with
// `request_enqueued` resolving (which only requires the
// `then_wrapped` callback to return).
stages2.request_enqueued.get();

Question on test robustness: stages2.request_enqueued.get() only requires B2's then_wrapped callback to return — it does not guarantee B2's do_upload_and_replicate has reached ticket.redeem() (or even execute_write). For the buggy-code scenario you describe in the comment, the subsequent b1_stage.set_value(...) plus the two replicate_finished.get() calls are what eventually converge things, so the assertion still holds either way. But the comment block above ("In the fixed code B2's chain runs through stage_write but blocks at ticket.redeem()...") is slightly misleading since request_enqueued resolving doesn't actually witness that.

Either tighten the comment, or — for an extra-strong guarantee in the buggy case — wait until B2 has reached its rm_stm replicate before unblocking B1 (e.g. expose a hook through the mock or yield with ss::sleep/ss::later a few times). Not strictly required if the bug-case assertion is reliable on CI today.
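
A rough sketch of the yield-based option, assuming the test body runs in a seastar thread context (the existing .get() calls suggest it does); this only nudges B2's chain forward and is not a positive witness that it reached ticket.redeem():

```cpp
for (int i = 0; i < 100; ++i) {
    // Let the reactor run queued tasks so B2's continuation chain can
    // progress past stage_write before B1 is unblocked.
    ss::later().get();
}
// ...then unblock B1 via b1_stage.set_value(...), as the test already does.
```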

@vbotbuildovich (Collaborator)

Retry command for Build#84248

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/cluster_linking_e2e_test.py::ShadowLinkingReplicationTests.test_auto_prefix_trimming@{"source_cluster_spec":{"cluster_type":"redpanda"},"storage_mode":"tiered","with_failures":true}

@vbotbuildovich (Collaborator)

CI test results

test results on build#84248
  • test_status: FLAKY(FAIL)
  • test_class: ShadowLinkingReplicationTests
  • test_method: test_auto_prefix_trimming
  • test_arguments: {"source_cluster_spec": {"cluster_type": "redpanda"}, "storage_mode": "tiered", "with_failures": true}
  • test_kind: integration
  • job_url: https://buildkite.com/redpanda/redpanda/builds/84248#019e0a8b-92d7-4189-bdd9-a14740bd8c15
  • passed: 9/11
  • reason: Test FAILS after retries. Significant increase in flaky rate (baseline=0.0003, p0=0.0031, reject_threshold=0.0100)
  • test_history: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_auto_prefix_trimming

@oleiman (Member, Author) commented May 9, 2026

/ci-repeat 1
release
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/cluster_linking_e2e_test.py::ShadowLinkingReplicationTests.test_auto_prefix_trimming@{"source_cluster_spec":{"cluster_type":"redpanda"},"storage_mode":"tiered","with_failures":true}


Labels: area/redpanda, claude-review
