
ct/frontend: reserve producer_queue ticket before stage_write #30429

Draft

oleiman wants to merge 1 commit into dev from ct/noticket/high-volume-of-ooosn

Conversation

@oleiman (Member) commented May 9, 2026

The per-producer ordering ticket was reserved inside the .then_wrapped continuation that fires after stage_write resolves. stage_write is asynchronous (prepare_write has multiple yield points: serialize_batches and the data-plane memory/request semaphores), so its futures can resolve in a different order than the replicate() calls that produced them. The ticket order is what rm_stm ultimately sees as the batch order; reordering there breaks idempotent-producer sequence checks.

Specifically, a later batch with first_seq>0 from a producer that is unknown to that partition's rm_stm takes the kafka-compatible skip_sequence_checks path (rm_stm.cc:1156), advancing last_known_sequence past the earlier batch. The earlier batch is then permanently rejected with OUT_OF_ORDER_SEQUENCE_NUMBER, and the kafka client retries it indefinitely while the broker keeps advancing on later batches. This surfaced as a steady-state OOOSN floor at workload start in cloud-mode benchmark runs.

Reserve the ticket synchronously before stage_write and capture it into the continuation. producer_queue::reserve is single-shard and synchronous, so call order determines ticket order unconditionally. Concurrent uploads are unchanged; the ticket only gates the rm_stm replicate step inside do_upload_and_replicate.
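
Below is a minimal, self-contained model of the ordering argument above. It is not the Redpanda code: producer_queue here is a hypothetical stand-in with a synchronous reserve(), and std::async stands in for stage_write's asynchrony. It only illustrates that reserving before the asynchronous step pins ticket order to call order even when staging completes out of order.

```cpp
#include <chrono>
#include <cstdint>
#include <future>
#include <iostream>
#include <thread>

// Hypothetical stand-in for the per-partition producer queue: reserve() is
// synchronous, so ticket order is exactly the order in which it is called.
struct producer_queue {
    uint64_t next = 0;
    uint64_t reserve() { return next++; }
};

int main() {
    producer_queue q;

    auto submit = [&](const char* name, std::chrono::milliseconds stage_delay) {
        // Fixed shape: reserve *before* the asynchronous staging step and
        // carry the ticket into the continuation.
        uint64_t ticket = q.reserve();
        return std::async(std::launch::async, [=] {
            // Stand-in for stage_write's yield points (serialization, semaphores).
            std::this_thread::sleep_for(stage_delay);
            // In the buggy shape the reservation happened at this point instead,
            // so whichever staging step finished first got the lower ticket.
            std::cout << name << " replicates with ticket " << ticket << "\n";
        });
    };

    auto b1 = submit("B1", std::chrono::milliseconds(50)); // staging is slow
    auto b2 = submit("B2", std::chrono::milliseconds(1));  // staging is fast
    b1.wait();
    b2.wait();
    // B1 always holds ticket 0 and B2 ticket 1, matching replicate() call
    // order, even though B2's staging completed first.
}
```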

The regression test mocks stage_write so that the second call's future resolves before the first's, and asserts that both batches still succeed. Without the fix it reproduces the production OOOSN: rm_stm logs "Accepting batch from unknown producer that likely got evicted" for the reordered second batch and then rejects the first with cluster::errc::sequence_out_of_order.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v26.1.x
  • v25.3.x
  • v25.2.x

Release Notes

Bug Fixes

  • Fixes a bug in the cloud topics write path that could cause idempotency errors after a cold start.

@oleiman oleiman self-assigned this May 9, 2026
@oleiman oleiman requested a review from Copilot May 9, 2026 02:09
@oleiman (Member, Author) commented May 9, 2026

/ci-repeat 1

@oleiman oleiman added the claude-review label May 9, 2026
@claude (Bot) commented May 9, 2026

Review summary

The fix is correct, narrowly scoped, and well-explained. Reserving the producer_queue ticket synchronously before stage_write is exactly what the producer queue's contract requires: ticket order has to match replicate() call order, not the order in which asynchronous stage_write futures resolve. The detailed PR description (skip-sequence-checks path → last_known_sequence advances past the earlier batch → permanent OOOSN on the prior batch) matches what I see in rm_stm and do_upload_and_replicate.

Correctness checks I made

  • Lifetime / leaks on early failure paths (a toy sketch of this release behavior follows this list). producer_ticket's destructor calls release(), which both pops itself from the per-pid linked list in the producer_state_map and resolves the next ticket's promise if it was the head. So:
    • If stage_write resolves with a failed future, the lambda sets the exception and returns; the captured ticket and ctp_stm destruct cleanly, releasing any successor in the queue.
    • If anything between reserve(...) and then_wrapped(...) throws (e.g. batch.header(), chunked_vector::push_back), the local ticket destructs and the queue invariant holds.
    • The _map/_prev/_next chain in ticket_impl::release correctly stitches predecessors and successors when a ticket is released out of order, so a failed earlier ticket does not strand a later one.
  • Concurrency. producer_queue is single-shard and partitions are shard-affined, so all frontend::replicate calls for the same partition land on the same shard. Synchronous reserve is sufficient to capture call order.
  • Move semantics. The captures ticket = std::move(ticket) and ctp_stm = std::move(ctp_stm_api) plus the mutable qualifier and the moves into do_upload_and_replicate(...) parameters are all consistent — no double-moves, no unnecessary copies.
  • No similar bug elsewhere. producer_queue().reserve(...) is only called at this one site (frontend.cc:953); the other replicate(chunked_vector<...>) overload doesn't go through the producer queue.
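
For the lifetime point above, here is a toy, self-contained sketch of that release invariant. toy_gate and its ticket are hypothetical stand-ins, not the real producer_queue/producer_ticket (which use an intrusive per-producer list rather than promise chaining); it only illustrates that destroying a ticket on any exit path wakes its successor, so an early failure cannot strand later requests.

```cpp
#include <future>
#include <iostream>
#include <utility>

class toy_gate {
public:
    class ticket {
    public:
        ticket(std::shared_future<void> turn, std::promise<void> next)
          : _turn(std::move(turn)), _next(std::move(next)) {}
        ticket(ticket&&) = default;
        ~ticket() {
            // release(): wake the successor whether or not we were redeemed.
            // (set_value on a moved-from promise throws; ignore it.)
            try { _next.set_value(); } catch (...) {}
        }
        void redeem() { _turn.get(); } // wait for our turn, in reserve() order
    private:
        std::shared_future<void> _turn;
        std::promise<void> _next;
    };

    ticket reserve() {
        std::promise<void> next;
        // This ticket's turn is the previous ticket's "next" future.
        auto turn = std::exchange(_tail, next.get_future().share());
        return ticket(std::move(turn), std::move(next));
    }

private:
    // Turn of the next ticket to be handed out; the very first one is ready.
    std::shared_future<void> _tail = [] {
        std::promise<void> p; p.set_value(); return p.get_future().share();
    }();
};

int main() {
    toy_gate gate;
    auto t1 = gate.reserve();
    auto t2 = gate.reserve();
    {
        // Simulate t1's request failing before it ever replicates.
        auto dropped = std::move(t1); // destroyed here -> releases t2's turn
    }
    t2.redeem(); // does not hang: t1's destructor already signalled its turn
    std::cout << "t2 proceeded after t1 was dropped\n";
}
```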

Behavior delta worth noting

In the old code, a synchronous failure in stage_write would never have touched the producer queue. In the new code, the ticket is reserved first and then released by the lambda destructor on failure. Concurrent in-flight requests for the same producer therefore briefly serialize behind the failed call's ticket. This is correct and almost certainly fine in practice (single-shard list ops), but worth noting in case it changes failure-mode latency under heavy contention.

Test

The regression test correctly exercises the bug — forcing B2's stage_write to resolve before B1's, with same-pid sequential first_seq so a reorder trips skip_sequence_checks → OOOSN. The inversion is tight and the assertions pin down which side of the bug is which.

I left a small note inline about request_enqueued.get() not actually witnessing that B2 has reached ticket.redeem(); the test still works because the replicate_finished.get() calls converge it, but the surrounding comment overstates the synchronization guarantee. Two other inline nits: the rm_stm.cc:1156 line reference will rot, and there's a pre-existing _ctp_stm_api member that could replace the make_ctp_stm_api(_partition) call you're now hoisting out of the continuation.

Backport

v26.1.x is checked in the PR template, which matches the user-visible bug fix in the release notes. No code or tests outside cloud_topics/frontend change, so the backport should be mechanical.

LGTM modulo the small comment tweaks above.

Copilot AI (Contributor) left a comment

Pull request overview

Fixes a cloud-topics produce-path ordering bug where per-producer ordering tickets were reserved only after data_plane_api::stage_write() completed, allowing out-of-order stage_write resolution to reorder tickets and break idempotent producer sequencing in rm_stm.

Changes:

  • Reserve the producer_queue ordering ticket synchronously before initiating stage_write, and move the ticket into the async continuation.
  • Add a regression test that forces stage_write futures to resolve out of order and asserts both idempotent batches still succeed.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Reviewed files:

  • src/v/cloud_topics/frontend/frontend.cc: reserves the per-producer ordering ticket before stage_write and threads it through the continuation to preserve replicate call order.
  • src/v/cloud_topics/frontend/tests/frontend_test.cc: adds regression coverage for out-of-order stage_write resolution while maintaining per-producer replicate ordering.

Comment on lines +436 to +447

EXPECT_CALL(*_data_plane, stage_write(_))
.WillOnce(Return(ByMove(b1_stage.get_future())))
.WillOnce(Return(ss::as_ready_future(stage_result{})));

// Both batches use the same cluster_epoch so neither fence step is
// rejected on epoch grounds. The order of `execute_write` calls is the
// same in both buggy and fixed code (B2 first because its stage_write
// resolves first; B1 second).
EXPECT_CALL(*_data_plane, execute_write(_, _, _, _))
.WillOnce(Return(make_extent_fut(model::offset(0), cluster_epoch(1))))
.WillOnce(Return(make_extent_fut(model::offset(1), cluster_epoch(1))));
@@ -942,6 +942,15 @@ raft::replicate_stages frontend::replicate(
}

auto ctp_stm_api = make_ctp_stm_api(_partition);

Pre-existing nit (unblocking): frontend already holds _ctp_stm_api as a member (initialized in the ctor) that wraps the same underlying ctp_stm. While you're reshuffling this, you could replace auto ctp_stm_api = make_ctp_stm_api(_partition); with auto ctp_stm_api = _ctp_stm_api; (or auto ctp_stm = _ctp_stm_api; captured into the lambda). It avoids an extra lw_shared_ptr alloc per replicate and the surprising fact that make_ctp_stm_api throws (despite the ctor being noexcept).

Happy to defer this to a follow-up — the producer-queue ordering fix is the load-bearing change here.

// `skip_sequence_checks` path (rm_stm.cc:1156) because the producer is
// unknown to that partition. `last_known_sequence` advances to B2's
// last_seq, and the subsequent B1 (first_seq=0) is rejected with
// OUT_OF_ORDER_SEQUENCE_NUMBER (cluster::errc::sequence_out_of_order).

Nit: line references like rm_stm.cc:1156 will rot fast — consider naming the function (e.g. "the skip_sequence_checks branch in rm_stm::check_seq") instead. Same for the duplicate reference at line 405.

// waiting for B1's ticket release; that's still consistent with
// `request_enqueued` resolving (which only requires the
// `then_wrapped` callback to return).
stages2.request_enqueued.get();

Question on test robustness: stages2.request_enqueued.get() only requires B2's then_wrapped callback to return — it does not guarantee B2's do_upload_and_replicate has reached ticket.redeem() (or even execute_write). For the buggy-code scenario you describe in the comment, the subsequent b1_stage.set_value(...) plus the two replicate_finished.get() calls are what eventually converge things, so the assertion still holds either way. But the comment block above ("In the fixed code B2's chain runs through stage_write but blocks at ticket.redeem()...") is slightly misleading since request_enqueued resolving doesn't actually witness that.

Either tighten the comment, or — for an extra-strong guarantee in the buggy case — wait until B2 has reached its rm_stm replicate before unblocking B1 (e.g. expose a hook through the mock or yield with ss::sleep/ss::later a few times). Not strictly required if the bug-case assertion is reliable on CI today.
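
A rough sketch of the yield-based option, assuming the test body runs in a seastar thread context (the existing .get() calls suggest it does); this only nudges B2's chain forward and is not a positive witness that it reached ticket.redeem():

```cpp
for (int i = 0; i < 100; ++i) {
    // Let the reactor run queued tasks so B2's continuation chain can
    // progress past stage_write before B1 is unblocked.
    ss::later().get();
}
// ...then unblock B1 via b1_stage.set_value(...), as the test already does.
```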

@vbotbuildovich (Collaborator)

Retry command for Build#84248

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/cluster_linking_e2e_test.py::ShadowLinkingReplicationTests.test_auto_prefix_trimming@{"source_cluster_spec":{"cluster_type":"redpanda"},"storage_mode":"tiered","with_failures":true}

@vbotbuildovich (Collaborator)

CI test results

test results on build#84248
  • test_status: FLAKY(FAIL)
  • test_class: ShadowLinkingReplicationTests
  • test_method: test_auto_prefix_trimming
  • test_arguments: {"source_cluster_spec": {"cluster_type": "redpanda"}, "storage_mode": "tiered", "with_failures": true}
  • test_kind: integration
  • job_url: https://buildkite.com/redpanda/redpanda/builds/84248#019e0a8b-92d7-4189-bdd9-a14740bd8c15
  • passed: 9/11
  • reason: Test FAILS after retries. Significant increase in flaky rate (baseline=0.0003, p0=0.0031, reject_threshold=0.0100)
  • test_history: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_auto_prefix_trimming

@oleiman (Member, Author) commented May 9, 2026

/ci-repeat 1
release
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/cluster_linking_e2e_test.py::ShadowLinkingReplicationTests.test_auto_prefix_trimming@{"source_cluster_spec":{"cluster_type":"redpanda"},"storage_mode":"tiered","with_failures":true}


Labels: area/redpanda, claude-review
