Proposed v26.2.x: 68 RP commits on scylladb/master c302102b7 (#277)
travisdowns wants to merge 66 commits into
Conversation
Seastar's http server implementation supports multiple listeners. Handler logic may need to know which listener a connection arrived on. Added a listener_idx field to `httpd::request` to allow the handler to identify the listener. Signed-off-by: Michal Maslanka <michal@vectorized.io>
Since an exception carries some text for the response body, the raising site might like to specify the content type, e.g. if it is json. Signed-off-by: John Spray <jcs@vectorized.io>
This enables throwing a base_exception from a json request handler with a json payload inside it. Signed-off-by: John Spray <jcs@vectorized.io>
Signed-off-by: John Spray <jcs@vectorized.io>
Prior to this patch seastar only exposes one global metrics::impl::impl object which holds all metric-related data for one application. This patch changes the implementation details such that multiple metrics::impl::impl objects can exist for any given application. These objects are stored in a map on each shard and created dynamically whenever requested. A metrics::impl::impl is identified by an integer handle that acts as the key for the storage map. Implementation note: in order to avoid issues caused by the ordering of static thread_local objects I had to declare the storage in reactor.cc. (cherry picked from commit 585a8af)
This patch extends the metrics internal apis to use a specific metrics::impl::impl object identified by its integer handle. (cherry picked from commit 6ee4af7)
Add a public method to metric_groups_impl that exposes the handle of the internal implementation it is using. This is required in order for the metric_groups class to be able to reset itself to the configured implementation handle.
This patch extends the metrics user-facing apis to use a specific metrics::impl::impl object identified by its integer handle. Note that the constructor of 'metric_groups' is marked explicit in this patch, which requires updating two call sites where the constructor was used implicitly.
This patch removes two subsequent calls to `get_local_impl` and reuses the returned handle in that scope.
This patch extends the user facing prometheus apis allowing the user to specify the internal metrics implementation to be used through a handle. Additionally, 'add_prometheus_routes' now takes an argument that specifies the route on which to advertise the metrics. This enables different metrics "namespaces" to be served by different endpoints in isolation. (cherry picked from commit 6189522)
This patch extends the scollectd apis with the ability to select the internal metrics implementation to be used by providing a handle. (cherry picked from commit d4331d1)
This patch adds a 'get_skip_when_empy' getter to the 'registered_metric' class. It is used by follow-up patches in order to replicate metrics.
This patch adds private methods to the 'metrics::impl' class that deal with the creation of replicated metrics. They will be used to build the public api in future commits.
This patch adds private helpers to 'metrics::impl' that deal with the removal of replicated metric families from their destination implementation. These methods will be used in subsequent commits to manage the lifetime of replicated metrics.
This patch adds a public method to the 'metrics::impl' class: 'set_metric_families_to_replicate'. When this method is called the families that match any of the specifications will be replicated on the specified destinations.
This patch extends the metric registration and unregistration processes to make them aware of metric replication. In the case of metric registration, if the new metric belongs to a family that matches one of the replication specs, then a replicated metric is created accordingly. For unregistration of a metric, the replicated metric is unregistered too if one exists.
This patch exposes a method in the public interface of the metrics
module ('replicate_metric_families'), which enables metric replication
internally for the requested metric families.
Extends the metrics api to allow changing the aggregation labels of a metrics family. Otherwise one had to un-register every single metric instance in a metric family and then re-register with the changed aggregation labels. For metric families with thousands of instances (e.g.: histograms with lots of different labels) this is quite expensive. With this change we avoid the full reconstruction of the metrics family and all its metrics. Only the work associated with marking the metrics `dirty()` is needed then.
This commit extends the public interface of scheduling_group to expose usage statistics (e.g. runtime).
Redpanda uses this logger name for its http client, so we change the logger name in Seastar to avoid duplicate logger registration exceptions.
We sort of inadvertently picked up the change to io_uring when rebasing our seastar fork, but for the coming release we'd like to keep using aio to reduce risk and give sufficient time to do performance tests on io_uring. This effectively rolls back the upstream commit eedca15. We simply put aio and epoll above io_uring in the available list; the default backend is the first one in that list. Issue: redpanda-data/redpanda#10105.
This change is to allow for the timer code in the stall detectors to be used in a profiler implementation. The hope is to be able to reuse a lot of the stall_detector codebase in the profiler without complicating the existing stall_detector implementation. The new posix_timer constructor sets sigev_notify_thread_id via the macro that musl libc / FreeBSD / Linux define for that field, mirroring upstream commit bbe1af3 (which patched the cpu_stall_detector_posix_timer constructor but predated and so missed the new posix_timer site). see https://sourceware.org/bugzilla/show_bug.cgi?id=27417
Moves some functions and classes defined in the stall_detector test to a separate header file so they can be reused in other tests.
In libgcc there is a critical section where the stack is being modified so execution can return to an exception's landing pad. However, this partially modified state isn't valid or specified by the dwarf info in the eh_frames, so a seg fault tends to occur when `backtrace()` tries to unwind through this partially modified stack and follows an invalid pointer in it.
in the configuration of the io_group. The io_groups can be used to get the original values which are needed to implement throttling in Redpanda.
Can this one be upstreamed? These guys I thought you had upstreamed already, or do we even still need them? IIRC they did change things in this area. Did you try dropping this one? I feel like we shouldn't need it anymore after the libfmt upgrade. Same for: If there is more work needed let me know, I will do it.
I tried: scylladb#2799 (comment), but it was rejected. That ended with a proposal to do it another way; see scylladb#2813 for some prep work, but the rest never happened.
done: scylladb#3396
Allows for setting the certificate and key for the client demo and setting the CA for the server demo. Additionally did some refactoring to print out the DN in the verification callback in the server. Signed-off-by: Michael Boquard <michael@redpanda.com> Co-authored-by: Michael Boquard <michael@redpanda.com> Co-authored-by: Rob Blafford <rob@redpanda.com>
This function allows for the creation of an `http::reply` from a thrown `redirect_exception`, with a url for the `Location` header and an optional `Retry-After` timeout value. (cherry picked from commit 25fbd79)
Seastar has a C++20 modules variation of the build in GHA, but our changes have left it quite broken as we don't use modules. Disable this build for now so we can get the build green and protect the seastar branch.
Pure non-functional refactor to split the long line into three. (cherry picked from commit a0155ba)
Switch the io-queue cost function from sum(IOPS cost, TP cost) to max(IOPS cost, TP cost). In its form today the io-queue fails to reach the advertised throughput/IOPS numbers from io-properties (see [seastar#2771](scylladb#2771) for details). The reason for this is easy to understand when looking at a disk like the ones from the m7gd AWS series. They (approximately) achieve full throughput as well as full IOPS at 4k IOP size. This means that for cost modelling of a single IO operation it would be enough to just look at one dimension, either throughput or IOPS, but not both. The current cost function, which is always additive in IOP and throughput cost, prices operations way too high. This leads to the aforementioned problem where the io-queue underperforms. Taking 4k IOPS on m7gd as an example, we only get around 60% of the disk performance. From various testing, the max cost function seems to model disks better than the sum cost function and avoids the problem mentioned above. Note this doesn't change the "outer" cost model of how the costs of individual operations are summed together. That also isn't modelled entirely right for more modern disks, but is pessimistically fine and not the focus of this patch.
The problem can be easily shown using the following io-tester config:

```
- name: highprio
  shards: [0]
  type: randwrite
  shard_info:
    parallelism: 32
    reqsize: 4kB|16kB|128kB
    shares: 1000
    think_time: 0
- name: lowprio
  shards: [0]
  type: randwrite
  shard_info:
    parallelism: 32
    reqsize: 4kB|16kB|128kB
    shares: 100
    think_time: 0
```

On m7gd.4xl (all numbers in MB/s):

- NOIP: Not using io-properties
- IP: Using io-properties
- MCF: Using io-properties with the max cost function

|      | 4k        |          |       | 16k       |          |       | 128k      |          |       |
|------|-----------|----------|-------|-----------|----------|-------|-----------|----------|-------|
|      | high prio | low prio | total | high prio | low prio | total | high prio | low prio | total |
| NOIP | 303       | 287      | 591   | 299       | 292      | 591   | 295       | 295      | 590   |
| IP   | 262       | 28       | 290   | 432       | 44       | 476   | 528       | 53       | 581   |
| MCF  | 476       | 56       | 532   | 543       | 55       | 598   | 544       | 55       | 599   |

We see that at low IOP size, using no io-properties we do get the full throughput but no scheduling (bad); using io-properties we do get scheduling but low throughput; and with the max cost function we get both. More data and extensive discussion can be found here:

- scylladb#2801 (comment)
- scylladb#2840 (comment)
- https://docs.google.com/spreadsheets/d/1u_9tRxePkq-OYFiba8xz4JY54P2VJ-JQBvhuBv04tmo

These tests also include latency tests in priority scenarios and different kinds of disks like EBS. The linked PR also includes a discussion about a different approach using a weighted cost function. Note that in general latency is often a function of throughput, as the io-queue arbitrarily keeping things in the queue induces latency. This is especially true for real-world use cases like RP, where write amplification is generally fairly high but at the same time can be absorbed at the cost of latency. (cherry picked from commit 9bf2094)
Add a scoped_system_alloc_fallback object, which changes the behavior when at least one of these objects is in scope: when a large allocation would otherwise fail, we instead fall back to the system allocator. This would allow limited use of the system allocator in cases where it is useful to do large allocations, which might otherwise fail due to fragmentation.
Disable the dpdk build as we don't use it, and it is long and breaks now and then; it still runs upstream.
It is possible for the connection to have closed before these functions are called. If this occurs, the program aborts. This will be caught in a follow-up commit. Signed-off-by: Michael Boquard <michael@redpanda.com>
As seen in scylladb#1784, having flush retry `io_submit` in a loop might be futile if the kernel internal queue is full and we need to reap completions to free up space for more iocbs. This change breaks out of the loop as soon as `io_submit` fails and just moves the remaining entries in the iocbs queue to the front so more entries can be queued (otherwise, we'd need a circular queue that would penalize the error-free path for the unlikely event of an error). Fixes scylladb#1784 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit 5138d5f)
EAGAIN is expected here when "Insufficient resources are available to queue any iocbs" (see io_submit(2)). Abort on any other error, as those indicate an internal error on our side. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit 9fff5f3)
The `aio_general_context` had the implicit assumption that in a single tick we would never queue more than `--max-networking-io-control-blocks` events/iocbs. This however ignores situations such as queuing multiple iocbs per socket per tick, having leftover iocbs in the queue from previous ticks via the new EAGAIN handling, or simply a lot more sockets being in use, which isn't prevented anywhere else. If this condition was hit (`last > end`) the reactor would just assert out and crash. To avoid this situation this patch introduces a backlog into which elements are enqueued when the original array is full and which can grow unbounded. This mirrors how the `aio_storage_context` works, which uses the `io_sink` for the same purpose. To avoid oversized allocations after startup, the split into two separate data structures is needed (instead of just regrowing the array). Further, the data structure from which the iocbs are passed into `io_submit` needs to be in contiguous memory (and also provide an API to use it, which most containers don't). `std::deque` is used in the backlog to avoid oversized allocations in the backlog itself. The existing array solution for `iocbs` is kept to fulfill the contiguous memory requirement. Further, we slightly change how EAGAIN is handled. Instead of backshifting the array we keep the array as is and just track the `begin` of the array across `flush` calls. This is possible now that the backlog handling is in place. This introduces "batching" and prevents degenerate cases where only a single element is being submitted, which would then result in repeated shifting of the whole array on each `flush` call. Given we use a chunked data structure like `std::deque`, erasing from the front of the backlog is relatively cheap and does not require shifting all the elements in the backlog. Hence, the per-iocb overhead is amortized constant. Note that in general we try to submit as many iocbs as possible per `io_submit` call.
Given the new behaviour of not backshifting the iocb array and immediately backfilling from the backlog, we might introduce `io_submit` calls that don't try to submit the max amount of iocbs. However we assume that if we ran into EAGAIN then either:

- We are still behind the next time around: it's unlikely we would succeed in submitting all the iocbs anyway.
- We have now caught up: we have introduced a single additional `io_submit` call which only submits `max_poll()/2` iocbs on average. The backlog will be drained at full `max_poll` per `io_submit`.

(cherry picked from commit d9175fc)
http::request::content is deprecated upstream, with the idea that you set the server into streaming mode and use the input_stream<> in the request directly. This is not a totally trivial change, so for now we want to keep using content; the functionality is the same as always, so we remove the deprecation from our fork for now. See also CORE-15051.
…back mode

When using scoped_system_alloc_fallback, large allocations are expected and intentional. Reduce the log level from warn to debug to avoid spamming logs in this case. Also change the message text to avoid triggering the BLL check, and skip the backtrace since it's not useful for expected allocations.
Add struct cert_info (serial number + expiry) to the public API and implement get_cert_info() and get_trust_list_info() on certificate_credentials, backed by virtual methods on credentials_impl. Both GnuTLS and OpenSSL backends extract serial and expiry from loaded certificates and trust lists. Port of the following commits from v26.2.x-pre onto the upstream crypto provider architecture: b8438b3 net/tls: Introduce cert_info and accessors e4c696a net/tls(ossl): Introduce cert_info and accessors 76b82e3 net/tls: Adjust type for cert_info.serial 06eaf07 net/tls: Replace cert_info::bytes with vector<byte>
Add a new reload_callback_with_creds callback type that receives the
reloaded certificate_credentials and trust file blob, in addition to the
changed files and exception_ptr. This allows callers to inspect the new
credentials (e.g., via get_cert_info()) at reload time without having to
rebuild them.
Add credentials_builder::get_trust_file_blob() to retrieve the loaded
trust file contents. Add build_reloadable_{certificate,server}_credentials
overloads accepting reload_callback_with_creds.
Add test_reload_certificates_with_creds test case.
Port of the following commits from v26.2.x-pre onto the upstream
crypto provider architecture:
f716e6a net/tls: Add reload_callback_with_creds
232567e tls: Include trust file contents with reload callback
3f86a53 tls_test: Add tests for new reload callback and cert_info accessors
Add enum class dn_format { legacy, rfc2253 } and an overload of
get_dn_information() that accepts a format parameter. The OpenSSL
backend switches X509_NAME_print_ex flags based on the format:
XN_FLAG_RFC2253 for rfc2253, and the legacy seastar flags for legacy.
GnuTLS ignores the parameter as it does not provide a mechanism to
change the DN output format.
Port of the following commit from v26.2.x-pre onto the upstream
crypto provider architecture:
291dc51 tls: Added support for fetching DN in RFC2253 format
OpenSSL's API contract is loose enough, and the impact wide enough (e.g. low-priority HTTPS traffic could crash the whole process), that it makes sense to only terminate here in debug builds. Restores RP-specific behavior on top of upstream PR scylladb#3369 (which removed the assert entirely in favor of just logging). Uses plain assert() rather than SEASTAR_ASSERT so it fires only in debug builds. Port of v26.2.x-pre commit 4152d2f onto upstream's verify_clean_error_queue function (different file: tls_openssl.cc vs the original ossl.cc).
pgellert left a comment:
I took a look at the TLS changes, and they look good to me
@StephanDollberg wrote:
Both have been dropped. The latter needs changes on RP side, which have been pushed as redpanda-data/redpanda#30395.
This PR shows the proposed contents of the v26.2.x branch after rebase, for review. It contains 66 redpanda-specific commits on top of scylladb/master at c302102b7. I don't expect anyone to review these line by line, as they have already been largely reviewed when originally checked in; instead, look at the overall approach and maybe do spot checks.
@dotnwat put you on here especially for the OpenSSL stuff.
Reference

- `proposed-v26.2.x-merge-base` (c302102b7c3ac10a02723167dcb155be908b135c): scylladb/master tip as of the rebase point. This is a frozen reference for the diff.
- `travisdowns:proposed-v26.2.x` (c6d7666d0081d7ea8fd10968525e986ba25a2c43): the proposed v26.2.x with all RP commits replayed.

What's in here
66 commits, broken down as:
Further commits from v26.2.x-pre were dropped or upstreamed during this work.
Reading the diff
The TLS surface in particular saw the biggest changes because upstream merged a pluggable crypto provider rewrite (Noah Watkins, scylladb#3360 series); our RP-specific TLS features (`cert_info`, `reload_callback_with_creds`, `dn_format`) were rewritten against that new architecture rather than carried forward as-is.

Additional details
Two comments on this PR have additional context:
v26.2.x-pre commit with its disposition in this PR (clean / edits / ported / upstreamed / dropped).