Proposed v26.2.x: 68 RP commits on scylladb/master c302102b7 (#277)
travisdowns wants to merge 66 commits into
Conversation
Seastar's http server implementation supports multiple listeners. Handler logic may need to know which listener a connection arrived on. Added a listener_idx field to `httpd::request` to allow the handler to identify the listener. Signed-off-by: Michal Maslanka <michal@vectorized.io>
Since an exception carries some text for the response body, the raising site might like to specify the content type, e.g. if it is json. Signed-off-by: John Spray <jcs@vectorized.io>
This enables throwing a base_exception from a json request handler with a json payload inside it. Signed-off-by: John Spray <jcs@vectorized.io>
Signed-off-by: John Spray <jcs@vectorized.io>
Prior to this patch seastar only exposes one global metrics::impl::impl object which holds all metric-related data for one application. This patch changes the implementation details such that multiple metrics::impl::impl objects can exist for any given application. These objects are stored in a map on each shard and created dynamically whenever requested. A metrics::impl::impl is identified by an integer handle that acts as the key for the storage map. Implementation note: in order to avoid issues caused by the ordering of static thread_local objects I had to declare the storage in reactor.cc. (cherry picked from commit 585a8af)
This patch extends the metrics internal apis to use a specific metrics::impl::impl object identified by its integer handle. (cherry picked from commit 6ee4af7)
Add a public method to metric_groups_impl that exposes the handle of the internal implementation it is using. This is required in order for the metric_groups class to be able to reset itself to the configured implementation handle.
This patch extends the metrics user-facing apis to use a specific metrics::impl::impl object identified by its integer handle. Note that the constructor of 'metric_groups' is marked explicit in this patch, which requires updating two call sites where the constructor was used implicitly.
This patch removes two subsequent calls to `get_local_impl` and reuses the returned handle in that scope.
This patch extends the user facing prometheus apis allowing the user to specify the internal metrics implementation to be used through a handle. Additionally, 'add_prometheus_routes' now takes an argument that specifies the route on which to advertise the metrics. This enables different metrics "namespaces" to be served by different endpoints in isolation. (cherry picked from commit 6189522)
This patch extends the scollectd apis with the ability to select the internal metrics implementation to be used by providing a handle. (cherry picked from commit d4331d1)
This patch adds a 'get_skip_when_empy' getter to the 'registered_metric' class. It is used by follow-up patches in order to replicate metrics.
This patch adds private methods to the 'metrics::impl' class that deal with the creation of replicated metrics. They will be used to build the public api in future commits.
This patch adds private helpers to 'metrics::impl' that deal with the removal of replicated metric families from their destination implementation. These methods will be used in subsequent commits to manage the lifetime of replicated metrics.
This patch adds a public method to the 'metrics::impl' class: 'set_metric_families_to_replicate'. When this method is called the families that match any of the specifications will be replicated on the specified destinations.
This patch extends the metric registration and unregistration processes to make them aware of metric replication. In the case of metric registration, if the new metric belongs to a family that matches one of the replication specs, then a replicated metric is created accordingly. For unregistration of a metric, the replicated metric is unregistered too if one exists.
This patch exposes a method in the public interface of the metrics
module ('replicate_metric_families'), which enables metric replication
internally for the requested metric families.
Extends the metrics api to allow changing the aggregation labels of a metrics family. Otherwise one had to un-register every single metric instance in a metric family and then re-register with the changed aggregation labels. For metric families with thousands of instances (e.g.: histograms with lots of different labels) this is quite expensive. With this change we avoid the full reconstruction of the metrics family and all its metrics. Only the work associated with marking the metrics `dirty()` is needed then.
This commit extends the public interface of scheduling_group to expose usage statistics (e.g. runtime).
Redpanda uses this logger name for its http client, so we change the logger name in Seastar to avoid duplicate logger registration exceptions.
We sort of inadvertently picked up the change to io_uring when rebasing our seastar fork, but for the coming release we'd like to keep using aio to reduce risk and give sufficient time to do performance tests on io_uring. This effectively rolls back the upstream commit eedca15. We simply put aio and epoll above io_uring in the available list; the default backend is the first one in that list. Issue: redpanda-data/redpanda#10105.
This change is to allow for the timer code in the stall detectors to be used in a profiler implementation. The hope is to be able to reuse a lot of the stall_detector codebase in the profiler without complicating the existing stall_detector implementation. The new posix_timer constructor sets sigev_notify_thread_id via the macro that musl libc / FreeBSD / Linux define for that field, mirroring upstream commit bbe1af3 (which patched the cpu_stall_detector_posix_timer constructor but predated and so missed the new posix_timer site). see https://sourceware.org/bugzilla/show_bug.cgi?id=27417
Moves some functions and classes defined in the stall_detector test to a separate header file so they can be reused in other tests.
In libgcc there is a critical section where the stack is being modified so execution can return to an exception's landing pad. However, this partially modified state isn't valid or specified by the dwarf info in the eh_frames, so a seg fault tends to occur when `backtrace()` tries to unwind through this partially modified stack and follows an invalid pointer in it.
in the configuration of the io_group. The io_groups can be used to get the original values which are needed to implement throttling in Redpanda.
Can this one be upstreamed? These guys I thought you had upstreamed already, or do we even still need them? IIRC they did change things in this area. Did you try dropping this one? I feel like we shouldn't need it anymore after the libfmt upgrade. Same for: If there is more work needed let me know, I will do it.
I tried: scylladb#2799 (comment), but it was rejected. That ended with a proposal to do it another way; see scylladb#2813 for some prep work, but the rest never happened.
done: scylladb#3396
Allows for setting the certificate and key for the client demo and setting the CA for the server demo. Additionally did some refactoring to print out the DN in the verification callback in the server. Signed-off-by: Michael Boquard <michael@redpanda.com> Co-authored-by: Michael Boquard <michael@redpanda.com> Co-authored-by: Rob Blafford <rob@redpanda.com>
This function allows for the creation of an `http::reply` from a thrown `redirect_exception`, with a url for the `Location` header and an optional `Retry-After` timeout value. (cherry picked from commit 25fbd79)
Seastar has a C++20 modules variation of the build in GHA, but our changes have left it quite broken as we don't use modules. Disable this build for now so we can get the build green and protect the seastar branch.
Pure non-functional refactor to split the long line into three. (cherry picked from commit a0155ba)
Switch the io-queue cost function from sum(IOPS cost, TP cost) to max(IOPS cost, TP cost). In its form today the io-queue fails to reach the advertised throughput/IOPS numbers from io-properties (see [seastar#2771](scylladb#2771) for details). The reason for this is easy to understand when looking at a disk like the ones from the m7gd AWS series. They (approximately) achieve full throughput as well as full IOPS at 4k IOP size. This means that for cost modelling of a single IO operation it would be enough to just look at one dimension, either throughput or IOPS, but not both. The current cost function, which is always additive in IOP and throughput cost, prices operations way too high. This leads to the aforementioned problem where the io-queue underperforms. Taking 4k IOPS on m7gd as an example, we only get around 60% of the disk performance. From various testing, the max cost function seems to model disks better than the sum cost function and avoids the problem mentioned above. Note this doesn't change the "outer" cost model of how the costs of individual operations are summed together. That also isn't modelled entirely right for more modern disks, but is pessimistically fine and not the focus of this patch.
The problem can be easily shown using the following io-tester config:

```
- name: highprio
  shards: [0]
  type: randwrite
  shard_info:
    parallelism: 32
    reqsize: 4kB|16kB|128kB
    shares: 1000
    think_time: 0
- name: lowprio
  shards: [0]
  type: randwrite
  shard_info:
    parallelism: 32
    reqsize: 4kB|16kB|128kB
    shares: 100
    think_time: 0
```

On m7gd.4xl (all numbers in MB/s):

- NOIP: Not using io-properties
- IP: Using io-properties
- MCF: Using io-properties with the max cost function

|      | 4k        |          |       | 16k       |          |       | 128k      |          |       |
|------|-----------|----------|-------|-----------|----------|-------|-----------|----------|-------|
|      | high prio | low prio | total | high prio | low prio | total | high prio | low prio | total |
| NOIP | 303       | 287      | 591   | 299       | 292      | 591   | 295       | 295      | 590   |
| IP   | 262       | 28       | 290   | 432       | 44       | 476   | 528       | 53       | 581   |
| MCF  | 476       | 56       | 532   | 543       | 55       | 598   | 544       | 55       | 599   |

We see that at low IOP size, using no io-properties we do get the full throughput but no scheduling (bad); using io-properties we do get scheduling but low throughput; and with the max cost function we get both. More data and extensive discussion can be found here:

- scylladb#2801 (comment)
- scylladb#2840 (comment)
- https://docs.google.com/spreadsheets/d/1u_9tRxePkq-OYFiba8xz4JY54P2VJ-JQBvhuBv04tmo

These tests also include latency tests in priority scenarios and different kinds of disks like EBS. The linked PR also includes a discussion about a different approach using a weighted cost function. Note that in general latency is often a function of throughput, as the io-queue arbitrarily keeping things in the queue induces latency. This is especially true for real-world use cases like RP, where write amplification is generally fairly high but at the same time can be absorbed at the cost of latency. (cherry picked from commit 9bf2094)
Add a scoped_system_alloc_fallback object, which changes the behavior when at least one of these objects is in scope: when a large allocation would otherwise fail, we instead fall back to the system allocator. This would allow limited use of the system allocator in cases where it is useful to do large allocations, which might otherwise fail due to fragmentation.
Disable the dpdk build as we don't use it, and it is long and breaks now and then; it still runs upstream.
It is possible for the connection to have closed before these functions are called. If this occurs, the program aborts. This will be caught in a follow-up commit. Signed-off-by: Michael Boquard <michael@redpanda.com>
As seen in scylladb#1784, having flush retry `io_submit` in a loop might be futile if the kernel internal queue is full and we need to reap completions to free up space for more iocbs. This change breaks out of the loop as soon as `io_submit` fails and just moves the remaining entries in the iocbs queue to the front so more entries can be queued (otherwise, we'd need a circular queue that would penalize the error-free path for the unlikely event of an error). Fixes scylladb#1784 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit 5138d5f)
EAGAIN is expected here when "Insufficient resources are available to queue any iocbs" (see io_submit(2)). Abort on any other error, as those indicate an internal error on our side. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit 9fff5f3)
The `aio_general_context` had the implicit assumption that in a single tick we would never queue more than `--max-networking-io-control-blocks` events/iocbs. This however ignores situations such as queuing multiple iocbs per socket per tick, having leftover iocbs in the queue from previous ticks via the new EAGAIN handling, or simply a lot more sockets being in use, which isn't prevented anywhere else. If this condition was hit (`last > end`) the reactor would just assert out and crash. To avoid this situation this patch introduces a backlog into which elements are enqueued when the original array is full and which can grow unbounded. This mirrors how the `aio_storage_context` works, which uses the `io_sink` for the same purpose. To avoid oversized allocations after startup, the split into two separate data structures is needed (instead of just regrowing the array). Further, the data structure from which the iocbs are passed into `io_submit` needs to be in contiguous memory (and also provide an API to use it, which most containers don't). `std::deque` is used in the backlog to avoid oversized allocations in the backlog itself. The existing array solution for `iocbs` is kept to fulfill the contiguous memory requirement. Further, we slightly change how EAGAIN is handled. Instead of backshifting the array we keep the array as is and just track the `begin` of the array across `flush` calls. This is possible now that the backlog handling is in place. This introduces "batching" and prevents degenerate cases where only a single element is being submitted, which would then result in repeated shifting of the whole array on each `flush` call. Given we use a chunked data structure like `std::deque`, erasing from the front of the backlog is relatively cheap and does not require shifting all the elements in the backlog. Hence, the per-iocb overhead is amortized constant. Note that in general we try to submit as many iocbs as possible per `io_submit` call.
Given the new behaviour of not backshifting the iocb array and immediately backfilling from the backlog, we might introduce `io_submit` calls that don't try to submit the max amount of iocbs. However we assume that if we ran into EAGAIN then either:

- We are still behind the next time around: it's unlikely we would succeed in submitting all the iocbs anyway.
- We have now caught up: we have introduced a single additional `io_submit` call which only submits `max_poll()/2` iocbs on average. The backlog will be drained at full `max_poll` per `io_submit`.

(cherry picked from commit d9175fc)
http::request::content is deprecated upstream, with the idea that you set the server into streaming mode and use the input_stream<> in the request directly. This is not a totally trivial change, so for now we want to keep using content; the functionality is the same as always, so we remove the deprecation from our fork for now. See also CORE-15051.
…back mode

When using scoped_system_alloc_fallback, large allocations are expected and intentional. Reduce the log level from warn to debug to avoid spamming logs in this case. Also change the message text to avoid triggering the BLL check, and skip the backtrace since it's not useful for expected allocations.
Add struct cert_info (serial number + expiry) to the public API and implement get_cert_info() and get_trust_list_info() on certificate_credentials, backed by virtual methods on credentials_impl. Both GnuTLS and OpenSSL backends extract serial and expiry from loaded certificates and trust lists. Port of the following commits from v26.2.x-pre onto the upstream crypto provider architecture: b8438b3 net/tls: Introduce cert_info and accessors e4c696a net/tls(ossl): Introduce cert_info and accessors 76b82e3 net/tls: Adjust type for cert_info.serial 06eaf07 net/tls: Replace cert_info::bytes with vector<byte>
Add a new reload_callback_with_creds callback type that receives the
reloaded certificate_credentials and trust file blob, in addition to the
changed files and exception_ptr. This allows callers to inspect the new
credentials (e.g., via get_cert_info()) at reload time without having to
rebuild them.
Add credentials_builder::get_trust_file_blob() to retrieve the loaded
trust file contents. Add build_reloadable_{certificate,server}_credentials
overloads accepting reload_callback_with_creds.
Add test_reload_certificates_with_creds test case.
Port of the following commits from v26.2.x-pre onto the upstream
crypto provider architecture:
f716e6a net/tls: Add reload_callback_with_creds
232567e tls: Include trust file contents with reload callback
3f86a53 tls_test: Add tests for new reload callback and cert_info accessors
Add enum class dn_format { legacy, rfc2253 } and an overload of
get_dn_information() that accepts a format parameter. The OpenSSL
backend switches X509_NAME_print_ex flags based on the format:
XN_FLAG_RFC2253 for rfc2253, and the legacy seastar flags for legacy.
GnuTLS ignores the parameter as it does not provide a mechanism to
change the DN output format.
Port of the following commit from v26.2.x-pre onto the upstream
crypto provider architecture:
291dc51 tls: Added support for fetching DN in RFC2253 format
OpenSSL's API contract is loose enough, and the impact wide enough (e.g. low-priority HTTPS traffic could crash the whole process), that it makes sense to only terminate here in debug builds. Restores RP-specific behavior on top of upstream PR scylladb#3369 (which removed the assert entirely in favor of just logging). Uses plain assert() rather than SEASTAR_ASSERT so it fires only in debug builds. Port of v26.2.x-pre commit 4152d2f onto upstream's verify_clean_error_queue function (different file: tls_openssl.cc vs the original ossl.cc).
pgellert left a comment:
I took a look at the TLS changes, and they look good to me
@StephanDollberg wrote:
Both have been dropped. The latter needs changes on RP side, which have been pushed as redpanda-data/redpanda#30395.
This PR shows the proposed contents of the v26.2.x branch after rebase, for review. It contains 66 redpanda-specific commits on top of scylladb/master at c302102b7. I don't expect anyone to review these line by line, as they have already been largely reviewed when originally checked in; instead, look at the overall approach and maybe do spot checks.
@dotnwat put you on here especially for the OpenSSL stuff.
Reference

- `proposed-v26.2.x-merge-base` (c302102b7c3ac10a02723167dcb155be908b135c): scylladb/master tip as of the rebase point. This is a frozen reference for the diff.
- `travisdowns:proposed-v26.2.x` (c6d7666d0081d7ea8fd10968525e986ba25a2c43): the proposed v26.2.x with all RP commits replayed.

What's in here
66 commits, broken down as:
Further commits from v26.2.x-pre were dropped or upstreamed during this work.
Reading the diff
The TLS surface in particular saw the biggest changes because upstream merged a pluggable crypto provider rewrite (Noah Watkins, scylladb#3360 series); our RP-specific TLS features (`cert_info`, `reload_callback_with_creds`, `dn_format`) were rewritten against that new architecture rather than carried forward as-is.

Additional details
Two comments on this PR have additional context:
v26.2.x-pre commit with its disposition in this PR (clean / edits / ported / upstreamed / dropped).