Avoid io_context contention in high-throughput SSL stream reads by filling buffers fully #1712
Open
hofst wants to merge 1 commit into chriskohlhoff:master from
Conversation
Avoid io_context contention in high-throughput SSL stream reads by filling buffers fully

The current implementation of `read_some` in `asio::ssl` decodes only one TLS segment per operation. The TLS maximum segment size is 16KB, which leads to small reads per io_context operation. High-throughput real-world scenarios observe as little as 9KB of buffer utilization per operation, causing significant overhead and thread overscheduling. This matters because it is hard to get much more than ~600k operations per second out of a single io_context. While running multiple io_contexts is possible, the per-operation overhead applies to them as well, and implementations become significantly more complex when multiple io_contexts are required, since they need dedicated load balancing and scheduling.

On our production machines (e.g., `r8i.48xlarge` with a 75Gb/s interface or `x2idn.32xlarge` with a 100Gb/s interface), we see significant contention and a maximum network throughput of ~25Gb/s at very high (up to 100%) CPU utilization during concurrent S3 downloads. With this PR, system CPU utilization drops to ~10% while throughput increases to 70Gb/s and 92Gb/s respectively.

This PR modifies the read operation to loop, performing reads until either:

1. There is no more data in the system buffer (the read would block).
2. The user-provided buffer is full.

Additionally, the internal buffer sizes are increased from 17KB to 128KB. This part is open for suggestions: should the buffer sizes be configurable for high-throughput scenarios, e.g., at runtime or via compile-time macros?
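As a rough standalone illustration of the looping strategy (this is a mock, not the actual `asio::ssl` engine code; `MockSource` and `read_filling` are hypothetical names invented for the sketch):

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>

// Hypothetical mock of a non-blocking byte source: delivers at most `chunk`
// bytes per read (mimicking one decoded TLS segment), and reports
// would_block when the system buffer is drained.
struct MockSource {
  std::deque<char> data;
  std::size_t chunk = 16384;

  std::size_t read(char* out, std::size_t max, bool& would_block) {
    if (data.empty()) { would_block = true; return 0; }
    would_block = false;
    std::size_t n = std::min({max, chunk, data.size()});
    for (std::size_t i = 0; i < n; ++i) { out[i] = data.front(); data.pop_front(); }
    return n;
  }
};

// Sketch of the patched behavior: keep decoding segments until the caller's
// buffer is full or the source would block, instead of returning after the
// first segment.
std::size_t read_filling(MockSource& src, char* buf, std::size_t len) {
  std::size_t total = 0;
  while (total < len) {  // exit condition 2: user buffer full
    bool would_block = false;
    std::size_t n = src.read(buf + total, len - total, would_block);
    if (would_block) break;  // exit condition 1: no more data available
    total += n;
  }
  return total;
}
```

With a 16KB-per-segment source, a single `read_filling` call drains several segments into the caller's buffer, where the pre-patch behavior would have cost one io_context operation per segment.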
This was referenced Feb 6, 2026
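On the configurability question, a compile-time override could be sketched as follows (the macro name `ASIO_SSL_MAX_BUFFER_SIZE` is hypothetical, invented for illustration, not an existing Asio option):

```cpp
#include <cstddef>

// Hypothetical compile-time knob: users building for high-throughput
// workloads could pass -DASIO_SSL_MAX_BUFFER_SIZE=... to override it.
#ifndef ASIO_SSL_MAX_BUFFER_SIZE
# define ASIO_SSL_MAX_BUFFER_SIZE 131072  // 128KB default, up from 17KB
#endif

namespace ssl_detail {
// The engine's internal input/output buffers would size themselves from
// this constant instead of a hard-coded value.
constexpr std::size_t max_buffer_size = ASIO_SSL_MAX_BUFFER_SIZE;
static_assert(max_buffer_size >= 17408, "must hold at least one TLS record");
}
```

A runtime knob (e.g., a context or stream option) would be more flexible but touches the public API, whereas a macro keeps the change contained to the build configuration.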
Happy to get feedback on the approach!