lsm/io: round read buffer by max of memory/disk DMA alignment#30443
Conversation
ioarray::aligned() requires its size to be a multiple of its alignment. disk_file_reader::read passed memory_dma_alignment() as the alignment but rounded the size by disk_read_dma_alignment(), which is a weaker constraint. When the two alignments differ, the size isn't a multiple of the alignment and the precondition fires. Needed by the seastar v26.2.x rebase: rebased seastar (via upstream "file: Apply physical_block_size override to filesystem files") returns disk_read_dma_alignment() = 512 (the O_DIRECT minimum) on non-XFS filesystems instead of the previous 4096 default, exposing the mismatch.
There was a problem hiding this comment.
Pull request overview
Adjusts LSM disk read buffer sizing to satisfy ioarray::aligned()’s requirement that size % alignment == 0 when Seastar’s memory_dma_alignment() and disk_read_dma_alignment() differ (notably after the Seastar v26.2.x alignment behavior change).
Changes:
- Round the read buffer size up using
std::max(memory_dma_alignment, disk_read_dma_alignment)to ensureioarray::aligned(memory_alignment, size)preconditions hold across differing alignments. - Add the required
<algorithm>include forstd::max.
| size_t disk_alignment = _file.disk_read_dma_alignment(); | ||
| // ioarray requires its size to be a multiple of its alignment, so use the | ||
| // max of the two seastar alignments. | ||
| size_t max_alignment = std::max(memory_alignment, disk_alignment); |
There was a problem hiding this comment.
I guess this works because we make the (totally reasonable) assumption that the larger of these two is divisible by the smaller (e.g. 4k/512)
There was a problem hiding this comment.
Yea in practice this feels like it'll always bee the case, but to capture intent maybe we should use std::lcm()?
There was a problem hiding this comment.
i don't think we need to do anything, but if we wanted to do something i'd advocate for an assertion.
There was a problem hiding this comment.
I guess this works because we make the (totally reasonable) assumption that the larger of these two is divisible by the smaller (e.g. 4k/512)
These are always powers of two, effectively guaranteed by the kernel/sesatar (which both rely on it), though under-documented, so the condition holds under that assumption. That means the smaller always divides the larger.
i don't think we need to do anything, but if we wanted to do something i'd advocate for an assertion.
We almost have that assertion, the same one that triggered this fix, in the ioarray constructor (the 2nd one):
dassert(
(alignment & (alignment - 1)) == 0,
"alignment {} must be a power of two",
alignment);
dassert(
size % alignment == 0,
"size {} must be a multiple of alignment {}",
size,
It's not exactly the same since it depends on the specific size, so it wouldn't trigger every time if we had a not divisible relationship.
I think I'd rather document the guarantee in seastar and assert there. Let me try that.
There was a problem hiding this comment.
scylladb/seastar#3399 to lock in the n^2 guarantee
disk_file_reader::readpassesmemory_dma_alignment()toioarray::aligned()as the alignment, but rounds the buffer size bydisk_read_dma_alignment().ioarray::aligned()'s precondition issize % alignment == 0, which is only met when those two seastaralignments are equal — which they have been on every current
configuration, so the latent bug never fired.
Exposed by the seastar v26.2.x rebase. Upstream commit
15adc3ce5"file: Apply physical_block_size override to filesystemfiles"
unconditionally runs a new
filesystem_alignments()path on everyposix file, which hardcodes
disk_read = 512(the O_DIRECT minimum)for non-XFS filesystems while leaving
memory_dma_alignmentat itsprior default. XFS goes through its own override path that sets both
memoryanddisk_readfrom the device's logical sector size, so XFSis unaffected.
Confirmed locally with a small
open_file_dmaprobe built againstboth the old and new seastar SHAs:
memory=512, disk_read=512memory=512, disk_read=512(unchanged)memory=512, disk_read=512memory=512, disk_read=512(unchanged)memory=4096, disk_read=4096memory=4096, disk_read=512memory=4096, disk_read=4096memory=4096, disk_read=512With
memory > disk_read, the precondition fires:across 5
//src/v/lsm/...and//src/v/cloud_topics/...tests inPR #30428's
build-debug-clang-arm64job (whose bazel sandbox lives ona non-XFS filesystem). The failure doesn't reproduce on x86_64 hosts
whose bazel sandbox is on XFS, because there
memory == disk_read == 512and every
align_up(n, 512)trivially divides 512.Round the size up by
std::max(memory_alignment, disk_alignment)sothe ioarray precondition holds while still over-allocating only by the
gap between the two alignments.
Tracked as CORE-16295.
Backports Required
Release Notes