lsm/io: round read buffer by max of memory/disk DMA alignment #30443

Merged: travisdowns merged 1 commit into redpanda-data:dev from travisdowns:td-ioarray-max-alignment on May 12, 2026

Member

@travisdowns travisdowns commented May 11, 2026

disk_file_reader::read passes memory_dma_alignment() to
ioarray::aligned() as the alignment, but rounds the buffer size by
disk_read_dma_alignment(). ioarray::aligned()'s precondition is
size % alignment == 0, which is only met when those two seastar
alignments are equal — which they have been on every current
configuration, so the latent bug never fired.
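The mismatch can be illustrated with a minimal, standalone sketch. It does not use the real ioarray or seastar types: `align_up`, `precondition_holds`, and `old_sizing_ok` are hypothetical helpers that only model the arithmetic described above, assuming power-of-two alignments.

```cpp
#include <cassert>
#include <cstddef>

// Round n up to the next multiple of a power-of-two alignment.
constexpr std::size_t align_up(std::size_t n, std::size_t alignment) {
    return (n + alignment - 1) & ~(alignment - 1);
}

// Hypothetical stand-in for the ioarray::aligned() precondition:
// the size must be a multiple of the alignment.
constexpr bool precondition_holds(std::size_t alignment, std::size_t size) {
    return size % alignment == 0;
}

// Models the pre-fix sizing: the size is rounded by the disk read
// alignment, but the buffer is requested with the memory alignment.
constexpr bool old_sizing_ok(
  std::size_t memory_alignment, std::size_t disk_alignment, std::size_t n) {
    return precondition_holds(memory_alignment, align_up(n, disk_alignment));
}

// Equal alignments satisfy the precondition; memory > disk_read may not.
static_assert(old_sizing_ok(512, 512, 100));
static_assert(!old_sizing_ok(4096, 512, 100));
```

With memory = 4096 and disk_read = 512, a 100-byte request is rounded to 512 bytes, which is not a multiple of 4096.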

Exposed by the seastar v26.2.x rebase. Upstream commit 15adc3ce5
"file: Apply physical_block_size override to filesystem files"
unconditionally runs a new filesystem_alignments() path on every
posix file, which hardcodes disk_read = 512 (the O_DIRECT minimum)
for non-XFS filesystems while leaving memory_dma_alignment at its
prior default. XFS goes through its own override path that sets both
memory and disk_read from the device's logical sector size, so XFS
is unaffected.

Confirmed locally with a small open_file_dma probe built against
both the old and new seastar SHAs:

| Mount | FS | old (a0b4f2a6) | new (097536ff) |
|---|---|---|---|
| /mnt/bazel | xfs | memory=512, disk_read=512 | memory=512, disk_read=512 (unchanged) |
| /mnt/xfs | xfs | memory=512, disk_read=512 | memory=512, disk_read=512 (unchanged) |
| /tmp | ext4 | memory=4096, disk_read=4096 | memory=4096, disk_read=512 |
| /home/td | ext4 (fscrypt; probably behaves like overlayfs etc.) | memory=4096, disk_read=4096 | memory=4096, disk_read=512 |

With memory > disk_read, the precondition fires:

Assert failure: (src/v/bytes/ioarray.cc:59) 'size % alignment == 0'
size 512 must be a multiple of alignment 4096

across 5 //src/v/lsm/... and //src/v/cloud_topics/... tests in
PR #30428's build-debug-clang-arm64 job (whose bazel sandbox lives on
a non-XFS filesystem). The failure doesn't reproduce on x86_64 hosts
whose bazel sandbox is on XFS, because there memory == disk_read == 512
and every align_up(n, 512) is trivially a multiple of 512.

Round the size up by std::max(memory_alignment, disk_alignment) so
the ioarray precondition holds, while over-allocating by at most the
gap between the two alignments compared with the previous rounding.
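A minimal sketch of the rounding described above, outside of the real disk_file_reader code. `read_buffer_size` and `align_up` are hypothetical names used only for illustration, and the sketch assumes both alignments are powers of two.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Round n up to the next multiple of a power-of-two alignment.
constexpr std::size_t align_up(std::size_t n, std::size_t alignment) {
    return (n + alignment - 1) & ~(alignment - 1);
}

// Hypothetical helper: size a read buffer by the larger of the two
// alignments so the result is a multiple of both (given powers of two).
constexpr std::size_t read_buffer_size(
  std::size_t memory_alignment, std::size_t disk_alignment, std::size_t n) {
    const std::size_t max_alignment = std::max(
      memory_alignment, disk_alignment);
    return align_up(n, max_alignment);
}

// The failing case from the assert: 100 bytes now rounds to 4096, not 512.
static_assert(read_buffer_size(4096, 512, 100) % 4096 == 0);
```

When the two alignments are equal, this reduces to the previous behavior, so configurations that worked before are unaffected.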

Tracked as CORE-16295.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x

Release Notes

  • none

ioarray::aligned() requires its size to be a multiple of its alignment.
disk_file_reader::read passed memory_dma_alignment() as the alignment
but rounded the size by disk_read_dma_alignment(), which is a weaker
constraint. When the two alignments differ, the size isn't a multiple
of the alignment and the precondition fires.

Needed by the seastar v26.2.x rebase: rebased seastar (via upstream
"file: Apply physical_block_size override to filesystem files") returns
disk_read_dma_alignment() = 512 (the O_DIRECT minimum) on non-XFS
filesystems instead of the previous 4096 default, exposing the
mismatch.
Copilot AI review requested due to automatic review settings May 11, 2026 20:11
@travisdowns travisdowns requested a review from dotnwat May 11, 2026 20:11
Contributor

Copilot AI left a comment

Pull request overview

Adjusts LSM disk read buffer sizing to satisfy ioarray::aligned()’s requirement that size % alignment == 0 when Seastar’s memory_dma_alignment() and disk_read_dma_alignment() differ (notably after the Seastar v26.2.x alignment behavior change).

Changes:

  • Round the read buffer size up using std::max(memory_dma_alignment, disk_read_dma_alignment) to ensure ioarray::aligned(memory_alignment, size) preconditions hold across differing alignments.
  • Add the required <algorithm> include for std::max.

Comment thread src/v/lsm/io/file_io.cc
size_t disk_alignment = _file.disk_read_dma_alignment();
// ioarray requires its size to be a multiple of its alignment, so use the
// max of the two seastar alignments.
size_t max_alignment = std::max(memory_alignment, disk_alignment);
Member

I guess this works because we make the (totally reasonable) assumption that the larger of these two is divisible by the smaller (e.g. 4k/512)

Contributor

Yeah, in practice this feels like it'll always be the case, but to capture intent maybe we should use std::lcm()?

Member

i don't think we need to do anything, but if we wanted to do something i'd advocate for an assertion.

Member Author

@travisdowns travisdowns May 12, 2026

> I guess this works because we make the (totally reasonable) assumption that the larger of these two is divisible by the smaller (e.g. 4k/512)

These are always powers of two, effectively guaranteed by the kernel/seastar (which both rely on it), though under-documented, so the condition holds under that assumption. That means the smaller always divides the larger.

> i don't think we need to do anything, but if we wanted to do something i'd advocate for an assertion.

We almost have that assertion, the same one that triggered this fix, in the ioarray constructor (the 2nd one):

    dassert(
      (alignment & (alignment - 1)) == 0,
      "alignment {} must be a power of two",
      alignment);
    dassert(
      size % alignment == 0,
      "size {} must be a multiple of alignment {}",
      size,
      alignment);

It's not exactly the same since it depends on the specific size, so it wouldn't trigger every time if the two alignments were not in a divisible relationship.

I think I'd rather document the guarantee in seastar and assert there. Let me try that.
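As a rough illustration of the reasoning in this thread (not the seastar or redpanda code), the sketch below checks that for power-of-two alignments the larger one is a common multiple of both, and compares that with std::lcm for a non-power-of-two pair. All function names here are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>

// Same bit trick as the dassert above, with an explicit non-zero check.
constexpr bool is_power_of_two(std::size_t x) {
    return x != 0 && (x & (x - 1)) == 0;
}

// True when the larger value is a multiple of both inputs, which is
// what rounding by std::max silently relies on.
constexpr bool max_is_common_multiple(std::size_t a, std::size_t b) {
    const std::size_t m = std::max(a, b);
    return m % a == 0 && m % b == 0;
}

// The std::lcm alternative raised in review; needs no divisibility assumption.
constexpr std::size_t common_alignment_lcm(std::size_t a, std::size_t b) {
    return std::lcm(a, b);
}

// For powers of two, max and lcm agree.
static_assert(common_alignment_lcm(512, 4096) == std::max<std::size_t>(512, 4096));
```

For a hypothetical pair such as 512 and 768, the max is not a common multiple while the lcm is, which is why the power-of-two guarantee matters for the std::max approach.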

Member Author

scylladb/seastar#3399 to lock in the power-of-two (2^n) guarantee

Member

std::has_single_bit

amaze

@dotnwat dotnwat requested a review from andrwng May 11, 2026 21:20
@travisdowns travisdowns merged commit 82875b4 into redpanda-data:dev May 12, 2026
34 checks passed

4 participants