virtio/block: rewritten and enabled multi-segments#212

Open
dreamliner787-9 wants to merge 6 commits intomainfrom
virtio_block_fixes

Conversation

@dreamliner787-9
Contributor

@dreamliner787-9 commented Apr 7, 2026

Depends on #214 for the new virtio descriptor chain R/W utility functions.

Rewrote the majority of the virtio block code to make it easier to read and less bug-prone. This made virtio block more resilient against buffer overflows, and fixed a bug where the request body was undercounted by one descriptor, leading to data corruption when there are multiple data descriptors in one request.

Enabled multi-segment requests by increasing seg_max from 1 to 8 to pave the way for UEFI and Windows booting. Guests can now make requests up to 32 KiB in size, rather than just 4 KiB as before. This can be increased further, but risks running out of space in the sDDF Block data region.
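As a sanity check on the numbers above, the new ceiling is just the segment count multiplied by the per-segment size. This is a minimal sketch assuming each data segment covers at most one 4 KiB page (which is how the 32 KiB figure falls out); the names here are illustrative, not taken from the libvmm sources:

```python
# Hypothetical names for illustration; only the arithmetic is from the PR.
SEG_MAX = 8          # new value of VIRTIO_BLK_SEG_MAX (was 1)
SEGMENT_SIZE = 4096  # assumed bytes per data segment (one 4 KiB page)

max_request_bytes = SEG_MAX * SEGMENT_SIZE
print(max_request_bytes)  # 32768, i.e. 32 KiB (was 4096 with seg_max = 1)
```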

Informally documented our entire virtio block device implementation.

Contributor

@midnightveil left a comment


You've pulled some features out into functions for implementing multiple segments, which is good. That makes sense.

I still don't like the way that the request overlapping works. It feels very hacky, and it is definitely significantly slower, because it iterates through the entire set of active requests for every request, which approaches O(n²). It also ignores the existence of our barrier API, and while we don't have defined semantics for barriers yet, I suspect it wouldn't work at all.

Also, you combine a rewrite (bug fixes) with adding a feature (multi-segments) in one commit; that should really be two commits. You can leave in the skeleton functions used for multi-segments and make them only handle one segment, but they really shouldn't be in one commit: there are too many changes to tell them apart easily.

In my mind we need a mental model written down and documented somewhere before we merge this. I can sort of intuit it here, but I'd like to see explicitly written out the various states requests can be in and how they interact, so that if we run into issues later we can tell whether the code is behaving contrary to our model, or whether the model the code implements is itself wrong. In other words, I think this deserves an informal abstract specification and a proof of correctness against that spec.


As for "better" ways to implement this that isn't O(n²):

  1. At the very least we could use some kind of statically-sized "range tree"-like data structure. I put "range tree" in quotes because I don't know if there's a canonical name for these; there are probably other names. What I mean is some sort of data structure that lets you efficiently query a range in O(log n) time and check for overlaps, inserting your new range if there is none. This could then store request IDs, reducing the cost of future requests, and it should be possible to size it to hold the same number of entries as we have queue entries.

  2. What I personally think is the better solution: take a leaf out of ARM's book with their exclusive monitors. What we are implementing here is the same thing: exclusive loads/stores to a memory location, of which we only allow one at a time, and we want to prevent multiple writers (or readers and writers) touching the same 'exclusive granule' simultaneously. This could be done by storing, for each 4k region in the buffer window, a single bit indicating whether the region is open (no active requests) or closed (active requests). That gives you O(1) random access at the cost of O(buffer region) space. Then, once you have dequeued responses back up to the driver, you can iterate over the queued requests (intrusive linked list, maybe?) and recheck them; or make the exclusive monitor state a queue entry ID rather than a single bit, trading off storage space in a different dimension to avoid retrying. Hopefully retries should be relatively rare given we report optimal IO/buffer alignment.
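To make the first option concrete, here is a sketch of the query shape it implies, using a sorted list plus binary search as a stand-in for a real balanced tree (insertion here is O(n) because of list shifts, but the O(log n) overlap query is the point). All names are hypothetical, not from any existing library:

```python
import bisect

class RangeSet:
    """Sorted, non-overlapping [start, end) ranges; a tree stand-in."""

    def __init__(self):
        self.starts = []  # sorted range starts
        self.ends = []    # parallel range ends

    def overlaps(self, start: int, end: int) -> bool:
        """O(log n) check of [start, end) against all active ranges."""
        i = bisect.bisect_right(self.starts, start)
        # The predecessor range may extend past `start`...
        if i > 0 and self.ends[i - 1] > start:
            return True
        # ...and the successor range may begin before `end`.
        return i < len(self.starts) and self.starts[i] < end

    def insert(self, start: int, end: int) -> bool:
        """Claim the range, or report an overlap (caller defers the request)."""
        if self.overlaps(start, end):
            return False
        i = bisect.bisect_right(self.starts, start)
        self.starts.insert(i, start)
        self.ends.insert(i, end)
        return True

rs = RangeSet()
assert rs.insert(0, 4096)
assert not rs.insert(2048, 8192)  # overlaps [0, 4096): would be deferred
assert rs.insert(4096, 8192)      # merely adjacent, so it is fine
```

In a real implementation the entries would carry request IDs and live in statically allocated storage sized to the queue depth, as suggested above.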

@dreamliner787-9
Contributor Author

dreamliner787-9 commented Apr 7, 2026

I suspect it wouldn't work at all.

My understanding is that you are referring to sDDF Block's barrier "request", so that we can queue up overlapping reads and writes, then signal the virtualiser just once. But I don't understand what you meant by "wouldn't work" specifically: would overlapping requests still cause problems even if we enqueue them all with barriers?

Edit: resolved offline

Also, you combine a rewrite (bug fixes) with adding a feature (multi-segments) in one commit; that should really be two commits. You can leave in the skeleton functions used for multi-segments and make them only handle one segment, but they really shouldn't be in one commit: there are too many changes to tell them apart easily.

The reason I combined them is that the original code was already able to recognise multi-segment requests; it just didn't handle them correctly. Plus, enabling multi-segments is just a one-line change (changing VIRTIO_BLK_SEG_MAX), and experimenting with multi-segments led to the rewrite, so I just combined them. I can split them up though if that is more desired.

Edit: resolved offline, will split.

In my mind we need a mental model written down and documented somewhere before we merge this.

Yes, this is fair; I can write it down.

Edit: resolved offline.

As for "better" ways to implement this that isn't O(n²):

It was a concern when this logic was originally implemented in #160, but we didn't fix it then in order to get it merged quicker. I think there were also some concerns about statically allocating a structure whose size is tied to the underlying block device's size, since you would need to recompile the code if you increased your block device's size.

To properly fix it, I'm gravitating towards the "bitmap" approach, as it is simpler than a tree or other structures and comes with the benefit of being able to reuse sDDF's bitarray library. With the bitmap approach, the time complexity becomes O(nk) instead, where n is the number of requests and k is the number of 4k transfer windows across all the requests.

Edit: unresolved offline; a bookkeeping bitmap for 1 GiB of disk would need 32 KiB of memory, which is a fair amount. So it is unclear whether we would want to trade space for time with the bitmap approach, or complexity for time with a tree approach.

```python
>>> 1073741824 / 4096   # 1 GiB of disk, in 4 KiB granules
262144.0
>>> _ / 64              # as 64-bit bitmap words
4096.0
>>> _ * 8               # words to bytes: 32 KiB of bitmap
32768.0
```
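For what it's worth, the exclusive-monitor-style bitmap can be modelled in a few lines. This is a minimal sketch assuming one bit per 4 KiB granule; GranuleMonitor and its methods are hypothetical names for illustration, not sDDF's actual bitarray API:

```python
GRANULE = 4096  # bytes covered by one bit: the "exclusive granule"

class GranuleMonitor:
    """One bit per 4 KiB granule: 0 = open, 1 = closed (request in flight)."""

    def __init__(self, region_size: int):
        n_granules = region_size // GRANULE
        self.bits = bytearray((n_granules + 7) // 8)

    def _granules(self, offset: int, length: int):
        return range(offset // GRANULE, (offset + length - 1) // GRANULE + 1)

    def try_acquire(self, offset: int, length: int) -> bool:
        """Claim [offset, offset+length); cost is O(granules touched),
        not O(number of active requests)."""
        idx = list(self._granules(offset, length))
        if any(self.bits[i // 8] & (1 << (i % 8)) for i in idx):
            return False  # overlaps an in-flight request: caller must defer
        for i in idx:
            self.bits[i // 8] |= 1 << (i % 8)
        return True

    def release(self, offset: int, length: int):
        """Reopen the granules once the response has been dequeued."""
        for i in self._granules(offset, length):
            self.bits[i // 8] &= ~(1 << (i % 8))

m = GranuleMonitor(1 << 20)           # a 1 MiB region needs only 32 bytes of bits
assert m.try_acquire(0, 8192)         # claims granules 0 and 1
assert not m.try_acquire(4096, 4096)  # overlaps granule 1: would be deferred
m.release(0, 8192)
assert m.try_acquire(4096, 4096)      # free again after completion

# Space scales with the tracked region: 1 GiB at 4 KiB granularity
# needs 32 KiB of bitmap, matching the arithmetic above.
assert len(GranuleMonitor(1 << 30).bits) == 32 * 1024
```

The retry path (re-checking deferred requests once responses come back, e.g. via the intrusive linked list suggested in the review) is left out of this sketch.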

@dreamliner787-9 force-pushed the virtio_block_fixes branch 4 times, most recently from ed59e76 to 737b52c on April 8, 2026 at 05:28
@dreamliner787-9 force-pushed the virtio_block_fixes branch 4 times, most recently from ff9ebbe to 5307091 on April 9, 2026 at 04:57
Previously, virtio MMIO devices had a virtqueue size of 128 while
PCI had 256. Now we make them both 128 for consistency.

Signed-off-by: Bill Nguyen <bill.nguyen@unsw.edu.au>
Signed-off-by: Bill Nguyen <bill.nguyen@unsw.edu.au>
Also made the device more robust against the TX queue full condition,
and noted down an existing issue where requests larger than the TX
queue capacity aren't handled properly.

Signed-off-by: Bill Nguyen <bill.nguyen@unsw.edu.au>
Rewrote the majority of the code to make it easier to read and less
bug-prone. This made virtio block more resilient against buffer
overflows, and fixed a bug where the request body was undercounted by
one descriptor, leading to data corruption when there are multiple
descriptors in one request.

Signed-off-by: Bill Nguyen <bill.nguyen@unsw.edu.au>
Enabled multi-segment requests by increasing seg_max from 1 to 8.
Guests can now make requests up to 32 KiB in size, rather than 4 KiB
as before. This can be increased further, but risks running out of
space in the sDDF Block data region.

Signed-off-by: Bill Nguyen <bill.nguyen@unsw.edu.au>
By default the mountpoint `/mnt` always exists. So if the mount process
fails, it only prints an easy-to-miss message and can mislead the user
into thinking that they are using the virtio block device, when instead
they are just writing data into the ramdisk.

Signed-off-by: Bill Nguyen <bill.nguyen@unsw.edu.au>