raidz: fix space accounting after expansion#18324
Skountz wants to merge 1 commit into openzfs:master
Conversation
efca6c5 to 2b9d11e
Maybe it would be worthwhile to add some explicit tests of this functionality, to verify that the amount of data required to fill a pool is approximately equal to the remaining freespace that is reported? For example, create a pool and then do a series of large writes interspersed with raidz expansions, while verifying that the size of the write, the amount of freespace remaining, and the amount of space consumed before/after are all within some tolerance of each other? It might also be worthwhile doing a version where you do a ZFS rewrite each time in between expansions.
2b9d11e to e7b3074
@owlshrimp Good idea — added raidz_expand_009_pos (multi-expansion with writes between each step) and raidz_expand_010_pos (rewrites between expansions + fills pool to verify freespace accuracy).
I suppose it would be more correct to say that the reported freespace before vs. after should be approximately equal to the size of the write (with compression turned off, anyway). The various cases of "how much space should this have taken" with different stripe sizes make my head hurt though, especially if you delete old data. Deleting old data is probably another area where tests verifying expected behavior would be useful.
Mh, the two new tests cover both. Test 010 does delete+rewrite cycles between expansions, then fills the pool to verify reported freespace actually matches writable capacity. Let me know if you think that covers it or if you had something else in mind.
Reading through the testing strategies, I think this probably covers everything I can think of offhand. Maybe there are weird edge cases with mixes of pre- and post-expansion data, but 009 seems like at least a very good sanity check. What would we expect to happen to the total figures if some quantity of pre-expansion data is deleted? I presume the freespace would go up by a bit more than the amount of data deleted, due to the difference in parity ratio?
Right, pre-expansion blocks use more raw space per logical byte (worse parity ratio), so deleting them frees raw capacity that under the new geometry maps to more usable space. That's the convergence the commit message describes: as old blocks get deleted or rewritten, accounting gradually aligns with the new geometry. |
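For intuition, here is a toy model of the ratio arithmetic (my own sketch, not the OpenZFS implementation: it ignores the skip/padding sectors that the real vdev_psize_to_asize() accounts for, and the toy_* names are hypothetical):

```c
#include <stdint.h>

#define SPA_MINBLOCKSHIFT 9 /* 512-byte sectors, as in OpenZFS */

/*
 * Toy model of RAIDZ asize inflation for a given psize: data sectors
 * plus nparity parity sectors per logical stripe row. The real
 * vdev_psize_to_asize() also adds skip/padding sectors, ignored here.
 */
static uint64_t
toy_raidz_asize(uint64_t psize, int ndisks, int nparity)
{
	uint64_t dsectors = psize >> SPA_MINBLOCKSHIFT;
	uint64_t ndata = ndisks - nparity;
	uint64_t rows = (dsectors + ndata - 1) / ndata; /* ceil */

	return ((dsectors + rows * nparity) << SPA_MINBLOCKSHIFT);
}

/*
 * Same shape as the ratio in vdev_set_deflate_ratio(): a 128K (1 << 17)
 * reference block, so a ratio of 512 means "no inflation".
 */
static uint64_t
toy_deflate_ratio(int ndisks, int nparity)
{
	return ((1 << 17) /
	    (toy_raidz_asize(1 << 17, ndisks, nparity) >> SPA_MINBLOCKSHIFT));
}
```

Under these assumptions, expanding RAIDZ2 from 4 to 5 disks moves the toy ratio from 256 to 306, roughly the ~20% gain in usable capacity per unit of raw storage discussed here.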
I suppose every time you do an expansion (with these changes) the pre-expansion blocks belonging to files and snapshots look like they consume slightly more usable space than expected for their logical size. Have you double-checked that things like the amount of space reported as consumed by old snapshots whose files have been deleted from the current view make sense in some way? (eg.
Yes, pre-expansion blocks genuinely cost more raw space per logical byte (worse parity ratio), so that's expected and correct. Good call on snapshots though; I don't have explicit test coverage for that. I'll add one just after dinner (UTC+1 here!)
e7b3074 to 70330f7
Also added an export/import check to test 008 — vdev_deflate_ratio_current isn't persisted on disk; it's recomputed during vdev_open() on import. So if that recomputation ever regressed, all accounting would silently break after a reboot or pool reimport. It's now verified to survive the cycle.
I guess every time you delete a pre-expansion file that exists in a snapshot, the snapshot grows by the same slightly-more-than-logical amount of raw size that the file used to occupy?
70330f7 to c5f8bf9
Yes, exactly: the snapshot's used increases by the dsize of those blocks, which uses the old deflation ratio (via the birth txg). So it reflects the actual raw cost of holding pre-expansion data. Test 011 already verifies this.
Assuming one rewrote all the data (and didn't keep any snapshots), are there any lingering differences between an expanded array and one that was constructed natively, given this patchset?
No, from this patchset's perspective there are no lingering differences. Once all data is rewritten under the new geometry, the per-block deflation ratios all reflect the current layout (identical to native), and the SPA-level capacity correction produces the same reported size as a natively constructed array. The expansion feature itself does leave persistent metadata on disk (e.g., the raidz_expansion feature flag and expansion history), but that is unrelated to this patchset and does not affect space accounting or I/O behavior.
68c0a4d to 0ecea7f
This also follows up on the "time-dependent deflate ratio" concept originally described by @ahrens in #12225 (comment)
  uint64_t birth = ddt_phys_birth(ddp, v);
  for (int d = 0; d < ndvas; d++)
-     dsize += dva_get_dsize_sync(spa, &dvas[d]);
+     dsize += dva_get_dsize_sync(spa, &dvas[d], birth);
This might be a bit controversial. IIRC DDT can add more copies to existing records, but there is only one ddp_phys_birth there. I guess the DDT stats may survive this, but it may need to be considered for other places.
Added a comment. I checked the other callers of dva_get_dsize_sync() and it's only bp_get_dsize_sync()/bp_get_dsize() which use BP_GET_PHYSICAL_BIRTH, so the DDT stats path is the only place with this approximation.
Since ddt_phys_extend() preserves the original birth when adding copies, the approximation only affects DDT statistics, not persistent accounting.
module/zfs/spa.c
Outdated
int64_t new_ds = (space >>
    SPA_MINBLOCKSHIFT) *
    vd->vdev_deflate_ratio_current;
usable += (new_ds - old_ds);
You are correcting reported pool stats, but not the individual allocation classes above.
This also makes me wonder what is going to be corrected. Looking at this, I suppose the metaslab stats are not (hopefully not some mix), but looking at dsl_dataset_block_born() using bp_get_dsize_sync(), I suppose the dataset stats will be. And in case an expansion happened before this patch, it will get some wrong mix of corrected and uncorrected values. I think this change might need a pool feature flag, tracking since what TXG we use corrected space usage.
Fixed the per-class stats. The correction is now computed upfront and passed to spa_prop_add_metaslab_class(), so NORMAL_USABLE/NORMAL_AVAILABLE also get corrected.
On the feature flag: you raise a fair point. For blocks born post-expansion but before this patch, the old code used the txg-0 ratio at allocation time, while the new code would compute a different ratio at free time based on the birth txg.
That's a mismatch for those specific blocks. However for pre-expansion blocks there's no issue since the birth-txg lookup returns the original geometry anyway. The mismatch is bounded to blocks written between the expansion and the patch, and self-corrects as those blocks get rewritten. Do you think that's acceptable or would you prefer a feature flag to track the boundary?
Every time you allocate with one deflation ratio and free with another, you create a leak in absolute bytes instead of percents. While the latter is compensated and only a cosmetic issue, the former I think will result in either unusable space or pool overflow. Neither of those is acceptable.
My main question actually was: can we properly correct metaslab stats instead of this sketchy correction?
Right, I didn't think that through. The byte leak in that transition window is a real issue. When you say "properly correct metaslab stats", do you mean updating vdev_deflate_ratio and the accumulated dspace at expansion completion, rather than the post-hoc correction?
I am not sure we can really fix already-expanded vdevs, so my thinking is towards adding a new feature, using different (variable) deflate ratios since that TXG, and allowing rewrites to fix the accounting of old data over time.
Thanks for the feedback! Added the raidz_expansion_accounting feature flag as you suggested.
It records the activation txg via enabled_txg and gates per-block ratio tracking in vdev_get_deflate_ratio() so blocks born before the feature use the legacy fixed ratio. No more born/free mismatch risk.
Also added raidz_expand_012_pos to test the feature lifecycle and born/free leak detection.
Looking forward to your thoughts; this feels like the last piece of the puzzle to make it right!
int64_t new_ds = (space >>
    SPA_MINBLOCKSHIFT) *
    vd->vdev_deflate_ratio_current;
spa->spa_rdspace += (new_ds - old_ds);
spa_rdspace above is calculated based only on the normal class, while here you correct for all classes.
Good catch, added a vd->vdev_mg->mg_class filter. Same in spa_prop_get_config().
 * This block was born after an expansion changed the geometry.
 * Compute the ratio from the birth-txg-specific asize.
 */
return ((1 << 17) / (asize_at_birth >> SPA_MINBLOCKSHIFT));
Aren't you overengineering something here?
Simplified the function. For non-expanded vdevs it now returns the cached ratio immediately. The dynamic computation via vdev_psize_to_asize_txg() is still there for expanded RAIDZ because there can be multiple expansions with different intermediate geometries (the AVL tree can have several entries), so we can't just pick between the two cached ratios. But if you think supporting only old vs current is enough in practice, that would simplify things further. Let me know what you think.
(Edit: I reverted the simplification — the code is back to the version you commented. I tried adding a fast path using vdev_deflate_ratio_current to skip the AVL lookups for non-expanded vdevs, but it caused hangs during expansion tests. Even with the asize_at_birth == asize_at_txg0 comparison preserved, just adding the vdev_deflate_ratio_current check was enough to trigger the issue. I haven't been able to pinpoint the root cause yet. If you have a simpler approach in mind, I'm open to it.)
We have to support all the ratios that existed there to make the accounting match.
Got it, thanks for confirming. So the code already covers this, then?
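To make the per-birth-txg lookup concrete, here is a hypothetical sketch of the idea only (the actual patch walks the reflow_node_t AVL tree via vdev_psize_to_asize_txg(); the toy_geom_t history array and all names below are illustrative, not the real code):

```c
#include <stdint.h>

/*
 * Toy stand-in for the expansion history: each entry records the txg at
 * which a geometry became current and the deflate ratio under it.
 * Entries are sorted by txg, ascending.
 */
typedef struct {
	uint64_t txg;   /* txg at which this geometry became current */
	uint64_t ratio; /* deflate ratio under that geometry */
} toy_geom_t;

/*
 * Pick the latest geometry whose txg is <= the block's physical birth
 * txg, so blocks written under an old geometry keep their old ratio.
 */
static uint64_t
toy_get_deflate_ratio(const toy_geom_t *hist, int n, uint64_t birth_txg)
{
	uint64_t ratio = hist[0].ratio;

	for (int i = 1; i < n; i++) {
		if (hist[i].txg <= birth_txg)
			ratio = hist[i].ratio;
	}
	return (ratio);
}
```

With a history of {txg 0: ratio 256, txg 1000: ratio 306} (say, a RAIDZ2 4-to-5 expansion completing at txg 1000), a block physically born at txg 500 is deflated with the old ratio and one born at txg 2000 with the new one, which is what keeps alloc/free accounting balanced per block.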
bp_get_dsize(spa_t *spa, const blkptr_t *bp)
{
	uint64_t dsize = 0;
	uint64_t birth_txg = BP_GET_PHYSICAL_BIRTH(bp);
This made me think about what time should be used here. On one hand, dataset space accounting is done at BP_GET_BIRTH(bp), IIRC. On the other, in case of dedup/cloning, deflation might be from before the expansion, as pointed to by the physical time. In case of remap, physical birth might be ahead of the logical, but we don't have raidz removal now...
Added a comment. I went with physical birth because the deflation ratio should reflect the geometry when the DVAs were actually allocated. For dedup/clones the logical birth can be newer but the underlying DVAs still live in whatever geometry existed at physical birth, so I think it's correct.
As you say the remap case isn't a concern today, but would need to be revisited if raidz removal ever becomes a thing. Let me know if you see a case where this breaks down.
but the underlying DVAs still live in whatever geometry existed at physical birth, so I think it's correct
Right. We just need to make sure that dataset accounting uses the same TXG, so that the numbers match.
but would need to be revisited if raidz removal ever becomes a thing
Thinking like that may result in RAIDZ removal never being implemented, as it becomes even more complicated. We should at least try to think ahead.
Apologies, that came off wrong (my English is far from perfect!).
In the current code, both born and kill use BP_GET_PHYSICAL_BIRTH so the ratio matches for a given BP.
But if remap changes the physical birth to a post-expansion txg for a block originally born pre-expansion, that would create a born/free mismatch.
@amotin You were right to push on this. After a few sleepless nights and ~20 VM resets from kernel crashes, I have a working RAIDZ top-level vdev removal branch (with tests!) that handles the accounting interaction properly. This PR is a blocker for it, I'll push once this is merged, let's keep the (hopefully) good surprise for later.
How are you handling the reduction in redundancy (afaik) that has made RAIDZ removal impossible up until now? (seeing as removal works by mapping the vdev's blocks and their snapshots to virtual equivalents on other vdevs)
This is me guessing what the problem was from a barely-remembered second-hand description, but I think the issue was somewhere along the lines of: "if you have a 1+5 wide raidz removed and turned into a virtual device on a 1+2 wide raidz, then losing one disk in the new 1+2 raidz loses two blocks per stripe from the (now virtual) 1+5, killing its ability to be read even though the underlying 1+2 still works."
Good question again! What you describe is essentially RAIDZ shrinking, which amotin himself stated is "impossible even theoretically" in the original issue (#9013). The destination vdevs can't be RAIDZ at all; spa_vdev_remove_top_check() already rejects removal if any destination vdev has parity, so the remapped blocks land on mirrors or singletons.
The actual hard part was that the existing removal code copies raw byte segments, which doesn't work for RAIDZ because you need block boundaries to read through the parity layout. The solution builds a reverse map of block boundaries first, then copies each block individually through the RAIDZ I/O path.
(But that's another story, let's not make too much noise about it on this PR :) Feel free to DM me if you want more details!)
This raises a different problem: you can only remove a raidz from an array where it is the one and ONLY raidz. Unless you only allow allocation on vdevs that are non-raidz out of a mix of {raidz, mirror, singleton}, and even then you may make the fullness of the vdevs in the array very unbalanced. I think anyraid ran into a balance problem like this.
I could see removal *to raidz vdevs* being possible only in the extremely undesirable situation where blocks in the virtual raidz were duplicated such that the copies could collectively tolerate any combination of underlying raidz members being lost (bringing a heavy space penalty for removal). There is, IIRC, some support for redundant data even on singletons, but afaik this would largely be entirely new machinery and may not play well with the way block pointers work.
Not quite, you can have multiple RAIDZ vdevs and remove one, as long as there are mirror/singleton destinations.
But let's save the rest for the removal PR.
73cbd85 to 28b6f95
66dc636 to b7f308a
As a side note, do these patches handle vertical expansion of an array, i.e., going from 4TB drives to 6TB drives?
No, vertical expansion (replacing drives with larger ones) is a completely separate process from RAIDZ expansion.
I simply mean: is that change in size another thing that might need to be accounted for in the corrections this patchset calculates, as a potential source of bugs?
b7f308a to ba1ccdd
Good question, but no: drive replacement doesn't change the deflation ratio.
43a6a78 to 86335d8
After RAIDZ expansion, the deflation ratio used for space accounting remains at the original (pre-expansion) geometry, because vdev_set_deflate_ratio() intentionally hard-codes txg 0 to avoid inconsistently accounting for existing blocks. This causes zpool list to underreport usable capacity. For example, expanding a 4-disk RAIDZ2 to 5 disks improves the data-to-parity ratio from 2/4 to 3/5, a ~20% increase in usable capacity per unit of raw storage, but the reported capacity does not reflect this improvement.

Fix this in three parts:

1. Add vdev_deflate_ratio_current to vdev_t, computed using UINT64_MAX (current geometry). Apply a dspace correction in spa_update_dspace() and spa_prop_get_config() using the difference between the two ratios, so that reported pool capacity reflects the current parity layout. The correction is scoped to the normal metaslab class (where RAIDZ vdevs reside) and is applied to both per-class and aggregate stats. The original txg-0 ratio is preserved in vdev_deflated_space() for self-consistent metaslab accounting and persistent DN_USED_BYTES tracking.

2. Add vdev_get_deflate_ratio(vd, birth_txg), which returns the correct deflation ratio for a block based on its physical birth txg. For non-expanded vdevs the cached ratio is returned immediately; for expanded RAIDZ, the geometry at birth is looked up via vdev_psize_to_asize_txg() and the existing reflow_node_t expansion history AVL tree. Modify dva_get_dsize_sync() to accept a birth_txg parameter and bp_get_dsize_sync() to use BP_GET_PHYSICAL_BIRTH(), which reflects the actual on-disk allocation geometry even for dedup/clone blocks whose logical birth may differ.

3. Add the raidz_expansion_accounting feature flag, which records the txg at which per-block deflation ratio tracking was enabled. Blocks born before this txg use the legacy fixed ratio (matching how they were originally accounted); blocks born at or after it use the per-birth-txg ratio. This prevents born/free accounting mismatches for blocks written between a pre-patch expansion and the first expansion with this code. The feature is activated in vdev_raidz_attach_sync() on the first expansion, and its enabled_txg is cached in spa_raidz_expand_acct_txg for efficient runtime lookup.

The three changes together maintain self-consistency: metaslab accounting uses the original ratio (alloc/free always balance), capacity reporting is corrected at the spa level, and per-block accounting uses stable birth-txg ratios. No negative overflows or positive leaks are possible. As old blocks are rewritten to the new geometry, the accounting converges to exact values.

Signed-off-by: Skountz <dev@frenchbytes.fr>
86335d8 to f759645
Motivation and Context
After RAIDZ expansion, usable capacity is severely underreported because the deflation ratio (which converts raw physical space to logical usable space) remains stuck at the original pre-expansion geometry. Both zfs list -o available and zpool list -o usable,available show values based on the old, less efficient parity layout.
Fixes #17784
Fixes #18199
Description
vdev_set_deflate_ratio() hard-codes txg 0 to compute the ratio. After expansion, new writes use a more efficient geometry, but the accounting never reflects this.
Fix in three parts:
1. Add vdev_deflate_ratio_current (computed with UINT64_MAX) alongside the existing vdev_deflate_ratio (txg 0). Apply a capacity correction in spa_update_dspace() (for zfs list available space, quota enforcement, and space reservations) and in spa_prop_get_config() (for ZPOOL_PROP_USABLE / ZPOOL_PROP_AVAILABLE), using the difference between the two ratios, so that reported usable capacity reflects the current parity layout. The original txg-0 ratio is preserved in vdev_deflated_space() for self-consistent metaslab accounting and persistent DN_USED_BYTES tracking. Note: zpool list default columns (SIZE, ALLOC, FREE) show raw physical space and are not deflation-adjusted; this is by design and unchanged.
2. Add vdev_get_deflate_ratio(vd, birth_txg) for per-block deflation using BP_GET_PHYSICAL_BIRTH(). Pass the birth txg through dva_get_dsize_sync() and bp_get_dsize_sync(). Reuses the existing reflow_node_t AVL tree infrastructure.
3. Add the raidz_expansion_accounting feature flag (depends on enabled_txg). It records the txg at which per-block deflation ratio tracking was enabled. Blocks born before that txg use the legacy fixed ratio (matching how they were originally accounted); blocks born at or after it use the per-birth-txg ratio. Activated in vdev_raidz_attach_sync() on the first expansion. Prevents born/free accounting mismatches for blocks written between a pre-patch expansion and the first expansion with this code.
Self-consistency is preserved: metaslab accounting uses the original ratio (alloc/free always balance), capacity reporting is corrected at the spa level, and per-block accounting uses stable birth-txg ratios. No negative overflows or positive leaks.
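As a back-of-the-envelope sanity check on the capacity figures (a toy sketch with equal-size disks, not pool code; both helper names are hypothetical and metadata overhead and slop space are ignored):

```c
/*
 * With equal-size disks, RAIDZ raw space scales with the disk count,
 * while usable space scales with (disks - parity). These toy helpers
 * compute the percentage increase of each across an expansion.
 */
static int
raw_pct_increase(int disks_before, int disks_after)
{
	return ((disks_after - disks_before) * 100 / disks_before);
}

static int
usable_pct_increase(int disks_before, int disks_after, int nparity)
{
	int data_before = disks_before - nparity;
	int data_after = disks_after - nparity;

	return ((data_after - data_before) * 100 / data_before);
}
```

Expanding RAIDZ2 from 4 to 5 disks adds 25% raw space but 50% usable space (data disks go from 2 to 3), which is why the test below expects the reported capacity gain (~50%) to exceed the raw disk addition (25%).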
How Has This Been Tested?
make checkstyle clean.
raidz_expand_008_pos: Creates RAIDZ2-4, writes data, expands to 5 disks, verifies the usable capacity increase (~50%) exceeds the raw 25% disk addition (confirming the deflation ratio correction is applied to ZPOOL_PROP_USABLE), verifies the zfs dataset available space also reflects the correction (exercising the spa_update_dspace() code path), checks alloc + free ≈ size consistency, and verifies all accounting survives an export/import cycle.
raidz_expand_009_pos: Multiple sequential expansions (RAIDZ1, 3→4→5→6 disks) with writes interspersed between each. Verifies that consumed space matches write size, that the free space decrease tracks the consumed increase, and that alloc + free ≈ size holds at every step.
raidz_expand_010_pos: Rewrite-between-expansions test (RAIDZ1, 3→4→5). Deletes and rewrites all data after each expansion to force blocks from old geometry to new. Then fills the pool to capacity and verifies that total data written matches reported freespace within a 30% tolerance.
raidz_expand_011_pos: Snapshot accounting across expansions. Takes snapshots before/between two expansions, deletes files from the active dataset, and verifies that snapshot referenced/used sizes remain stable and consistent. Also verifies pool-level accounting holds after snapshot destruction.
raidz_expand_012_pos: Feature flag lifecycle and born/free leak detection. Verifies raidz_expansion_accounting activates on expansion, survives export/import, and that deleting data written under both old and new geometries returns both pool-level allocated and per-dataset referenced to near-zero (no born/free accounting leak via bp_get_dsize_sync). Also tests a second expansion and stress write/delete cycles.
Types of changes
Note: the spa_feature_t enum changed (new SPA_FEATURE_RAIDZ_EXPANSION_ACCOUNTING), which updates the spa_feature_table size and the libzfs.abi / libzfs_core.abi files accordingly. The dva_get_dsize_sync() signature changed and vdev_get_deflate_ratio() was added, but these are kernel-internal symbols (EXPORT_SYMBOL) only.
Checklist: