
raidz: fix space accounting after expansion#18324

Open
Skountz wants to merge 1 commit into openzfs:master from Skountz:fix/raidz-expansion-space-accounting

Conversation

@Skountz

@Skountz Skountz commented Mar 14, 2026

Motivation and Context

After RAIDZ expansion, usable capacity is severely underreported because the deflation ratio (which converts raw
physical space to logical usable space) remains stuck at the original pre-expansion geometry. Both zfs list -o available and zpool list -o usable,available show values based on the old, less efficient parity layout.

Fixes #17784
Fixes #18199

Description

vdev_set_deflate_ratio() hard-codes txg 0 to compute the ratio. After expansion, new writes use a more efficient
geometry but the accounting never reflects this.

Fix in three parts:

  1. Add vdev_deflate_ratio_current (computed with UINT64_MAX) alongside the existing vdev_deflate_ratio (txg 0). Apply a capacity correction in spa_update_dspace() (for zfs list available space, quota enforcement, and space reservations) and in spa_prop_get_config() (for ZPOOL_PROP_USABLE / ZPOOL_PROP_AVAILABLE), using the difference between the two ratios, so that reported usable capacity reflects the current parity layout.
    The original txg-0 ratio is preserved in vdev_deflated_space() for self-consistent metaslab accounting and persistent DN_USED_BYTES tracking.
    Note: zpool list default columns (SIZE, ALLOC, FREE) show raw physical space and are not deflation-adjusted; this is by design and unchanged.

  2. Add vdev_get_deflate_ratio(vd, birth_txg) for per-block deflation using BP_GET_PHYSICAL_BIRTH(). Pass birth txg through dva_get_dsize_sync() and bp_get_dsize_sync(). Reuses existing reflow_node_t AVL tree infrastructure.

  3. Add the raidz_expansion_accounting feature flag (depends on enabled_txg). Records the txg at which per-block deflation ratio tracking was enabled. Blocks born before that txg use the legacy fixed ratio (matching how they were originally accounted); blocks born at or after use the per-birth-txg ratio. Activated in vdev_raidz_attach_sync() on the first expansion. Prevents born/free accounting mismatches for blocks written between a pre-patch expansion and the first expansion with this code.
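The per-birth-txg lookup in part 2 can be sketched as follows. This is an illustrative toy model only: the structures and helpers below are simplified stand-ins, not the actual OpenZFS types; the real code walks the reflow_node_t AVL tree and calls vdev_psize_to_asize_txg().

```c
/*
 * Illustrative sketch of vdev_get_deflate_ratio(vd, birth_txg).
 * All structures and helpers here are simplified stand-ins, not the
 * actual OpenZFS implementation.
 */
#include <stdint.h>

#define	SPA_MINBLOCKSHIFT	9		/* 512-byte sectors */
#define	DEFLATE_PROBE		(1ULL << 17)	/* 128K probe block */

/* Stand-in for one expansion-history entry: geometry valid from start_txg. */
typedef struct {
	uint64_t start_txg;
	uint64_t ndisks;
	uint64_t nparity;
} geom_t;

/*
 * Rough model of vdev_psize_to_asize_txg() for raidz: data sectors plus
 * parity for each logical row of data disks.
 */
static uint64_t
psize_to_asize(const geom_t *g, uint64_t psize)
{
	uint64_t ndata = g->ndisks - g->nparity;
	uint64_t sectors = psize >> SPA_MINBLOCKSHIFT;
	uint64_t rows = (sectors + ndata - 1) / ndata;

	return ((sectors + rows * g->nparity) << SPA_MINBLOCKSHIFT);
}

/*
 * Pick the geometry in effect at the block's physical birth txg (the
 * AVL lookup in the real code; a linear scan here) and derive the ratio
 * the same way vdev_set_deflate_ratio() does.
 */
static uint64_t
get_deflate_ratio(const geom_t *hist, int nhist, uint64_t birth_txg)
{
	const geom_t *g = &hist[0];

	for (int i = 1; i < nhist; i++) {
		if (hist[i].start_txg <= birth_txg)
			g = &hist[i];
	}
	return (DEFLATE_PROBE /
	    (psize_to_asize(g, DEFLATE_PROBE) >> SPA_MINBLOCKSHIFT));
}
```

With a RAIDZ2 expanded from 4 to 5 disks, this model yields ratios of 256 and 306 for pre- and post-expansion blocks, matching the roughly 20% capacity gain of the wider layout.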

Self-consistency is preserved: metaslab accounting uses the original ratio (alloc/free always balance), capacity
reporting is corrected at the spa level, and per-block accounting uses stable birth-txg ratios. No negative overflows
or positive leaks.
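To make the part-1 correction concrete, here is a minimal model. The ratio values 256 and 306 are hypothetical constants for a RAIDZ2 expanded from 4 to 5 disks; only the arithmetic mirrors the PR, not the surrounding spa_update_dspace() code.

```c
/*
 * Sketch of the spa_update_dspace() capacity correction from part 1.
 * Field names mirror the PR, but this is a simplified model, not the
 * actual OpenZFS code.
 */
#include <stdint.h>

#define	SPA_MINBLOCKSHIFT	9

/*
 * Deflate ratios encode deflated bytes per 512 bytes of raw space, so
 * 512 would mean no parity overhead.  256 and 306 correspond to a
 * RAIDZ2 at 4 and 5 disks respectively (hypothetical values, derived
 * the same way vdev_set_deflate_ratio() derives them).
 */
static const uint64_t vdev_deflate_ratio = 256;		/* txg-0 geometry */
static const uint64_t vdev_deflate_ratio_current = 306;	/* after expansion */

/* Additional usable bytes to report for `space` raw bytes of the vdev. */
static int64_t
dspace_correction(uint64_t space)
{
	int64_t old_ds = (int64_t)((space >> SPA_MINBLOCKSHIFT) *
	    vdev_deflate_ratio);
	int64_t new_ds = (int64_t)((space >> SPA_MINBLOCKSHIFT) *
	    vdev_deflate_ratio_current);

	return (new_ds - old_ds);
}
```

For 1 TiB of raw vdev space this reports an extra 100 GiB of usable capacity, the difference between the 4-disk and 5-disk RAIDZ2 data-to-parity ratios.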

How Has This Been Tested?

  • Full build on Ubuntu 24.04, make checkstyle clean.
  • All existing raidz_expand tests (001-007) pass.
  • Added five new ZTS tests:
    • raidz_expand_008_pos: Creates RAIDZ2-4, writes data, expands to 5 disks, verifies usable capacity increase (~50%) exceeds the raw 25% disk addition (confirming the deflation ratio correction is applied to ZPOOL_PROP_USABLE), verifies zfs dataset available also reflects the correction (exercises the spa_update_dspace() code path), checks alloc+free≈size consistency, and verifies all accounting survives an export/import cycle.
    • raidz_expand_009_pos: Multiple sequential expansions (RAIDZ1 3→4→5→6 disks) with writes interspersed between each. Verifies that consumed space matches write size, free space decrease tracks consumed increase, and alloc+free≈size holds at every step.
    • raidz_expand_010_pos: Rewrite-between-expansions test (RAIDZ1 3→4→5). Deletes and rewrites all data after each expansion to force blocks from old geometry to new. Then fills the pool to capacity and verifies that total data written matches reported freespace within 30% tolerance.
    • raidz_expand_011_pos: Snapshot accounting across expansions. Takes snapshots before/between two expansions, deletes files from the active dataset, and verifies that snapshot referenced/used sizes remain stable and consistent. Also verifies pool-level accounting holds after snapshot destruction.
    • raidz_expand_012_pos: Feature flag lifecycle and born/free leak detection. Verifies raidz_expansion_accounting activates on expansion, survives export/import, and that deleting data written under both old and new geometries returns both pool-level allocated and per-dataset referenced to near-zero (no born/free accounting leak via bp_get_dsize_sync). Also tests second expansion and stress write/delete cycles.
  • Tested on Debian 13 VM with ZFS kernel modules loaded.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)

Note: spa_feature_t enum changed (new SPA_FEATURE_RAIDZ_EXPANSION_ACCOUNTING), which updates spa_feature_table size and the libzfs.abi/libzfs_core.abi files accordingly. dva_get_dsize_sync() signature changed and vdev_get_deflate_ratio() was added, but these are kernel-internal symbols (EXPORT_SYMBOL) only.

Checklist:

  • My code follows the OpenZFS code style requirements.
  • I have read the contributing document.
  • I have added tests to cover my changes.
  • I have run the ZFS Test Suite with this change applied.
  • All commit messages are properly formatted and contain Signed-off-by.

@Skountz Skountz force-pushed the fix/raidz-expansion-space-accounting branch 5 times, most recently from efca6c5 to 2b9d11e on March 14, 2026 at 19:08
@Skountz
Author

Skountz commented Mar 14, 2026

@amotin — following your feedback on #18170, I've implemented the per-block birth txg approach you described, using BP_GET_PHYSICAL_BIRTH() and the existing reflow_node_t AVL tree. Metaslab self-consistency is preserved.
Would appreciate your thoughts!

@owlshrimp

Maybe it would be worthwhile to add some explicit tests of this functionality, to verify that the amount of data required to fill a pool is approximately equal to the remaining freespace that is reported?

For example, create a pool and then do a series of large writes interspersed with raidz expansions, while verifying that the size of the write, the amount of freespace remaining, and the amount of space consumed before/after are all within some tolerance of each other? It might also be worthwhile doing a version where you do a ZFS rewrite each time in between expansions.

@Skountz Skountz force-pushed the fix/raidz-expansion-space-accounting branch from 2b9d11e to e7b3074 on March 15, 2026 at 16:06
@Skountz
Author

Skountz commented Mar 15, 2026

@owlshrimp Good idea — added raidz_expand_009_pos (multi-expansion with writes between each step) and raidz_expand_010_pos (rewrites between expansions + fills pool to verify freespace accuracy).
All passing locally!

@owlshrimp

...verifying that the size of the write, the amount of freespace remaining, and the amount of space consumed before/after are all within some tolerance of each other?

I suppose it would be more correct to say that the reported freespace before vs after should be approximately equal to the size of the write (with compression turned off, anyway). The various cases of "how much space should this have taken" with different stripe sizes make my head hurt though, especially if you delete old data. Deleting old data is probably another area where tests verifying expected behavior would be useful.

@Skountz
Author

Skountz commented Mar 15, 2026

Mh, the two new tests cover both.
Test 009 does multi-expansion with writes at each step, checking that freespace decrease tracks allocated increase (not the logical write size directly — on RAIDZ those differ due to parity).

Test 010 does delete+rewrite cycles between expansions, then fills the pool to verify reported freespace actually matches writable capacity.

Let me know if you think that covers it or if you had something else in mind.

@owlshrimp

owlshrimp commented Mar 15, 2026

Let me know if you think that covers it or if you had something else in mind.

Reading through the testing strategies, I think this probably covers everything I can think of offhand. Maybe there are weird edge cases with mixes of pre- and post-expansion data, but 009 seems like at least a very good sanity check.

What would we expect to happen to the total figures if some quantity of pre-expansion data is deleted? I presume the freespace would go up by a bit more than the amount of data deleted, due to the difference in parity ratio?

@Skountz
Author

Skountz commented Mar 15, 2026

What would we expect to happen to the total figures if some quantity of pre-expansion data is deleted? I presume the freespace would go up by a bit more than the amount of data deleted, due to the difference in parity ratio?

Right, pre-expansion blocks use more raw space per logical byte (worse parity ratio), so deleting them frees raw capacity that under the new geometry maps to more usable space. That's the convergence the commit message describes: as old blocks get deleted or rewritten, accounting gradually aligns with the new geometry.
Test 010 exercises exactly that path.

@owlshrimp

I suppose every time you do an expansion (with these changes) the pre-expansion blocks belonging to files and snapshots look like they consume slightly more usable space than expected for their logical size.

Have you double-checked that things like the amount of space reported as consumed by old snapshots, whose files have been deleted from the current view, make sense in some way (e.g. in zfs list -t all)? Just thinking of potential edge cases.

@Skountz
Author

Skountz commented Mar 15, 2026

Yes, pre-expansion blocks genuinely cost more raw space per logical byte (worse parity ratio), so that's expected and correct.
The accounting stays self-consistent because bp_get_dsize_sync() always uses the block's physical birth txg to look up the deflation ratio, so a given block always produces the same dsize regardless of when you compute it.

Good call on snapshots though, I don't have explicit test coverage for that. I'll add one just after dinner (UTC+1 here!)
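The self-consistency argument above can be reduced to a toy model (hypothetical names and a hard-coded two-geometry history; not OpenZFS code): dsize is a pure function of the block's asize and physical birth txg, so the value computed at born time and at free time is necessarily identical.

```c
/*
 * Minimal model of the born/free self-consistency argument.  Names and
 * the ratio table are illustrative stand-ins, not OpenZFS code.
 */
#include <stdint.h>

#define	SPA_MINBLOCKSHIFT	9

/*
 * Deflate ratio in effect at a given birth txg.  Toy two-geometry
 * history: a RAIDZ2 at 4 disks until txg 1000, 5 disks afterwards.
 */
static uint64_t
ratio_at_birth(uint64_t birth_txg)
{
	return (birth_txg < 1000 ? 256 : 306);
}

/*
 * Analog of the per-DVA part of bp_get_dsize_sync(): no "current time"
 * parameter exists, so the result can never drift between the born
 * and free computations for the same block.
 */
static uint64_t
dva_dsize(uint64_t asize, uint64_t birth_txg)
{
	return ((asize >> SPA_MINBLOCKSHIFT) * ratio_at_birth(birth_txg));
}
```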

@Skountz Skountz force-pushed the fix/raidz-expansion-space-accounting branch from e7b3074 to 70330f7 on March 15, 2026 at 21:07
@Skountz
Author

Skountz commented Mar 15, 2026

Also added an export/import check to test 008 — vdev_deflate_ratio_current isn't persisted on disk, it's recomputed during vdev_open() on import. So if that recomputation ever regressed, all accounting would silently break after a reboot or pool reimport. Now verified it survives the cycle.

@owlshrimp

I guess every time you delete a pre-expansion file that exists in a snapshot, the snapshot grows by the same slightly-more-than-logical amount of raw size as the file used to occupy?

@Skountz Skountz force-pushed the fix/raidz-expansion-space-accounting branch from 70330f7 to c5f8bf9 on March 15, 2026 at 21:31
@Skountz
Author

Skountz commented Mar 15, 2026

Yes exactly, the snapshot's used increases by the dsize of those blocks, which uses the old deflation ratio (via birth txg). So it reflects the actual raw cost of holding pre-expansion data. And test 011 already verifies this.

@owlshrimp

Assuming one rewrote all the data (and didn't keep any snapshots) are there any lingering differences between an expanded array and one that was constructed natively, given this patchset?

@Skountz
Author

Skountz commented Mar 15, 2026

No, from this patchset's perspective there are no lingering differences. Once all data is rewritten under the new geometry, the per-block deflation ratios all reflect the current layout (identical to native), and the SPA-level capacity correction produces the same reported size as a natively constructed array.

The expansion feature itself does leave persistent metadata on disk (e.g., the raidz_expansion feature flag, expansion history), but that is unrelated to this patchset and does not affect space accounting or I/O behavior.

@Skountz Skountz force-pushed the fix/raidz-expansion-space-accounting branch 2 times, most recently from 68c0a4d to 0ecea7f on March 16, 2026 at 13:12
@Skountz
Author

Skountz commented Mar 16, 2026

This also follows up on the "time dependent deflate ratio" concept originally described by @ahrens in #12225 (comment)

 	uint64_t birth = ddt_phys_birth(ddp, v);
 	for (int d = 0; d < ndvas; d++)
-		dsize += dva_get_dsize_sync(spa, &dvas[d]);
+		dsize += dva_get_dsize_sync(spa, &dvas[d], birth);
Member

This might be a bit controversial. IIRC DDT can add more copies to existing records, but there is only one ddp_phys_birth there. I guess the DDT stats may survive this, but it may need to be considered for other places.

Author

Added a comment. I checked the other callers of dva_get_dsize_sync() and it's only bp_get_dsize_sync()/bp_get_dsize() which use BP_GET_PHYSICAL_BIRTH, so the DDT stats path is the only place with this approximation.
Since ddt_phys_extend() preserves the original birth when adding copies, the approximation only affects DDT statistics, not persistent accounting.

module/zfs/spa.c (outdated)

	int64_t new_ds = (space >> SPA_MINBLOCKSHIFT) *
	    vd->vdev_deflate_ratio_current;
	usable += (new_ds - old_ds);
Member

You are correcting reported pool stats, but not individual allocation classes above.

This also makes me wonder what is going to be corrected. Looking at this, I suppose the metaslab stats are not (hopefully not some mix), but looking at dsl_dataset_block_born() using bp_get_dsize_sync(), I suppose the dataset stats will be. And in case an expansion happened before this patch, it will get some wrong mix of corrected and uncorrected values. I think this change might need a pool feature flag, tracking since what TXG we use corrected space usage.

Author

Fixed the per-class stats. The correction is now computed upfront and passed to spa_prop_add_metaslab_class(), so NORMAL_USABLE/NORMAL_AVAILABLE also get corrected.

On the feature flag: you raise a fair point. For blocks born post-expansion but before this patch, the old code used the txg-0 ratio at born time, while the new code would compute a different ratio at free time based on the birth txg.
That's a mismatch for those specific blocks. However for pre-expansion blocks there's no issue since the birth-txg lookup returns the original geometry anyway. The mismatch is bounded to blocks written between the expansion and the patch, and self-corrects as those blocks get rewritten. Do you think that's acceptable or would you prefer a feature flag to track the boundary?

Member

Every time you allocate with one deflation ratio and free with another, you create a leak in absolute bytes instead of percents. While the latter are compensated and only a cosmetic issue, the former I think will result in either unusable space or pool overflow. Neither of those is acceptable.

My main question actually was: can we properly correct metaslab stats instead of this sketchy correction?

Author

Right, I didn't think that through. The byte leak on that transition window is a real issue. When you say properly correct metaslab stats, do you mean updating vdev_deflate_ratio and the accumulated dspace at expansion completion, rather than the post-hoc correction?

Member

I am not sure we can really fix already expanded vdevs, so my thinking is towards adding a new feature, using different (variable) deflate ratios since that TXG, and allowing rewrite to fix accounting of the old data with time.

Author

Thanks for the feedback! Added the raidz_expansion_accounting feature flag as you suggested.

It records the activation txg via enabled_txg and gates per-block ratio tracking in vdev_get_deflate_ratio() so blocks born before the feature use the legacy fixed ratio. No more born/free mismatch risk.

Also added raidz_expand_012_pos to test the feature lifecycle and born/free leak detection.

Looking forward to your thoughts, feels like the last piece of the puzzle to make it right!
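The gating just described can be sketched as follows. The names, ratios, and txg values below are illustrative stand-ins for the PR's vdev_get_deflate_ratio() with the cached enabled_txg, not the actual implementation.

```c
/*
 * Sketch of how the raidz_expansion_accounting feature gates per-block
 * deflation.  Names and ratios are illustrative stand-ins.
 */
#include <stdint.h>

#define	LEGACY_RATIO	256	/* fixed txg-0 ratio (pre-feature accounting) */

/*
 * Toy per-birth-txg ratio: a single expansion at txg 5000 to a wider
 * layout with a better data-to-parity ratio.
 */
static uint64_t
per_birth_ratio(uint64_t birth_txg)
{
	return (birth_txg < 5000 ? 256 : 306);
}

/*
 * acct_txg is the txg at which the feature was activated (0 if never).
 * Blocks born before it keep the legacy fixed ratio, matching how they
 * were accounted when written; later blocks use the per-birth ratio.
 */
static uint64_t
gated_deflate_ratio(uint64_t birth_txg, uint64_t acct_txg)
{
	if (acct_txg == 0 || birth_txg < acct_txg)
		return (LEGACY_RATIO);
	return (per_birth_ratio(birth_txg));
}
```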

	int64_t new_ds = (space >> SPA_MINBLOCKSHIFT) *
	    vd->vdev_deflate_ratio_current;
	spa->spa_rdspace += (new_ds - old_ds);
Member

spa_rdspace above is calculated based only on the normal class, while here you correct for all classes.

Author

Good catch, added a vd->vdev_mg->mg_class filter. Same in spa_prop_get_config().

	 * This block was born after an expansion changed the geometry.
	 * Compute the ratio from the birth-txg-specific asize.
	 */
	return ((1 << 17) / (asize_at_birth >> SPA_MINBLOCKSHIFT));
Member

Don't you overengineer something here?

Author

@Skountz Skountz Mar 16, 2026

Simplified the function. For non-expanded vdevs it now returns the cached ratio immediately. The dynamic computation via vdev_psize_to_asize_txg() is still there for expanded RAIDZ because there can be multiple expansions with different intermediate geometries (the AVL tree can have several entries), so we can't just pick between the two cached ratios. But if you think supporting only old vs current is enough in practice, that would simplify things further. Let me know what you think.

(Edit: I reverted the simplification — the code is back to the version you commented. I tried adding a fast path using vdev_deflate_ratio_current to skip the AVL lookups for non-expanded vdevs, but it caused hangs during expansion tests. Even with the asize_at_birth == asize_at_txg0 comparison preserved, just adding the vdev_deflate_ratio_current check was enough to trigger the issue. I haven't been able to pinpoint the root cause yet. If you have a simpler approach in mind, I'm open to it.)

Member

We have to support all the ratios that existed there to make the accounting match.

Author

@Skountz Skountz Mar 17, 2026

Got it, thanks for confirming. So the code already covers this, then?

bp_get_dsize(spa_t *spa, const blkptr_t *bp)
{
	uint64_t dsize = 0;
	uint64_t birth_txg = BP_GET_PHYSICAL_BIRTH(bp);
Member

This made me think about what time should be used here. On one hand, dataset space accounting is done at BP_GET_BIRTH(bp) IIRC. On the other, in case of dedup/cloning the deflation might be from before the expansion, as indicated by the physical time. In case of remap the physical birth might be ahead of the logical, but we don't have raidz removal now...

Author

Added a comment. I went with physical birth because the deflation ratio should reflect the geometry when the DVAs were actually allocated. For dedup/clones the logical birth can be newer but the underlying DVAs still live in whatever geometry existed at physical birth, so I think it's correct.
As you say the remap case isn't a concern today, but would need to be revisited if raidz removal ever becomes a thing. Let me know if you see a case where this breaks down.

Member

but the underlying DVAs still live in whatever geometry existed at physical birth, so I think it's correct

Right. We just need to make sure that dataset accounting uses the same TXG, so that the numbers match.

but would need to be revisited if raidz removal ever becomes a thing

Thinking like that may result in RAIDZ removal never being implemented, as it becomes even more complicated. We should at least try to think ahead.

Author

Apologies, that came off wrong (my English is far from perfect!).
In the current code, both born and kill use BP_GET_PHYSICAL_BIRTH so the ratio matches for a given BP.
But if remap changes the physical birth to a post-expansion txg for a block originally born pre-expansion, that would create a born/free mismatch.

Author

@amotin You were right to push on this. After a few sleepless nights and ~20 VM resets from kernel crashes, I have a working RAIDZ top-level vdev removal branch (with tests!) that handles the accounting interaction properly. This PR is a blocker for it, I'll push once this is merged, let's keep the (hopefully) good surprise for later.

@owlshrimp owlshrimp Mar 19, 2026

How are you handling the reduction in redundancy (afaik) that has made raidz removal impossible up until now? (being as removal works by mapping the vdev's blocks and their snapshots to virtual equivalents on other vdevs)

This is me guessing what the problem was from a barely-remembered 2nd-hand description, but I think the issue was somewhere along the lines of "if you have a 1+5 wide raidz removed and turned into a virtual device on a 1+2 wide raidz, then losing one disk in the new 1+2 raidz loses two blocks per stripe from the (now virtual) 1+5, killing its ability to be read even though the underlying 1+2 still works."

Author

@Skountz Skountz Mar 19, 2026

Good question again! What you describe is essentially RAIDZ shrinking, which amotin himself stated is "impossible even theoretically" in the original issue (#9013). The destination vdevs can't be RAIDZ at all; spa_vdev_remove_top_check() already rejects removal if any destination vdev has parity, so the remapped blocks land on mirrors or singletons.

The actual hard part was that the existing removal code copies raw byte segments, which doesn't work for RAIDZ because you need block boundaries to read through the parity layout. The solution builds a reverse map of block boundaries first, then copies each block individually through the RAIDZ I/O path.

(But that's another story, let's not make too much noise about it on this PR :) Feel free to DM me if you want more details!)

@owlshrimp owlshrimp Mar 20, 2026

This raises a different problem: you can only remove a raidz from an array where it is the one and ONLY raidz. Unless you only allow allocation on vdevs that are non-raidz out of a mix of {raidz, mirror, singleton}, and even then you may make the fullness of the vdevs in the array very unbalanced. I think anyraid ran into a balance problem like this.

@owlshrimp owlshrimp Mar 20, 2026

I could see removal *to raidz vdevs* being possible only in the extremely undesirable situation where blocks in the virtual raidz were duplicated such that the copies could collectively tolerate any combination of underlying raidz members being lost (bringing a heavy space penalty for removal). There is IIRC some support for redundant data even on singletons, but afaik this would largely be entirely new machinery and may not play well with the way block pointers work.

Author

Not quite, you can have multiple RAIDZ vdevs and remove one, as long as there are mirror/singleton destinations.

But let's save the rest for the removal PR.

@Skountz Skountz force-pushed the fix/raidz-expansion-space-accounting branch 2 times, most recently from 73cbd85 to 28b6f95 on March 16, 2026 at 23:14
@Skountz Skountz force-pushed the fix/raidz-expansion-space-accounting branch 2 times, most recently from 66dc636 to b7f308a on March 17, 2026 at 11:38
@owlshrimp

owlshrimp commented Mar 17, 2026

As a side note, do these patches handle vertical expansion of an array, ie. going from 4TB drives to 6TB drives?

@Skountz
Author

Skountz commented Mar 17, 2026

As a side note, do these patches handle vertical expansion of an array, ie. going from 4TB drives to 6TB drives?

No, vertical expansion (replacing drives with larger ones) is a completely separate process from RAIDZ expansion.
It goes through the zpool replace + resilver path, which has its own space accounting and is unrelated to the deflation ratio bug this PR addresses.

@owlshrimp

owlshrimp commented Mar 17, 2026

As a side note, do these patches handle vertical expansion of an array, ie. going from 4TB drives to 6TB drives?

No, vertical expansion (replacing drives with larger ones) is a completely separate process from RAIDZ expansion. It goes through the zpool replace + resilver path, which has its own space accounting and is unrelated to the deflation ratio bug this PR addresses.

I simply mean: is that change in size another thing that might need to be accounted for in the corrections this patchset calculates? As a potential source of bugs.

@Skountz Skountz force-pushed the fix/raidz-expansion-space-accounting branch from b7f308a to ba1ccdd on March 17, 2026 at 20:42
@Skountz
Author

Skountz commented Mar 17, 2026

I just simply mean, is that change in size another thing that might need to be accounted for in the corrections this patchset calculates? As a potential source of bugs.

Good question but no, drive replacement doesn't change the deflation ratio.
It only depends on the number of disks (parity-to-data ratio), not their individual sizes.

@Skountz Skountz force-pushed the fix/raidz-expansion-space-accounting branch 3 times, most recently from 43a6a78 to 86335d8 on March 20, 2026 at 21:29
After RAIDZ expansion, the deflation ratio used for space
accounting remains at the original (pre-expansion) geometry
because vdev_set_deflate_ratio() intentionally hard-codes
txg 0 to avoid inconsistently accounting for existing blocks.
This causes zpool list to underreport usable capacity.  For
example, expanding a 4-disk RAIDZ2 to 5 disks improves the
data-to-parity ratio from 2/4 to 3/5, a ~20% increase in
usable capacity per unit of raw storage, but the reported
capacity does not reflect this improvement.

Fix this in three parts:

1. Add vdev_deflate_ratio_current to vdev_t, computed using
   UINT64_MAX (current geometry).  Apply a dspace correction
   in spa_update_dspace() and spa_prop_get_config() using the
   difference between the two ratios, so that reported pool
   capacity reflects the current parity layout.  The correction
   is scoped to the normal metaslab class (where RAIDZ vdevs
   reside) and is applied to both per-class and aggregate stats.
   The original txg-0 ratio is preserved in vdev_deflated_space()
   for self-consistent metaslab accounting and persistent
   DN_USED_BYTES tracking.

2. Add vdev_get_deflate_ratio(vd, birth_txg) which returns
   the correct deflation ratio for a block based on its
   physical birth txg.  For non-expanded vdevs the cached
   ratio is returned immediately; for expanded RAIDZ, the
   geometry at birth is looked up via vdev_psize_to_asize_txg()
   and the existing reflow_node_t expansion history AVL tree.
   Modify dva_get_dsize_sync() to accept a birth_txg parameter
   and bp_get_dsize_sync() to use BP_GET_PHYSICAL_BIRTH(),
   which reflects the actual on-disk allocation geometry even
   for dedup/clone blocks whose logical birth may differ.

3. Add the raidz_expansion_accounting feature flag, which
   records the txg at which per-block deflation ratio tracking
   was enabled.  Blocks born before this txg use the legacy
   fixed ratio (matching how they were originally accounted);
   blocks born at or after use the per-birth-txg ratio.  This
   prevents born/free accounting mismatches for blocks written
   between a pre-patch expansion and the first expansion with
   this code.  The feature is activated in
   vdev_raidz_attach_sync() on the first expansion and its
   enabled_txg is cached in spa_raidz_expand_acct_txg for
   efficient runtime lookup.

The three changes together maintain self-consistency: metaslab
accounting uses the original ratio (alloc/free always
balance), capacity reporting is corrected at the spa level,
and per-block accounting uses stable birth-txg ratios.  No
negative overflows or positive leaks are possible.  As old
blocks are rewritten to the new geometry, the accounting
converges to exact values.

Signed-off-by: Skountz <dev@frenchbytes.fr>
@Skountz Skountz force-pushed the fix/raidz-expansion-space-accounting branch from 86335d8 to f759645 on March 20, 2026 at 21:30

Successfully merging this pull request may close these issues:

  • RAIDZ2 Expansion: Unexplained Space Loss & 2x Overhead
  • Significantly inaccurate available space after multiple expansions and then a zfs rewrite

3 participants