Skip to content

spa_sync_rewrite_vdev_config: update only special vdevs if present#18269

Open
Pesc0 wants to merge 1 commit intoopenzfs:masterfrom
Pesc0:txg_update_uberblock_on_special
Open

spa_sync_rewrite_vdev_config: update only special vdevs if present#18269
Pesc0 wants to merge 1 commit intoopenzfs:masterfrom
Pesc0:txg_update_uberblock_on_special

Conversation

@Pesc0
Copy link

@Pesc0 Pesc0 commented Feb 28, 2026

Motivation and Context

This change allows to write the uberblock preferentially only on special vdevs if any are present during transaction sync. This allows to keep rotational drives asleep if not used.
This is an attempt to implement what @amotin suggested here.

How Has This Been Tested?

This has been tested on a proxmox system and indeed the issue is fixed: the hdd stays asleep. Please keep in mind that I have no knowledge of the system and don't know what I'm doing, this may break other things that i don't know about. I've tried to keep the changes minimal to reduce potential problems.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@github-actions github-actions bot added the Status: Work in Progress Not yet ready for general review label Feb 28, 2026
@amotin
Copy link
Member

amotin commented Mar 2, 2026

I haven't looked on the code, but another thought I've had is that we could update labels on vdevs that were already written to at the TXG. This way all vdevs that have something new, will actually report it.

@Pesc0
Copy link
Author

Pesc0 commented Mar 2, 2026

Regarding solving the actual problem (disk sleep): sure, this works as well.

Regarding which approach is better: intuitively it seems better to distribute the writes between vdevs, but as you pointed out, if the special class dies the pool is toast anyway. It does not seem one approach is more robust than the other, but please correct me if I'm wrong.

@Pesc0 Pesc0 marked this pull request as ready for review March 7, 2026 15:00
@github-actions github-actions bot added Status: Code Review Needed Ready for review and testing and removed Status: Work in Progress Not yet ready for general review labels Mar 7, 2026
Copy link
Contributor

@behlendorf behlendorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I think we want is to prefer already dirty top-level vdevs and special devices.

module/zfs/spa.c Outdated
vd->vdev_islog ||
!vdev_is_concrete(vd))
!vdev_is_concrete(vd) ||
(has_special_class && vd->vdev_alloc_bias != VDEV_BIAS_SPECIAL))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's also write the config to all top-level vdevs which are already dirty. This way they'll reflect the latest changes for minimal extra cost as @amotin mention. You can check if the top-level vdev was dirtied with txg_list_member(&spa->spa_vdev_txg_list, vd, TXG_CLEAN(txg)).

module/zfs/spa.c Outdated
int svdcount = 0;
int children = rvd->vdev_children;
int c0 = random_in_range(children);
boolean_t has_special_class = spa->spa_special_class->mc_groups != 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use spa_has_special(spa) for this check.

@Pesc0 Pesc0 force-pushed the txg_update_uberblock_on_special branch from 9af3dd0 to 4a24692 Compare March 10, 2026 09:38
@Pesc0
Copy link
Author

Pesc0 commented Mar 10, 2026

Thanks for helping out :)
Reworked the logic to skip vdevs that are NOT dirty.
This keeps the current mechanism that writes to at most SPA_SYNC_MIN_VDEVS (= 3) if the config did not change.
There is no preference for special class with this logic.

I'll test again when I can and will report back if it still works.

module/zfs/spa.c Outdated
vd->vdev_islog ||
!vdev_is_concrete(vd))
!vdev_is_concrete(vd) ||
!txg_list_member(&spa->spa_vdev_txg_list, vd, TXG_CLEAN(txg)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's possible none of the top-level vdevs will be dirty so this gets a bit more complicated. From a policy perspective here's what I think we want.

  1. Prefer updating the config on all dirty top-level vdevs. Minimum 1.
  2. If no top-level vdevs are dirty, but pool contains special vdevs select those.
  3. If no top-level vdevs are dirty, and there are no special vdevs, randomly select up to 3 top-level vdevs (current behavior)

Only updating the special vdevs when present is a smaller change (your original PR), but updating everything that's dirty would be nice and a bit more robust. Having more backup copies of the uberblocks is pretty much always a good thing.

This change allows to write the uberblock only on (in this order of preference):
- vdevs dirtied by the current txg
- special vdevs
- 3 random vdevs

This allows to keep rotational drives asleep if they have not been written to.

Signed-off-by: Pesc0 Pesc0@users.noreply.github.com
@Pesc0 Pesc0 force-pushed the txg_update_uberblock_on_special branch from 4a24692 to d4b80ae Compare March 12, 2026 15:03
@Pesc0
Copy link
Author

Pesc0 commented Mar 12, 2026

Implemented the logic you described. I still need to test if this works, my test machine is a separate system so i first need to push the commit to be able to conveniently pull it there.

I'm not sure if the check

if (vd->vdev_ms_array == 0 ||
    vd->vdev_islog ||
    !vdev_is_concrete(vd))
		continue;

is needed in the first two loops of the function.

Also it's quite hard to keep lines under 80 chars with 8 char tabs and this much level of nesting. A couple of lines are inevitably over 80, hope that's not an issue.

@Pesc0
Copy link
Author

Pesc0 commented Mar 13, 2026

Seems to be working, hdd stays asleep on my test system.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Status: Code Review Needed Ready for review and testing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants