Fix s_active leak in zfsvfs_hold() when z_unmounted is true #18310
mischivus wants to merge 3 commits into openzfs:master
Conversation
|
Good find - it seems to me that FreeBSD, Windows, and macOS have the same code and would need similar fixes.
|
Mm, interesting. #18310 seems legit, though it's extremely difficult to follow. At least some of that is the whole interaction between the ioctls and zfsvfs being a mass of spaghetti and tears. I wonder, did this actually come up in production, or was it discovered via code read, static analysis, or some other means? That said, I agree there's a leak of some sort here, but I'm not entirely sure this is the right way to close it. I'm wondering why the answer is not to just call zfs_vfs_rele() unconditionally. It might not be an issue at all, of course, but the path here is not obvious to me without a lot more study. And I wonder if that has anything to do with the failing tests, many of which seem to involve unmounts and exports.
|
I assumed, perhaps incorrectly, that |
|
I discovered this entirely by accident, using ZFS in a really unusual configuration with an Optane card, a spinning-platter array, and whatnot. I'm an experienced SWE (25 years) and C is one of my languages. After spending about a hundred hours on a deployment script testing EVERY possible workaround, I isolated it to something kernel-level with the filesystem, then eventually to ZFS. So I pulled down the symbols/codebase for ZFS - which is enormous - and poked around trying to trigger it with a debugger attached. Which I was able to do intermittently. So I knew I was on to something. At that point, not wanting to learn the entire ZFS codebase, I put 150 bucks into the "Google Cannon" and had Gemini 3.1 PRO go at it using static-dynamic analysis (it can DO that now!), based on the specific debug traces. Gemini found the problem - but it was a race condition - and it would be a PAIN to prove. And obviously I didn't want to waste your time with AI slop. So I used Claude Opus 4.6 to craft a "prover" C code snippet, and the automation to absolutely HAMMER the system with it - because to get this to trigger, the requests have to land within maybe 100 nanoseconds of each other (a guess, not an exact measurement). I let that cook for a few runs, and on my hardware it consistently got into the locked state after a few minutes. Depending on your hardware you may have to run the "prover" code 20+ times. At this point I was about ten hours in, so I figured "hey, why not, let's write this one up for them!". I read through the general flow of the code to get a "somewhat working" understanding of how that portion of ZFS worked, then I went through the code I'd generated and hand-checked everything. All looked good. I wrote a two-line patch for you guys, wrote up a ticket, and sent in a PR. I don't understand ZFS well enough to fully and completely test this. It would take me months to learn your codebase. But hey, it's a start.
Now I've handed it over to you because you're the experts. TLDR:
|
It's definitely convoluted, but unconditionally calling zfs_vfs_rele() here should be safe. @mischivus can you drop the conditional and run this through your "prover" C code snippet stress tester again?

```diff
diff --git a/module/zfs/zfs_ioctl.c b/module/zfs/zfs_ioctl.c
index 3bbc9107ae..4fcfd18e50 100644
--- a/module/zfs/zfs_ioctl.c
+++ b/module/zfs/zfs_ioctl.c
@@ -1439,6 +1439,7 @@ zfsvfs_hold(const char *name, const void *tag, zfsvfs_t **zfvp,
 		 * objset from the zfsvfs.
 		 */
 		ZFS_TEARDOWN_EXIT(*zfvp, tag);
+		zfs_vfs_rele(*zfvp);
 		return (SET_ERROR(EBUSY));
 	}
 }
```
|
I can certainly test that, but I'd have to provision a machine - which takes time. I included both of the scripts on the ticket this PR is attached to ...here they are again so you can test locally. Let me know if you still need me to run it on one of my franken-servers.
|
I'll see if I can reproduce it locally with your reproducer. In the meantime, can you update this PR to drop the conditional?
|
@behlendorf Updated per your request. Let me know if you cannot reproduce. I need ~2 hours to provision a server in a configuration that reliably triggers it. If you need me to provision a server, specify any instrumentation/telemetry you would like on it. I do not typically debug filesystems at the kernel level ... :)
|
How long does it usually take in your environment? The test has been running for ~600 iterations now and I haven't hit it. I'll leave it running over the weekend, but if I still haven't reproduced it by Monday, it would be great if you could verify the updated PR in your environment.
|
I was able to trigger it within ~20 iterations. |
|
Since it's Friday, I can just hijack some hardware I have up right now and run it. 80% odds I'll get a hit. Just list off all the instrumentation packages / telemetry I should install so I can give you an "X-ray" of the system and you can have deep insight.
|
If you can install a build of ZFS with the updated one-line patch applied, then verifying you're no longer able to reproduce the problem after that would be sufficient. It sounds like a few hundred iterations should be enough, but more is better. I still plan to leave my tests running over the weekend.
|
@behlendorf Kill the reproducer running on your system. It's insufficient and will probably never trip. It tripped on my system because of existing system state. I'm writing something more robust, and will post it to this PR thread once I have a "control run" against the current build and a "test run" against the updated ZFS build.
|
@behlendorf I've created a robust one-shot "reproducer" that will trigger the original bug within 60 seconds. However, after compiling my original updated code, I found a second bug #18333, and updated my fork to correct it. Then after further testing, I found a third bug #18334 - however this one has multiple possible solutions and requires some executive decisions. So I've filed a ticket for it and left it up to the ZFS team to decide how to move forward. I've attached my updated "reproducer" script and collected evidence to this comment. |
|
Note also, "checkstyle" is failing with 'error: missing "Signed-off-by"' ...but all my commits have a correct sign-off. |
The checkstyle failure is caused by the merge commit. If you rebase instead of merge to sync with master, that commit goes away and checkstyle should pass.
When getzfsvfs() succeeds (incrementing s_active via zfs_vfs_ref()), but z_unmounted is subsequently found to be B_TRUE, zfsvfs_hold() returns EBUSY without calling zfs_vfs_rele(). This permanently leaks the VFS superblock s_active reference, preventing generic_shutdown_super() from ever firing, which blocks dmu_objset_disown() and makes the pool permanently unexportable (EBUSY).

Add the missing zfs_vfs_rele() call, guarded by zfs_vfs_held() to handle the zfsvfs_create() fallback path where no VFS reference exists. This matches the existing cleanup pattern in zfsvfs_rele().

Signed-off-by: mischivus <1205832+mischivus@users.noreply.github.com>
Drop zfs_vfs_held() guard per reviewer feedback

The conditional is unnecessary: this path is only reachable via getzfsvfs() success, which always holds a VFS reference. The zfsvfs_create() fallback sets z_unmounted = B_FALSE, so it can never enter this block.

Signed-off-by: mischivus <1205832+mischivus@users.noreply.github.com>
When zfsvfs_init() fails during zfs_resume_fs(), the bail path called zfs_umount() directly. All three callers (zfs_ioc_rollback, zfs_ioc_recv_impl, and zfs_ioc_userspace_upgrade) hold an s_active reference via getzfsvfs() at entry. This creates two bugs:

1. Deadlock: zfs_umount() -> zfsvfs_teardown() -> txg_wait_synced() blocks in uninterruptible D state. The superblock cannot tear down because s_active is pinned by the calling thread itself. Survives SIGKILL. Blocks clean reboot. Requires a hard power cycle.

2. Use-after-free: if txg_wait_synced() returns, zfs_umount() calls zfsvfs_free(). The caller then dereferences the freed zfsvfs via zfs_vfs_rele().

The explicit zfs_umount() is unnecessary. z_unmounted is already set to B_TRUE before the locks are released, so all new operations return EBUSY. When the caller releases its s_active reference via zfs_vfs_rele(), the standard VFS lifecycle (deactivate_super -> generic_shutdown_super -> zpl_kill_sb -> zfs_umount) handles teardown correctly.

Refs: openzfs#18309
Signed-off-by: mischivus <1205832+mischivus@users.noreply.github.com>
Fixes #18309
When `getzfsvfs()` succeeds (incrementing `s_active` via `zfs_vfs_ref()`), but `z_unmounted` is subsequently found to be `B_TRUE`, `zfsvfs_hold()` returns `EBUSY` without calling `zfs_vfs_rele()`. This permanently leaks the VFS superblock `s_active` reference, preventing `generic_shutdown_super()` from ever firing, which blocks `dmu_objset_disown()` and makes the pool permanently unexportable.

Add the missing `zfs_vfs_rele()` call, guarded by `zfs_vfs_held()` to handle the `zfsvfs_create()` fallback path where no VFS reference exists. This matches the existing cleanup pattern in `zfsvfs_rele()` immediately below in the same file.

See #18309 for full root cause analysis, reachability proof, and live system evidence including a kernel-module `s_active` read.