Fix s_active leak in zfsvfs_hold() when z_unmounted is true #18310

Open
mischivus wants to merge 3 commits into openzfs:master from mischivus:patch-1

Conversation

@mischivus

Fixes #18309

When getzfsvfs() succeeds (incrementing s_active via zfs_vfs_ref()), but z_unmounted is subsequently found to be B_TRUE, zfsvfs_hold() returns EBUSY without calling zfs_vfs_rele(). This permanently leaks the VFS superblock s_active reference, preventing generic_shutdown_super() from ever firing, which blocks dmu_objset_disown() and makes the pool permanently unexportable.

Add the missing zfs_vfs_rele() call, guarded by zfs_vfs_held() to handle the zfsvfs_create() fallback path where no VFS reference exists. This matches the existing cleanup pattern in zfsvfs_rele() immediately below in the same file.

See #18309 for full root cause analysis, reachability proof, and live system evidence including kernel module s_active read.

@lundman
Contributor

lundman commented Mar 12, 2026

Good find - it seems to me that FreeBSD, Windows, and macOS have the same code and would need similar fixes.

@robn
Member

robn commented Mar 12, 2026

Mm, interesting.

#18310 seems legit, though it's extremely difficult to follow. At least some of that is the whole interaction between the ioctls and zfsvfs being a mass of spaghetti and tears. I wonder, did this actually come up in production, or was it discovered via code read, static analysis, some other means?

That said, I agree there's a leak of some sort here, but I'm not entirely sure if this is the right way to close it. I'm wondering why the answer is not to just call zfsvfs_rele()? Is there not a similar situation where getzfsvfs() fails, zfsvfs_create() succeeds, and then is marked for unmount by some other thread? If so, who is doing the other side of zfsvfs_rele(): dmu_objset_disown() and zfsvfs_free()?

It might not be an issue at all, of course, but the path here is not obvious to me without a lot more study. And, I wonder if that's anything to do with the failing tests, of which many seem to involve unmounts and exports.

@robn robn added the labels "Status: Code Review Needed" (ready for review and testing) and "Status: Feedback requested" (more information is requested) on Mar 12, 2026
@lundman
Contributor

lundman commented Mar 12, 2026

I assumed, perhaps incorrectly, that getzfsvfs() will work for existing things, but is expected to fail in some situations - so perhaps it would be better with a holding = getzfsvfs(); then free based on that.

@mischivus
Author

mischivus commented Mar 12, 2026

I discovered this entirely by accident, using ZFS in a really unusual configuration with an Optane card, a spinning-platter array, and whatnot.

I'm an experienced SWE (25 years) and C is one of my languages. After spending about a hundred hours on a deployment script testing EVERY possible workaround, I isolated it to something kernel-level with the filesystem, then eventually ZFS.

So I pulled down the symbols/codebase for ZFS - which is enormous - and poked around trying to trigger it with a debugger attached. Which I was able to do intermittently. So I knew I was on to something.

At that point, not wanting to learn the entire ZFS codebase, I put 150 bucks into the "Google Cannon" and had Gemini 3.1 PRO go at it using static-dynamic analysis (it can DO that now!), based on the specific debug traces.

Gemini found the problem - but it was a race condition - and would be a PAIN to prove. And obviously I didn't want to waste your time with AI slop.

So I used Claude Opus 4.6 to craft a "prover" C code snippet, and the automation to absolutely HAMMER the system with it - because to get this to trigger, the requests have to land within maybe 100 nanoseconds of each other (this is a guess, not an exact measurement). I let that cook for a few runs, and on my hardware after a few minutes it consistently got into the locked state. Depending on your hardware you may have to run the "prover" code 20+ times.

At this point I was about ten hours in, so I figured "hey, why not, let's write this one up for them!".

I read through the general flow of the code to get a "somewhat working" understanding of how that portion of ZFS worked, then I went through the code I'd generated and hand-checked everything. All looked good.

I wrote a 2-liner patch for you guys, wrote-up a ticket, and sent in a PR.

I don't understand ZFS well enough to fully and completely test this. It would take me months to learn your codebase. But hey, it's a start. Now I've handed it over to you because you're the experts.

TLDR:

  • Discovered by accident
  • Confirmed on live system via debugger
  • Isolated using Gemini 3.1 PRO AI
  • Proven, on a live system, using C-code written by Claude Opus 4.6 AI
  • All code manually reviewed by a polyglot SWE with 25 years experience
  • Solution is, in theory, a two-line change
  • Verifying it works in all scenarios is the ZFS team's job (because it has to work on multiple distros)
  • I'm a real person, I heavily use ZFS, if you need anything else, just ask ... :)

@behlendorf
Contributor

I'm wondering why the answer is not to just call zfsvfs_rele()

It's definitely convoluted, but unconditionally calling zfs_vfs_rele() looks like the right fix to me. On Linux the caller will have its own valid reference that needs to be dropped, and the racing unmount will have its own. On FreeBSD the man pages for vfs_busy() and vfs_unbusy() seem to say they work the same way. If zfs_vfs_ref() on Windows and macOS behaves the same way, this may be a one-line fix!

@mischivus can you drop the conditional and run this through your "prover" C code snippet stress tester again?

diff --git a/module/zfs/zfs_ioctl.c b/module/zfs/zfs_ioctl.c
index 3bbc9107ae..4fcfd18e50 100644
--- a/module/zfs/zfs_ioctl.c
+++ b/module/zfs/zfs_ioctl.c
@@ -1439,6 +1439,7 @@ zfsvfs_hold(const char *name, const void *tag, zfsvfs_t **zfvp,
                         * objset from the zfsvfs.
                         */
                        ZFS_TEARDOWN_EXIT(*zfvp, tag);
+                       zfs_vfs_rele(*zfvp);
                        return (SET_ERROR(EBUSY));
                }
        }

@mischivus
Author

mischivus commented Mar 13, 2026

I can certainly test that, but I'd have to provision a machine - which takes time.

I included both of the scripts on the ticket this PR is attached to ...here they are again so you can test locally.

Let me know if you still need me to run it on one of my franken-servers.

repro_zfsvfs_hold.sh
tight_loop.c

@behlendorf
Contributor

I'll see if I can reproduce it locally with your reproducer. In the meantime can you update this PR to drop the conditional.

@mischivus
Author

mischivus commented Mar 13, 2026

@behlendorf Updated per your request. Let me know if you cannot reproduce. I need ~2 hours to provision a server in a configuration that reliably triggers it. If you need me to provision a server, specify any instrumentation/telemetry you would like on it. I do not typically debug filesystems at kernel level ... :)

@behlendorf
Contributor

How long does it usually take in your environment? The test has been running for ~600 iterations now and I haven't hit it. I'll leave it running over the weekend, but if I still haven't reproduced by Monday it would be great if you could verify the updated PR in your environment.

@mischivus
Author

I was able to trigger it within ~20 iterations.

@mischivus
Author

mischivus commented Mar 13, 2026

Since it's Friday, I can just hijack some hardware I have up right now and run it. 80% odds I'll get a hit. Just list off all the instrumentation packages / telemetry I should install so I can give you an "X-ray" of the system for deep insight.

@behlendorf
Contributor

If you can install a build of ZFS with the updated one-line patch applied, then verify you're no longer able to reproduce the problem, that would be sufficient. It sounds like a few hundred iterations should be enough, but more is better. I still plan to leave my tests running over the weekend.

@mischivus
Author

@behlendorf Kill the reproducer running on your system. It's insufficient and will probably never trip. It tripped on my system because of existing system state. I'm writing something more robust, and will post it to this PR thread once I have a "control run" against the current build and a "test run" against the updated ZFS build.

@mischivus
Author

@behlendorf I've created a robust one-shot "reproducer" that will trigger the original bug within 60 seconds.

However, after compiling my original updated code, I found a second bug #18333, and updated my fork to correct it.

Then after further testing, I found a third bug #18334 - however this one has multiple possible solutions and requires some executive decisions. So I've filed a ticket for it and left it up to the ZFS team to decide how to move forward.

I've attached my updated "reproducer" script and collected evidence to this comment.

phase1_evidence.txt
prover_test.sh
prover.c

@mischivus
Author

Note also, "checkstyle" is failing with 'error: missing "Signed-off-by"' ...but all my commits have a correct sign-off.

@Skountz

Skountz commented Mar 16, 2026

Note also, "checkstyle" is failing with 'error: missing "Signed-off-by"' ...but all my commits have a correct sign-off.

The checkstyle failure is caused by the Merge branch 'master' into patch-1 commit — GitHub auto-created it when you synced via the UI, and it doesn't include a Signed-off-by: trailer.
The "Verified" badge on it is just GitHub's own GPG signature on the commit object, which is a completely different thing.

If you rebase instead of merge to sync with master, that commit goes away and checkstyle should pass.

When getzfsvfs() succeeds (incrementing s_active via
zfs_vfs_ref()), but z_unmounted is subsequently found to
be B_TRUE, zfsvfs_hold() returns EBUSY without calling
zfs_vfs_rele(). This permanently leaks the VFS superblock
s_active reference, preventing generic_shutdown_super()
from ever firing, which blocks dmu_objset_disown() and
makes the pool permanently unexportable (EBUSY).

Add the missing zfs_vfs_rele() call, guarded by
zfs_vfs_held() to handle the zfsvfs_create() fallback
path where no VFS reference exists. This matches the
existing cleanup pattern in zfsvfs_rele().

Signed-off-by: mischivus <1205832+mischivus@users.noreply.github.com>

Drop zfs_vfs_held() guard per reviewer feedback

The conditional is unnecessary: this path is only reachable
via getzfsvfs() success, which always holds a VFS reference.
The zfsvfs_create() fallback sets z_unmounted = B_FALSE, so
it can never enter this block.

Signed-off-by: mischivus <1205832+mischivus@users.noreply.github.com>

When zfsvfs_init() fails during zfs_resume_fs(), the bail
path called zfs_umount() directly. All three callers
(zfs_ioc_rollback, zfs_ioc_recv_impl, and
zfs_ioc_userspace_upgrade) hold an s_active reference
via getzfsvfs() at entry.

This creates two bugs:

1. Deadlock: zfs_umount() -> zfsvfs_teardown() ->
   txg_wait_synced() blocks in uninterruptible D state.
   The superblock cannot tear down because s_active is
   pinned by the calling thread itself. Survives SIGKILL.
   Blocks clean reboot. Requires hard power cycle.

2. Use-after-free: if txg_wait_synced() returns,
   zfs_umount() calls zfsvfs_free(). The caller then
   dereferences the freed zfsvfs via zfs_vfs_rele().

The explicit zfs_umount() is unnecessary. z_unmounted is
already set to B_TRUE before the locks are released, so
all new operations return EBUSY. When the caller releases
its s_active reference via zfs_vfs_rele(), the standard
VFS lifecycle (deactivate_super -> generic_shutdown_super
-> zpl_kill_sb -> zfs_umount) handles teardown correctly.

Refs: openzfs#18309

Signed-off-by: mischivus <1205832+mischivus@users.noreply.github.com>
Successfully merging this pull request may close these issues.

zfsvfs_hold() leaks VFS s_active reference when z_unmounted is B_TRUE, causing permanent EBUSY on pool export