Fix s_active leak in zfsvfs_hold() when z_unmounted is true #18310
mischivus wants to merge 3 commits into openzfs:master
Conversation
|
Good find - it seems to me that FreeBSD, Windows, and macOS have the same code and would need similar fixes.
|
Mm, interesting. #18310 seems legit, though it's extremely difficult to follow. At least some of that is the whole interaction between the ioctls and zfsvfs being a mass of spaghetti and tears. I wonder, did this actually come up in production, or was it discovered via code read, static analysis, or some other means? That said, I agree there's a leak of some sort here, but I'm not entirely sure this is the right way to close it. I'm wondering why the answer is not to just call zfs_vfs_rele() unconditionally. It might not be an issue at all, of course, but the path here is not obvious to me without a lot more study. And I wonder if that has anything to do with the failing tests, many of which seem to involve unmounts and exports.
|
I assumed, perhaps incorrectly, that |
|
I discovered this entirely by accident, using ZFS in a really unusual configuration with an Optane card, a spinning-platter array, and whatnot. I'm an experienced SWE (25 years) and C is one of my languages. After spending about a hundred hours on a deployment script testing EVERY possible workaround, I isolated it to something kernel-level with the filesystem, then eventually to ZFS. So I pulled down the symbols/codebase for ZFS - which is enormous - and poked around trying to trigger it with a debugger attached. Which I was able to do intermittently. So I knew I was on to something. At that point, not wanting to learn the entire ZFS codebase, I put 150 bucks into the "Google Cannon" and had Gemini 3.1 PRO go at it using static-dynamic analysis (it can DO that now!), based on the specific debug traces. Gemini found the problem - but it was a race condition - and it would be a PAIN to prove. And obviously I didn't want to waste your time with AI slop. So I used Claude Opus 4.6 to craft a "prover" C code snippet, and the automation to absolutely HAMMER the system with it - because to get this to trigger, the requests have to land within maybe 100 nanoseconds of each other (a guess, not an exact measurement). I let that cook for a few runs, and on my hardware it consistently got into the locked state after a few minutes. Depending on your hardware you may have to run the "prover" code 20+ times. At this point I was about ten hours in, so I figured "hey, why not, let's write this one up for them!". I read through the general flow of the code to get a "somewhat working" understanding of how that portion of ZFS worked, then I went through the code I'd generated and hand-checked everything. All looked good. I wrote a two-line patch for you guys, wrote up a ticket, and sent in a PR. I don't understand ZFS well enough to fully and completely test this. It would take me months to learn your codebase. But hey, it's a start.
Now I've handed it over to you because you're the experts. TLDR:
|
It's definitely convoluted, but unconditionally calling zfs_vfs_rele() here should be safe. @mischivus can you drop the conditional and run this through your "prover" C code snippet stress tester again?

```diff
diff --git a/module/zfs/zfs_ioctl.c b/module/zfs/zfs_ioctl.c
index 3bbc9107ae..4fcfd18e50 100644
--- a/module/zfs/zfs_ioctl.c
+++ b/module/zfs/zfs_ioctl.c
@@ -1439,6 +1439,7 @@ zfsvfs_hold(const char *name, const void *tag, zfsvfs_t **zfvp,
 		 * objset from the zfsvfs.
 		 */
 		ZFS_TEARDOWN_EXIT(*zfvp, tag);
+		zfs_vfs_rele(*zfvp);
 		return (SET_ERROR(EBUSY));
 	}
 }
```
|
I can certainly test that, but I'd have to provision a machine - which takes time. I included both of the scripts on the ticket this PR is attached to ...here they are again so you can test locally. Let me know if you still need me to run it on one of my franken-servers.
|
I'll see if I can reproduce it locally with your reproducer. In the meantime, can you update this PR to drop the conditional?
|
@behlendorf Updated per your request. Let me know if you cannot reproduce. I need ~2 hours to provision a server in a configuration that reliably triggers it. If you need me to provision a server, specify any instrumentation/telemetry you would like on it. I do not typically debug filesystems at the kernel level ... :)
|
How long does it usually take in your environment? The test has been running for ~600 iterations now and I haven't hit it. I'll leave it running over the weekend, but if I still haven't reproduced it by Monday, it would be great if you could verify the updated PR in your environment.
|
I was able to trigger it within ~20 iterations. |
|
Since it's Friday, I can just hijack some hardware I have up right now and run it. 80% odds I'll get a hit. Just list off all the instrumentation packages / telemetry I should install so I can give you an "X-ray" of the system and you can have deep insight.
|
If you can install a build of ZFS with the updated one-line patch applied, then verifying you're no longer able to reproduce the problem after that would be sufficient. It sounds like a few hundred iterations should be enough, but more is better. I still plan to leave my tests running over the weekend.
|
@behlendorf Kill the reproducer running on your system. It's insufficient and will probably never trip. It tripped on my system because of existing system state. I'm writing something more robust, and will post it to this PR thread once I have a "control run" against the current build and a "test run" against the updated ZFS build.
|
@behlendorf I've created a robust one-shot "reproducer" that will trigger the original bug within 60 seconds. However, after compiling my original updated code, I found a second bug #18333, and updated my fork to correct it. Then after further testing, I found a third bug #18334 - however this one has multiple possible solutions and requires some executive decisions. So I've filed a ticket for it and left it up to the ZFS team to decide how to move forward. I've attached my updated "reproducer" script and collected evidence to this comment. |
|
Note also, "checkstyle" is failing with 'error: missing "Signed-off-by"' ...but all my commits have a correct sign-off. |
The checkstyle failure is caused by the merge commit. If you rebase instead of merge to sync with master, that commit goes away and checkstyle should pass.
When getzfsvfs() succeeds (incrementing s_active via zfs_vfs_ref()), but z_unmounted is subsequently found to be B_TRUE, zfsvfs_hold() returns EBUSY without calling zfs_vfs_rele(). This permanently leaks the VFS superblock s_active reference, preventing generic_shutdown_super() from ever firing, which blocks dmu_objset_disown() and makes the pool permanently unexportable (EBUSY).

Add the missing zfs_vfs_rele() call, guarded by zfs_vfs_held() to handle the zfsvfs_create() fallback path where no VFS reference exists. This matches the existing cleanup pattern in zfsvfs_rele().

Signed-off-by: mischivus <1205832+mischivus@users.noreply.github.com>
Drop zfs_vfs_held() guard per reviewer feedback

The conditional is unnecessary: this path is only reachable via getzfsvfs() success, which always holds a VFS reference. The zfsvfs_create() fallback sets z_unmounted = B_FALSE, so it can never enter this block.

Signed-off-by: mischivus <1205832+mischivus@users.noreply.github.com>
When zfsvfs_init() fails during zfs_resume_fs(), the bail path called zfs_umount() directly. All three callers (zfs_ioc_rollback, zfs_ioc_recv_impl, and zfs_ioc_userspace_upgrade) hold an s_active reference via getzfsvfs() at entry. This creates two bugs:

1. Deadlock: zfs_umount() -> zfsvfs_teardown() -> txg_wait_synced() blocks in uninterruptible D state. The superblock cannot tear down because s_active is pinned by the calling thread itself. Survives SIGKILL. Blocks clean reboot. Requires a hard power cycle.

2. Use-after-free: if txg_wait_synced() returns, zfs_umount() calls zfsvfs_free(). The caller then dereferences the freed zfsvfs via zfs_vfs_rele().

The explicit zfs_umount() is unnecessary. z_unmounted is already set to B_TRUE before the locks are released, so all new operations return EBUSY. When the caller releases its s_active reference via zfs_vfs_rele(), the standard VFS lifecycle (deactivate_super -> generic_shutdown_super -> zpl_kill_sb -> zfs_umount) handles teardown correctly.

Refs: openzfs#18309
Signed-off-by: mischivus <1205832+mischivus@users.noreply.github.com>
Fixes #18309
When `getzfsvfs()` succeeds (incrementing `s_active` via `zfs_vfs_ref()`), but `z_unmounted` is subsequently found to be `B_TRUE`, `zfsvfs_hold()` returns `EBUSY` without calling `zfs_vfs_rele()`. This permanently leaks the VFS superblock `s_active` reference, preventing `generic_shutdown_super()` from ever firing, which blocks `dmu_objset_disown()` and makes the pool permanently unexportable.

Add the missing `zfs_vfs_rele()` call, guarded by `zfs_vfs_held()` to handle the `zfsvfs_create()` fallback path where no VFS reference exists. This matches the existing cleanup pattern in `zfsvfs_rele()` immediately below in the same file.

See #18309 for full root cause analysis, reachability proof, and live system evidence including a kernel-module `s_active` read.