tap: reduce lock contention in switch packet forwarding #613
Open
nirs wants to merge 4 commits into containers:main
When txBuf hits ENOBUFS, it retries in a tight loop while the caller
(txPkt) holds both writeLock and connLock. Since connLock guards the
connection map used by all rx and tx paths, this blocks all network
activity until the kernel frees buffer space, which depends on another
VM draining its socket. To fix this we need to separate writeLock and
connLock so txBuf can retry without holding connLock.

As a first step, move the disconnect call out of txBuf into txPkt,
where connection lifecycle is managed. txBuf was mixing two unrelated
concerns: writing to a connection and managing connection lifetime on
error. This coupling forced txBuf to take a connection id parameter
only to pass it to disconnect, and created an implicit lock ordering
dependency (writeLock -> connLock via disconnect) inside a function
that should only handle writes.

Move disconnect to txPkt, which already holds connLock and has the
connection context. txBuf becomes a pure write function with no
connection id parameter and no lock dependencies beyond writeLock.
This is a refactoring step toward separating connLock and writeLock
to prevent ENOBUFS retry loops from blocking all connections.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Nir Soffer <nirsof@gmail.com>
Move writeLock into txBuf so it is self-contained alongside the write
operation and stream protocol header prepending. Previously writeLock
was taken in txPkt before connLock, creating a writeLock -> connLock
ordering. Now the ordering is connLock -> writeLock, since txPkt holds
connLock when calling txBuf.

connLock is still held during ENOBUFS retry loops in txBuf, blocking
all connections. This is a preparation step for releasing connLock
before calling txBuf, which will prevent ENOBUFS retry loops from
blocking all connections.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Nir Soffer <nirsof@gmail.com>
Previously, txPkt() held connLock for the entire transmit path
including the calls to txBuf(). When txBuf() entered the ENOBUFS
retry loop, connLock was held for the entire duration of the loop.
This blocked any goroutine trying to acquire connLock, including:
- The rx goroutine forwarding broadcast/multicast packets via
rxBuf -> tx -> txPkt (blocked on connLock)
- New VM connections via connect() (blocked on connLock)
- VM disconnections via Accept() defer (blocked on connLock)
In the single-VM case (e.g., podman on macOS), the rx goroutine
enters txPkt for every broadcast packet (ARP, DHCP). While it never
actually writes anything (the only connection is the source, so it
is skipped), it was blocked on connLock waiting for the ENOBUFS loop
to finish. This stalled the rx goroutine, preventing it from reading
packets from the socket, eventually filling gvproxy's receive buffer
and causing the peer (Virtualization.framework) to drop packets.
Release connLock before calling txBuf() in both paths:
- Broadcast: snapshot connections to a local slice under connLock,
excluding the source connection. Release the lock and iterate
outside it, calling txBuf() for each target.
- Unicast: copy one connection from the map under connLock, release
the lock, then call txBuf(). If the connection was removed between
the lookup and the write, the write fails and disconnect handles
the cleanup.
On write errors, txPkt takes connLock around the disconnect() call
since disconnect() expects connLock to be held.
Lock operations per packet:
Before - connLock is nested inside writeLock for the entire path:
writeLock
connLock (nested, held during ENOBUFS retry)
After - normal case, no nesting:
connLock (snapshot/lookup, released before write)
writeLock (inside txBuf)
After - error case, connLock taken twice:
connLock (snapshot/lookup)
writeLock (inside txBuf, write fails)
connLock (disconnect)
The error path takes connLock twice, but errors are rare and the
critical change is that connLock is never held during the ENOBUFS
retry loop.
Previously, connLock serialized all access to connections, so
Write() and Close() on the same connection could not race. With
connLock released before txBuf(), a concurrent disconnect (e.g.
from Accept's defer) can call conn.Close() while txBuf() is in a
Write() or ENOBUFS retry. This is safe because Go's net.Conn
interface guarantees concurrent method safety - Close() will cause
any in-progress Write() to return an error, which txBuf() will
propagate to the caller.
With this change, when txBuf() is blocked on ENOBUFS for a VM, the
broadcast flow in the single-VM case is:
1. rx goroutine reads broadcast from socket
2. rxBuf -> tx -> txPkt
3. txPkt takes connLock, snapshots connections excluding source,
releases connLock
4. Source is the only connection, so targets is empty
5. txPkt returns immediately
6. rxBuf delivers broadcast to gVisor (second if block)
7. rx goroutine continues reading from socket
Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Nir Soffer <nirsof@gmail.com>
disconnect() is called from three places, all of which take connLock
before calling it. Move connLock into disconnect so it manages its
own locking, simplifying all callers.
The Accept() defer is reduced from a 5-line closure to a single
defer statement. The two error paths in txPkt() no longer need
explicit Lock/Unlock around the call.
Note: disconnect still has a lock ordering dependency:
connLock
camLock (nested)
This could be eliminated by removing the defers and using explicit
unlock before taking the next lock, or by extracting smaller
functions for each lock's critical section. This is a pre-existing
issue, not introduced by this change.
Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Nir Soffer <nirsof@gmail.com>
/cc @cfergeau
Motivation
During an 8-hour ramendr stress test (100 sequential runs), gvproxy's data path
permanently froze at run 40, breaking the podman VM's network:
RamenDR/ramen#2428
The VM's virtio-net TX queue stopped forwarding packets while gvproxy's
TCP accept loop remained alive. New TCP connections were established but
no data was sent. The registry cache inside the VM became unreachable,
causing image pulls to hang and fall back to upstream — matching reports
of intermittent container pull failures.
We cannot fully explain the permanent freeze (a goroutine dump during
the event would be needed), but inspecting the code revealed several
problems in the switch's packet forwarding:
ENOBUFS retry loop blocks the entire switch.
txBuf() retries writes in a tight loop on ENOBUFS while the caller
holds connLock, which guards the connection map used by all rx and
tx paths. This blocks all network activity, including the rx
goroutine reading packets from the socket, until the kernel frees
buffer space.
Unnecessary lock nesting.
txPkt() held both writeLock and connLock with defer, keeping both
locked for the entire transmit path including ENOBUFS retries. The
locks serve different purposes (write serialization vs. connection
map access) and don't need to overlap.
Unnecessary locking for rx data path. In the single-VM case
(e.g., podman on macOS), the rx goroutine enters txPkt for every
broadcast packet (ARP, DHCP). It never actually writes anything (the
only connection is the source, so it is skipped), but it was blocked
on connLock waiting for the ENOBUFS loop to finish. This stalled
the rx goroutine, preventing it from reading packets from the socket,
eventually filling gvproxy's receive buffer and causing the peer
to drop packets.
Changes
Move disconnect from txBuf to txPkt — separate connection
lifecycle management from the write function, removing txBuf's
implicit lock ordering dependency.
Move writeLock from txPkt to txBuf — make txBuf self-contained
with its own lock, preparing for connLock separation.
Release connLock before writing to connections — the core fix.
Snapshot connections under connLock, release it, then write without
holding connLock. ENOBUFS retries no longer block the entire switch.
Move connLock into disconnect — simplify callers by letting
disconnect manage its own locking.
Test Results
Setup
gvproxy with a --listen-vfkit unixgram socket; krunkit with
"virtio/net/unixgram: Retry on ENOBUFS" (libkrun#556).
All tests ran with no user workloads on the host. Before and after use the
same VM instance, same krunkit, only gvproxy binary is swapped.
Using more vCPUs gives better results since the VM has no workload
competing with the network stack for CPU time.
Host → VM (single stream, 60s)
Results:
host-to-vm-before.json, host-to-vm-after.json

Single stream, host to VM. The "after" line sits consistently above
"before" for the entire 60 seconds with no overlap.
This direction goes through gvproxy's gVisor TCP/IP stack, then
txPkt → txBuf → unixgram socket → VM. The improvement comes from
reduced lock contention: txPkt now holds connLock only for a brief
map lookup before releasing it, instead of holding it through the
entire txBuf write path.
VM → Host (single stream, 60s)
Results:
vm-to-host-before.json, vm-to-host-after.json

Single stream, VM to host (iperf3 -R). No meaningful difference.
This direction benefits from krunkit's TSO offloading — packets arrive
as 64 KiB datagrams, so per-packet lock overhead is negligible. The
locking fix has little impact when there's only one large packet to
process at a time.
Bidirectional (single stream each direction, 60s)
Results:
bidir-before.json, bidir-after.json

Bidirectional test with one stream in each direction. Both directions
show clear, consistent improvement over the full 60 seconds.
The RX improvement (+13%) is larger than single-stream because under
bidirectional load, the TX path's ENOBUFS retry loops previously
blocked the RX path from entering txPkt (for broadcast forwarding).
With connLock released before txBuf, the RX goroutine passes through
txPkt without waiting for TX writes to complete.
Stress Test (8 streams bidir, zero-copy, 10 minutes)
Results:
stress-before.json, stress-after.json

Aggregate throughput
Aggregate throughput is similar, but the "after" curve is notably
smoother and more consistent over the full 10 minutes. The "before"
curve oscillates as streams compete unfairly for locks.
The headline number is 74% fewer retransmits (945 → 249). Fewer
retransmits means fewer packets were dropped due to gvproxy stalling
under ENOBUFS pressure. This directly addresses the intermittent
container pull failures that motivated this work.
Per-stream fairness (VM → Host)
This is the most striking result. Before: two streams are starved at
0.88 Gbits/sec (5.7x slower than the fastest stream), while other
streams dominate at 5+ Gbits/sec. After: all 8 streams run within
6% of each other.
Standard deviation dropped from 1.89 to 0.09 — a 21x improvement
in fairness.
With the old code, connLock was held through the ENOBUFS retry loop,
creating lock convoys where some streams consistently lost the race
for the lock. With connLock released before writes, all streams get
equal access to the write path.
Per-stream fairness (Host → VM)
Host → VM streams show uneven distribution initially in both runs.
The "after" run converges to a more even distribution in the last
third of the test, while "before" remains spread throughout. The
absolute differences are small (65-103 Mbits/sec range) compared
to the RX direction.
Summary
The locking changes improve throughput in the host → VM direction,
dramatically improve fairness under stress, and reduce packet loss
by 74%. The VM → Host direction (which benefits from TSO offloading)
is unaffected in single-stream tests.
Complete test results:
improved-locking.tar.gz