tap: reduce lock contention in switch packet forwarding #613
Open
nirs wants to merge 4 commits into containers:main
When txBuf hits ENOBUFS, it retries in a tight loop while the caller
(txPkt) holds both writeLock and connLock. Since connLock guards the
connection map used by all rx and tx paths, this blocks all network
activity until the kernel frees buffer space, which depends on another
VM draining its socket. To fix this we need to separate writeLock and
connLock so txBuf can retry without holding connLock.

As a first step, move the disconnect call out of txBuf into txPkt,
where connection lifecycle is managed. txBuf was mixing two unrelated
concerns: writing to a connection and managing connection lifetime on
error. This coupling forced txBuf to take a connection id parameter
only to pass it to disconnect, and created an implicit lock ordering
dependency (writeLock -> connLock via disconnect) inside a function
that should only handle writes.

Move disconnect to txPkt, which already holds connLock and has the
connection context. txBuf becomes a pure write function with no
connection id parameter and no lock dependencies beyond writeLock.
This is a refactoring step toward separating connLock and writeLock
to prevent ENOBUFS retry loops from blocking all connections.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Nir Soffer <nirsof@gmail.com>
Move writeLock into txBuf so it is self-contained alongside the write
operation and stream protocol header prepending. Previously writeLock
was taken in txPkt before connLock, creating a writeLock -> connLock
ordering. Now the ordering is connLock -> writeLock, since txPkt holds
connLock when calling txBuf.

connLock is still held during ENOBUFS retry loops in txBuf, blocking
all connections. This is a preparation step for releasing connLock
before calling txBuf, which will prevent ENOBUFS retry loops from
blocking all connections.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Nir Soffer <nirsof@gmail.com>
Previously, txPkt() held connLock for the entire transmit path
including the calls to txBuf(). When txBuf() entered the ENOBUFS
retry loop, connLock was held for the entire duration of the loop.
This blocked any goroutine trying to acquire connLock, including:
- The rx goroutine forwarding broadcast/multicast packets via
rxBuf -> tx -> txPkt (blocked on connLock)
- New VM connections via connect() (blocked on connLock)
- VM disconnections via Accept() defer (blocked on connLock)
In the single-VM case (e.g., podman on macOS), the rx goroutine
enters txPkt for every broadcast packet (ARP, DHCP). While it never
actually writes anything (the only connection is the source, so it
is skipped), it was blocked on connLock waiting for the ENOBUFS loop
to finish. This stalled the rx goroutine, preventing it from reading
packets from the socket, eventually filling gvproxy's receive buffer
and causing the peer (Virtualization.framework) to drop packets.
Release connLock before calling txBuf() in both paths:
- Broadcast: snapshot connections to a local slice under connLock,
excluding the source connection. Release the lock and iterate
outside it, calling txBuf() for each target.
- Unicast: copy one connection from the map under connLock, release
the lock, then call txBuf(). If the connection was removed between
the lookup and the write, the write fails and disconnect handles
the cleanup.
On write errors, txPkt takes connLock around the disconnect() call
since disconnect() expects connLock to be held.
Lock operations per packet:
Before - connLock is nested inside writeLock for the entire path:
writeLock
connLock (nested, held during ENOBUFS retry)
After - normal case, no nesting:
connLock (snapshot/lookup, released before write)
writeLock (inside txBuf)
After - error case, connLock taken twice:
connLock (snapshot/lookup)
writeLock (inside txBuf, write fails)
connLock (disconnect)
The error path takes connLock twice, but errors are rare and the
critical change is that connLock is never held during the ENOBUFS
retry loop.
Previously, connLock serialized all access to connections, so
Write() and Close() on the same connection could not race. With
connLock released before txBuf(), a concurrent disconnect (e.g.
from Accept's defer) can call conn.Close() while txBuf() is in a
Write() or ENOBUFS retry. This is safe because Go's net.Conn
interface guarantees concurrent method safety - Close() will cause
any in-progress Write() to return an error, which txBuf() will
propagate to the caller.
With this change, when txBuf() is blocked on ENOBUFS for a VM, the
broadcast flow in the single-VM case is:
1. rx goroutine reads broadcast from socket
2. rxBuf -> tx -> txPkt
3. txPkt takes connLock, snapshots connections excluding source,
releases connLock
4. Source is the only connection, so targets is empty
5. txPkt returns immediately
6. rxBuf delivers broadcast to gVisor (second if block)
7. rx goroutine continues reading from socket
Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Nir Soffer <nirsof@gmail.com>
disconnect() is called from three places, all of which take connLock
before calling it. Move connLock into disconnect so it manages its
own locking, simplifying all callers.
The Accept() defer is reduced from a 5-line closure to a single
defer statement. The two error paths in txPkt() no longer need
explicit Lock/Unlock around the call.
Note: disconnect still has a lock ordering dependency:
connLock
camLock (nested)
This could be eliminated by removing the defers and using explicit
unlock before taking the next lock, or by extracting smaller
functions for each lock's critical section. This is a pre-existing
issue, not introduced by this change.
Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Nir Soffer <nirsof@gmail.com>
/cc @cfergeau
Motivation
During an 8-hour ramendr stress test (100 sequential runs), gvproxy's data path
permanently froze at run 40, breaking the podman VM's network:
RamenDR/ramen#2428
The VM's virtio-net TX queue stopped forwarding packets while gvproxy's
TCP accept loop remained alive. New TCP connections were established but
no data was sent. The registry cache inside the VM became unreachable,
causing image pulls to hang and fall back to upstream — matching reports
of intermittent container pull failures.
We cannot fully explain the permanent freeze (a goroutine dump during
the event would be needed), but inspecting the code revealed several
problems in the switch's packet forwarding:
ENOBUFS retry loop blocks the entire switch.
txBuf() retries writes in a tight loop on ENOBUFS while the caller
holds connLock, which guards the connection map used by all rx and
tx paths. This blocks all network activity, including the rx
goroutine reading packets from the socket, until the kernel frees
buffer space.
Unnecessary lock nesting.
txPkt() held both writeLock and connLock with defer, keeping both
locked for the entire transmit path including ENOBUFS retries. The
locks serve different purposes (write serialization vs. connection
map access) and don't need to overlap.
Unnecessary locking for rx data path. In the single-VM case
(e.g., podman on macOS), the rx goroutine enters txPkt for every
broadcast packet (ARP, DHCP). It never actually writes anything (the
only connection is the source, so it is skipped), but it was blocked
on connLock waiting for the ENOBUFS loop to finish. This stalled
the rx goroutine, preventing it from reading packets from the socket,
eventually filling gvproxy's receive buffer and causing the peer
to drop packets.
Changes
Move disconnect from txBuf to txPkt — separate connection
lifecycle management from the write function, removing txBuf's
implicit lock ordering dependency.
Move writeLock from txPkt to txBuf — make txBuf self-contained
with its own lock, preparing for connLock separation.
Release connLock before writing to connections — the core fix.
Snapshot connections under connLock, release it, then write without
holding connLock. ENOBUFS retries no longer block the entire switch.
Move connLock into disconnect — simplify callers by letting
disconnect manage its own locking.
Test Results
Setup
gvproxy with a --listen-vfkit unixgram socket; krunkit with
"virtio/net/unixgram: Retry on ENOBUFS" (libkrun#556).
All tests ran with no user workloads on the host. Before and after use the
same VM instance, same krunkit, only gvproxy binary is swapped.
Using more vCPUs gives better results since the VM has no workload
competing with the network stack for CPU time.
Host → VM (single stream, 60s)
Results:
host-to-vm-before.json, host-to-vm-after.json

Single stream, host to VM. The "after" line sits consistently above
"before" for the entire 60 seconds with no overlap.
This direction goes through gvproxy's gVisor TCP/IP stack, then
txPkt → txBuf → unixgram socket → VM. The improvement comes from
reduced lock contention: txPkt now holds connLock only for a brief
map lookup before releasing it, instead of holding it through the
entire txBuf write path.
VM → Host (single stream, 60s)
Results:
vm-to-host-before.json, vm-to-host-after.json

Single stream, VM to host (iperf3 -R). No meaningful difference.
This direction benefits from krunkit's TSO offloading — packets arrive
as 64 KiB datagrams, so per-packet lock overhead is negligible. The
locking fix has little impact when there's only one large packet to
process at a time.
Bidirectional (single stream each direction, 60s)
Results:
bidir-before.json, bidir-after.json

Bidirectional test with one stream in each direction. Both directions
show clear, consistent improvement over the full 60 seconds.
The RX improvement (+13%) is larger than single-stream because under
bidirectional load, the TX path's ENOBUFS retry loops previously
blocked the RX path from entering txPkt (for broadcast forwarding).
With connLock released before txBuf, the RX goroutine passes through
txPkt without waiting for TX writes to complete.
Stress Test (8 streams bidir, zero-copy, 10 minutes)
Results:
stress-before.json, stress-after.json

Aggregate throughput
Aggregate throughput is similar, but the "after" curve is notably
smoother and more consistent over the full 10 minutes. The "before"
curve oscillates as streams compete unfairly for locks.
The headline number is 74% fewer retransmits (945 → 249). Fewer
retransmits means fewer packets were dropped due to gvproxy stalling
under ENOBUFS pressure. This directly addresses the intermittent
container pull failures that motivated this work.
Per-stream fairness (VM → Host)
This is the most striking result. Before: two streams are starved at
0.88 Gbits/sec (5.7x slower than the fastest stream), while other
streams dominate at 5+ Gbits/sec. After: all 8 streams run within
6% of each other.
Standard deviation dropped from 1.89 to 0.09 — a 21x improvement
in fairness.
With the old code, connLock was held through the ENOBUFS retry loop,
creating lock convoys where some streams consistently lost the race
for the lock. With connLock released before writes, all streams get
equal access to the write path.
Per-stream fairness (Host → VM)
Host → VM streams show uneven distribution initially in both runs.
The "after" run converges to a more even distribution in the last
third of the test, while "before" remains spread throughout. The
absolute differences are small (65-103 Mbits/sec range) compared
to the RX direction.
Summary
The locking changes improve throughput in the host → VM direction,
dramatically improve fairness under stress, and reduce packet loss
by 74%. The VM → Host direction (which benefits from TSO offloading)
is unaffected in single-stream tests.
Complete test results:
improved-locking.tar.gz