
tap: reduce lock contention in switch packet forwarding #613

Open
nirs wants to merge 4 commits into containers:main from nirs:improve-locking

Conversation


@nirs nirs commented Feb 20, 2026

Motivation

During an 8-hour ramendr stress test (100 sequential runs), gvproxy's data path
permanently froze at run 40, breaking the podman VM's network:
RamenDR/ramen#2428

The VM's virtio-net TX queue stopped forwarding packets while gvproxy's
TCP accept loop remained alive. New TCP connections were established but
no data was sent. The registry cache inside the VM became unreachable,
causing image pulls to hang and fall back to upstream — matching reports
of intermittent container pull failures.

We cannot fully explain the permanent freeze (a goroutine dump during
the event would be needed), but inspecting the code revealed several
problems in the switch's packet forwarding:

  1. ENOBUFS retry loop blocks the entire switch. txBuf() retries
    writes in a tight loop on ENOBUFS while the caller holds connLock,
    which guards the connection map used by all rx and tx paths. This
    blocks all network activity — including the rx goroutine reading
    packets from the socket — until the kernel frees buffer space.

  2. Unnecessary lock nesting. txPkt() held both writeLock and
    connLock with defer, keeping both locked for the entire transmit
    path including ENOBUFS retries. The locks serve different purposes
    (write serialization vs. connection map access) and don't need to
    overlap.

  3. Unnecessary locking for rx data path. In the single-VM case
    (e.g., podman on macOS), the rx goroutine enters txPkt for every
    broadcast packet (ARP, DHCP). It never actually writes anything (the
    only connection is the source, so it is skipped), but it was blocked
    on connLock waiting for the ENOBUFS loop to finish. This stalled
    the rx goroutine, preventing it from reading packets from the socket,
    eventually filling gvproxy's receive buffer and causing the peer
    to drop packets.

Changes

  1. Move disconnect from txBuf to txPkt — separate connection
    lifecycle management from the write function, removing txBuf's
    implicit lock ordering dependency.

  2. Move writeLock from txPkt to txBuf — make txBuf self-contained
    with its own lock, preparing for connLock separation.

  3. Release connLock before writing to connections — the core fix.
    Snapshot connections under connLock, release it, then write without
    holding connLock. ENOBUFS retries no longer block the entire switch.

  4. Move connLock into disconnect — simplify callers by letting
    disconnect manage its own locking.

Test Results

Setup

  • Host: macOS 26.2 (M2 Max)
  • VM: Ubuntu 25.04 via krunkit with TSO offloading (1 vCPU, 1024 MiB)
  • Network: gvproxy with --listen-vfkit unixgram socket
  • krunkit: v1.1.1 built with libkrun from
    virtio/net/unixgram: Retry on ENOBUFS libkrun#556
  • iperf3: version 3.20, server running inside the VM

All tests ran with no user workloads on the host. "Before" and "after" use the
same VM instance and the same krunkit; only the gvproxy binary is swapped.

Using more vCPUs gives better results since the VM has no workload
competing with the network stack for CPU time.

Host → VM (single stream, 60s)

iperf3 -c localhost -t 60 --json

Results: host-to-vm-before.json, host-to-vm-after.json

host-to-vm
| Metric | Before | After | Change |
|---|---|---|---|
| Avg bitrate | 1.67 Gbits/sec | 1.87 Gbits/sec | +12% |

Single stream, host to VM. The "after" line sits consistently above
"before" for the entire 60 seconds with no overlap.

This direction goes through gvproxy's gVisor TCP/IP stack, then
txPkt → txBuf → unixgram socket → VM. The improvement comes from
reduced lock contention: txPkt now holds connLock only for a brief
map lookup before releasing it, instead of holding it through the
entire txBuf write path.

VM → Host (single stream, 60s)

iperf3 -c localhost -R -t 60 --json

Results: vm-to-host-before.json, vm-to-host-after.json

vm-to-host
| Metric | Before | After | Change |
|---|---|---|---|
| Avg bitrate | 19.9 Gbits/sec | 20.0 Gbits/sec | ~same |

Single stream, VM to host (iperf3 -R). No meaningful difference.

This direction benefits from krunkit's TSO offloading — packets arrive
as 64 KiB datagrams, so per-packet lock overhead is negligible. The
locking fix has little impact when there's only one large packet to
process at a time.

Bidirectional (single stream each direction, 60s)

iperf3 -c localhost --bidir -t 60 --json

Results: bidir-before.json, bidir-after.json

bidir
| Metric | Before | After | Change |
|---|---|---|---|
| TX (Host → VM) | 0.87 Gbits/sec | 0.93 Gbits/sec | +7% |
| RX (VM → Host) | 11.8 Gbits/sec | 13.4 Gbits/sec | +13% |

Bidirectional test with one stream in each direction. Both directions
show clear, consistent improvement over the full 60 seconds.

The RX improvement (+13%) is larger than single-stream because under
bidirectional load, the TX path's ENOBUFS retry loops previously
blocked the RX path from entering txPkt (for broadcast forwarding).
With connLock released before txBuf, the RX goroutine passes through
txPkt without waiting for TX writes to complete.

Stress Test (8 streams bidir, zero-copy, 10 minutes)

iperf3 -c localhost --bidir -Z -P 8 -t 600 --json

Results: stress-before.json, stress-after.json

Aggregate throughput

stress-sum
| Metric | Before | After | Change |
|---|---|---|---|
| TX SUM (Host → VM) | 678 Mbits/sec | 695 Mbits/sec | +2.5% |
| RX SUM (VM → Host) | 29.8 Gbits/sec | 28.5 Gbits/sec | -4.5% |
| Retransmits | 945 | 249 | -74% |

Aggregate throughput is similar, but the "after" curve is notably
smoother and more consistent over the full 10 minutes. The "before"
curve oscillates as streams compete unfairly for locks.

The headline number is 74% fewer retransmits (945 → 249). Fewer
retransmits means fewer packets were dropped due to gvproxy stalling
under ENOBUFS pressure. This directly addresses the intermittent
container pull failures that motivated this work.

Per-stream fairness (VM → Host)

stress-rx-streams
| Metric | Before | After |
|---|---|---|
| Min stream | 0.884 Gbits/sec | 3.43 Gbits/sec |
| Max stream | 5.06 Gbits/sec | 3.64 Gbits/sec |
| Std dev (σ) | 1.89 | 0.09 |

This is the most striking result. Before: two streams are starved at
0.88 Gbits/sec (5.7x slower than the fastest stream), while other
streams dominate at 5+ Gbits/sec. After: all 8 streams run within
6% of each other.

Standard deviation dropped from 1.89 to 0.09, a 21x improvement in fairness.

With the old code, connLock was held through the ENOBUFS retry loop,
creating lock convoys where some streams consistently lost the race
for the lock. With connLock released before writes, all streams get
equal access to the write path.

Per-stream fairness (Host → VM)

stress-tx-streams
| Metric | Before | After |
|---|---|---|
| Min stream | 69.5 Mbits/sec | 64.5 Mbits/sec |
| Max stream | 93.0 Mbits/sec | 102.6 Mbits/sec |
| Std dev (σ) | 9.30 | 17.13 |

Host → VM streams show uneven distribution initially in both runs.
The "after" run converges to a more even distribution in the last
third of the test, while "before" remains spread throughout. The
absolute differences are small (65-103 Mbits/sec range) compared
to the RX direction.

Summary

| Test | Before | After | Change |
|---|---|---|---|
| Host → VM (60s) | 1.67 Gbits/sec | 1.87 Gbits/sec | +12% |
| VM → Host (60s) | 19.9 Gbits/sec | 20.0 Gbits/sec | ~same |
| Bidir TX (60s) | 0.87 Gbits/sec | 0.93 Gbits/sec | +7% |
| Bidir RX (60s) | 11.8 Gbits/sec | 13.4 Gbits/sec | +13% |
| Stress retransmits (10m) | 945 | 249 | -74% |
| Stress RX fairness (10m) | σ=1.89 | σ=0.09 | 21x fairer |

The locking changes improve throughput in the host → VM direction,
dramatically improve fairness under stress, and reduce packet loss
by 74%. The VM → Host direction (which benefits from TSO offloading)
is unaffected in single-stream tests.

Complete test results:
improved-locking.tar.gz

nirs and others added 4 commits February 19, 2026 23:11

When txBuf hits ENOBUFS, it retries in a tight loop while the caller
(txPkt) holds both writeLock and connLock. Since connLock guards the
connection map used by all rx and tx paths, this blocks all network
activity until the kernel frees buffer space - which depends on another
VM draining its socket.

To fix this we need to separate writeLock and connLock so txBuf can
retry without holding connLock. As a first step, move the disconnect
call out of txBuf into txPkt where connection lifecycle is managed.

txBuf was mixing two unrelated concerns: writing to a connection and
managing connection lifetime on error. This coupling forced txBuf to
take a connection id parameter only to pass it to disconnect, and
created an implicit lock ordering dependency (writeLock -> connLock
via disconnect) inside a function that should only handle writes.

Move disconnect to txPkt which already holds connLock and has the
connection context. txBuf becomes a pure write function with no
connection id parameter and no lock dependencies beyond writeLock.

This is a refactoring step toward separating connLock and writeLock
to prevent ENOBUFS retry loops from blocking all connections.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Nir Soffer <nirsof@gmail.com>

Move writeLock into txBuf so it is self-contained alongside the write
operation and stream protocol header prepending.

Previously writeLock was taken in txPkt before connLock, creating a
writeLock -> connLock ordering. Now the ordering is connLock -> writeLock
since txPkt holds connLock when calling txBuf. connLock is still held
during ENOBUFS retry loops in txBuf, blocking all connections.

This is a preparation step for releasing connLock before calling txBuf,
which will prevent ENOBUFS retry loops from blocking all connections.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Nir Soffer <nirsof@gmail.com>

Previously, txPkt() held connLock for the entire transmit path
including the calls to txBuf(). When txBuf() entered the ENOBUFS
retry loop, connLock was held for the entire duration of the loop.
This blocked any goroutine trying to acquire connLock, including:
- The rx goroutine forwarding broadcast/multicast packets via
  rxBuf -> tx -> txPkt (blocked on connLock)
- New VM connections via connect() (blocked on connLock)
- VM disconnections via Accept() defer (blocked on connLock)

In the single-VM case (e.g., podman on macOS), the rx goroutine
enters txPkt for every broadcast packet (ARP, DHCP). While it never
actually writes anything (the only connection is the source, so it
is skipped), it was blocked on connLock waiting for the ENOBUFS loop
to finish. This stalled the rx goroutine, preventing it from reading
packets from the socket, eventually filling gvproxy's receive buffer
and causing the peer (Virtualization.framework) to drop packets.

Release connLock before calling txBuf() in both paths:

- Broadcast: snapshot connections to a local slice under connLock,
  excluding the source connection. Release the lock and iterate
  outside it, calling txBuf() for each target.

- Unicast: copy one connection from the map under connLock, release
  the lock, then call txBuf(). If the connection was removed between
  the lookup and the write, the write fails and disconnect handles
  the cleanup.

On write errors, txPkt takes connLock around the disconnect() call
since disconnect() expects connLock to be held.

Lock operations per packet:

Before - connLock is nested inside writeLock for the entire path:
  writeLock
      connLock (nested, held during ENOBUFS retry)

After - normal case, no nesting:
  connLock (snapshot/lookup, released before write)
  writeLock (inside txBuf)

After - error case, connLock taken twice:
  connLock (snapshot/lookup)
  writeLock (inside txBuf, write fails)
  connLock (disconnect)

The error path takes connLock twice, but errors are rare and the
critical change is that connLock is never held during the ENOBUFS
retry loop.

Previously, connLock serialized all access to connections, so
Write() and Close() on the same connection could not race. With
connLock released before txBuf(), a concurrent disconnect (e.g.
from Accept's defer) can call conn.Close() while txBuf() is in a
Write() or ENOBUFS retry. This is safe because Go's net.Conn
interface guarantees concurrent method safety - Close() will cause
any in-progress Write() to return an error, which txBuf() will
propagate to the caller.

With this change, when txBuf() is blocked on ENOBUFS for a VM, the
broadcast flow in the single-VM case is:

1. rx goroutine reads broadcast from socket
2. rxBuf -> tx -> txPkt
3. txPkt takes connLock, snapshots connections excluding source,
   releases connLock
4. Source is the only connection, so targets is empty
5. txPkt returns immediately
6. rxBuf delivers broadcast to gVisor (second if block)
7. rx goroutine continues reading from socket

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Nir Soffer <nirsof@gmail.com>

disconnect() is called from three places, all of which take connLock
before calling it. Move connLock into disconnect so it manages its
own locking, simplifying all callers.

The Accept() defer is reduced from a 5-line closure to a single
defer statement. The two error paths in txPkt() no longer need
explicit Lock/Unlock around the call.

Note: disconnect still has a lock ordering dependency:
  connLock
      camLock (nested)

This could be eliminated by removing the defers and using explicit
unlock before taking the next lock, or by extracting smaller
functions for each lock's critical section. This is a pre-existing
issue, not introduced by this change.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Nir Soffer <nirsof@gmail.com>

openshift-ci bot commented Feb 20, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: nirs
Once this PR has been reviewed and has the lgtm label, please assign cfergeau for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


nirs commented Feb 20, 2026

/cc @cfergeau

