Fix version error report race: write directly before disconnect#167

Open
lance0 wants to merge 1 commit into bgp:master from lance0:fix/version-error-report

lance0 commented Mar 17, 2026

When a client connects with an unsupported protocol version (e.g., RTR v2 to a v1 server), SendWrongVersionError() puts the Error Report PDU on the async transmits channel, then Disconnect() immediately cancels the context. The sendLoop selects on both transmits and ctx.Done() — if the cancellation fires first, the error report never reaches the wire.
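The race can be reproduced in isolation. The following is a minimal sketch, not the project's actual code: `sendLoop` here is a simplified stand-in, and `countDropped` and the `wire` callback are hypothetical names introduced for illustration. It queues one PDU, cancels the context right away (as Disconnect() does), and counts how often the PDU is never written. When both `select` cases are ready, Go picks one pseudo-randomly, so some fraction of the queued PDUs is lost.

```go
package main

import (
	"context"
	"fmt"
)

// sendLoop is a simplified stand-in for the server's send loop: it forwards
// queued PDUs to the wire until the context is cancelled, and returns the
// number of PDUs it wrote.
func sendLoop(ctx context.Context, transmits <-chan []byte, wire func([]byte)) int {
	sent := 0
	for {
		select {
		case pdu := <-transmits:
			wire(pdu)
			sent++
		case <-ctx.Done():
			return sent
		}
	}
}

// countDropped (hypothetical helper) queues one PDU, cancels immediately,
// and reports in how many trials the PDU never reached the wire.
func countDropped(trials int) int {
	dropped := 0
	for i := 0; i < trials; i++ {
		ctx, cancel := context.WithCancel(context.Background())
		transmits := make(chan []byte, 1)
		transmits <- []byte{0x01} // placeholder bytes for the queued Error Report
		cancel()                  // context is cancelled before sendLoop runs
		if sendLoop(ctx, transmits, func([]byte) {}) == 0 {
			dropped++
		}
	}
	return dropped
}

func main() {
	fmt.Printf("dropped %d of 1000 queued PDUs\n", countDropped(1000))
}
```

The exact count varies from run to run; the point is only that it is neither zero nor all of them.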

As a result, RTR v2 clients never receive the Error Report with code 4 (Unsupported Protocol Version) defined in RFC 8210 section 12. Without it, clients cannot learn that they should downgrade, and they retry with v2 indefinitely.

Fix

Error Report version byte: The Error Report now carries PROTOCOL_VERSION_1 (the server's highest supported version) instead of echoing back the client's unsupported version. Per RFC 8210 section 7, the client uses this version field to know what to retry with.
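For reference, here is a rough sketch of what the Error Report looks like on the wire, following the PDU layout in RFC 8210 section 5.11. `encodeErrorReport` and the constant names are hypothetical and do not reflect this library's API; the client PDU is an assumed v2 Reset Query used purely as sample input.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	protocolVersion1      uint8  = 1  // version the server supports
	pduTypeErrorReport    uint8  = 10 // Error Report PDU type
	errUnsupportedVersion uint16 = 4  // Unsupported Protocol Version
)

// encodeErrorReport (hypothetical helper) builds an Error Report PDU:
// version, type, error code, total length, encapsulated PDU, error text.
func encodeErrorReport(version uint8, code uint16, badPDU []byte, text string) []byte {
	total := 8 + 4 + len(badPDU) + 4 + len(text)
	buf := make([]byte, total)
	buf[0] = version // the byte this patch changes
	buf[1] = pduTypeErrorReport
	binary.BigEndian.PutUint16(buf[2:4], code)
	binary.BigEndian.PutUint32(buf[4:8], uint32(total))
	binary.BigEndian.PutUint32(buf[8:12], uint32(len(badPDU)))
	copy(buf[12:], badPDU)
	off := 12 + len(badPDU)
	binary.BigEndian.PutUint32(buf[off:off+4], uint32(len(text)))
	copy(buf[off+4:], text)
	return buf
}

func main() {
	// Assumed sample input: a Reset Query sent with protocol version 2.
	clientPDU := []byte{2, 2, 0, 0, 0, 0, 0, 8}
	pdu := encodeErrorReport(protocolVersion1, errUnsupportedVersion,
		clientPDU, "unsupported protocol version")
	fmt.Printf("% x\n", pdu)
}
```

The first byte of the output is the server's version (1), while the encapsulated PDU still shows the client's original version (2).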

Delivery before disconnect: The patch keeps the existing SendWrongVersionError() async channel, which preserves write serialization for both TCP and SSH transports, and adds a brief sleep before Disconnect() to give sendLoop time to flush. The direct c.wr.Write() approach was unsafe for SSH transports (c.wr can be an ssh.Channel, which does not document concurrent write safety) and raced with sendLoop.

The sleep is pragmatic — a cleaner approach would be a synchronous flush on the transmits channel. Happy to iterate.
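One possible shape for that synchronous flush, sketched under assumptions: each queued item carries an optional acknowledgement channel, and the caller waits for the acknowledgement (with a timeout) before disconnecting. The `outbound` type, `sendAndWait`, `flushThenDisconnect`, and `errFlushTimeout` are hypothetical names; the real transmits channel may carry a different type.

```go
package main

import (
	"bytes"
	"context"
	"errors"
	"fmt"
	"io"
	"time"
)

var errFlushTimeout = errors.New("timed out waiting for flush")

// outbound is a hypothetical queue item; done is nil for fire-and-forget sends.
type outbound struct {
	data []byte
	done chan error
}

// sendLoop stays the only writer, so write serialization is preserved.
func sendLoop(ctx context.Context, transmits <-chan outbound, wr io.Writer) {
	for {
		select {
		case msg := <-transmits:
			_, err := wr.Write(msg.data)
			if msg.done != nil {
				msg.done <- err // buffered with capacity 1, never blocks
			}
		case <-ctx.Done():
			return
		}
	}
}

// sendAndWait queues data and blocks until sendLoop has written it or the
// timeout expires.
func sendAndWait(transmits chan<- outbound, data []byte, timeout time.Duration) error {
	msg := outbound{data: data, done: make(chan error, 1)}
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	select {
	case transmits <- msg:
	case <-timer.C:
		return errFlushTimeout
	}
	select {
	case err := <-msg.done:
		return err
	case <-timer.C:
		return errFlushTimeout
	}
}

// flushThenDisconnect shows the intended ordering: flush first, cancel second.
func flushThenDisconnect() (string, error) {
	ctx, cancel := context.WithCancel(context.Background())
	transmits := make(chan outbound, 16)
	var wire bytes.Buffer
	stopped := make(chan struct{})
	go func() { sendLoop(ctx, transmits, &wire); close(stopped) }()

	err := sendAndWait(transmits, []byte("error-report"), time.Second)
	cancel() // the equivalent of Disconnect()
	<-stopped
	return wire.String(), err
}

func main() {
	got, err := flushThenDisconnect()
	fmt.Printf("wrote %q, err=%v\n", got, err)
}
```

Compared with a fixed sleep, this waits only as long as the write takes and reports a write error or timeout to the caller instead of silently proceeding.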

Affected code paths

  • checkVersion() — first PDU from client has unsupported version
  • enforceVersion block in readLoop() — subsequent PDU has wrong version

Testing

  • All existing tests pass: go build ./..., go test ./..., go vet ./...
  • No existing tests cover the version rejection path — new tests for on-wire Error Report version and code would strengthen this but are not included in this patch
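As a starting point for such a test, the check could read the reply off the connection and assert on the version and error code. The sketch below is an assumption-laden illustration: `readErrorReportHeader` and `fakeServerReply` are hypothetical, and the "server" is a goroutine on a `net.Pipe` that writes a hand-built minimal Error Report rather than the real server under test.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"io"
	"net"
)

// errorReportHeader holds the first 8 bytes of an RTR PDU.
type errorReportHeader struct {
	Version uint8
	Type    uint8
	Code    uint16
	Length  uint32
}

// readErrorReportHeader (hypothetical helper) reads one PDU, checks that it
// is an Error Report, and discards the body.
func readErrorReportHeader(r io.Reader) (errorReportHeader, error) {
	var raw [8]byte
	if _, err := io.ReadFull(r, raw[:]); err != nil {
		return errorReportHeader{}, err
	}
	h := errorReportHeader{
		Version: raw[0],
		Type:    raw[1],
		Code:    binary.BigEndian.Uint16(raw[2:4]),
		Length:  binary.BigEndian.Uint32(raw[4:8]),
	}
	if h.Type != 10 {
		return h, fmt.Errorf("unexpected PDU type %d", h.Type)
	}
	if h.Length < 16 { // header + two 4-byte length fields
		return h, fmt.Errorf("error report too short: %d bytes", h.Length)
	}
	if _, err := io.CopyN(io.Discard, r, int64(h.Length)-8); err != nil {
		return h, err
	}
	return h, nil
}

// fakeServerReply stands in for the server: it writes a minimal Error Report
// (version 1, code 4, no encapsulated PDU, no text) and closes.
func fakeServerReply() (errorReportHeader, error) {
	client, server := net.Pipe()
	defer client.Close()
	go func() {
		defer server.Close()
		server.Write([]byte{1, 10, 0, 4, 0, 0, 0, 16, 0, 0, 0, 0, 0, 0, 0, 0})
	}()
	return readErrorReportHeader(client)
}

func main() {
	h, err := fakeServerReply()
	fmt.Printf("version=%d code=%d err=%v\n", h.Version, h.Code, err)
}
```

In a real test, the goroutine would be replaced by the server's connection handler fed a v2 PDU, with the same assertions on `Version` and `Code`.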

Context

Discovered via interop testing with a Rust RTR client (rustbgpd). Client-side workaround: lance0/rustbgpd@c1d0296

Fixes #165

When a client connects with an unsupported protocol version,
SendWrongVersionError() puts the Error Report PDU on the async
transmits channel, then Disconnect() immediately cancels the context.
The send loop races against the cancellation — if ctx.Done() fires
first, the error report never reaches the wire.

Two fixes:
1. Use the server's supported version (not the client's unsupported
   version) in the Error Report, per RFC 8210 §7. In checkVersion(),
   set c.version to PROTOCOL_VERSION_1 before calling
   SendWrongVersionError() so the PDU carries the version the client
   should downgrade to. In the enforceVersion path, c.version already
   holds the server's base version.

2. Add a brief sleep between SendWrongVersionError() and Disconnect()
   to give sendLoop time to drain the error report from the transmits
   channel. This preserves write serialization through the existing
   channel (safe for both TCP and SSH transports) rather than
   bypassing it with a direct write.

Fixes bgp#165
lance0 force-pushed the fix/version-error-report branch from 6847f1c to 9499860 on March 17, 2026 15:06

Development

Successfully merging this pull request may close these issues.

RTR v2 clients cannot negotiate down to v1 — connection closed without Error Report
