peer/server: Split handshake to synchronous func.#3632

Open
davecgh wants to merge 7 commits into decred:master from davecgh:peer_sync_handshake

Conversation

Member

@davecgh davecgh commented Mar 3, 2026

This requires #3631.

The current design where the handshake happens asynchronously when the async I/O is started is less than ideal and is quite brittle. It also significantly complicates everything as evidenced by several minor bugs over the years that have resulted from faulty assumptions which directly stem from its asynchronous nature.

For an example of some of the complexity it causes, it means that a bunch of additional flags are required that solely relate to the handshake. Namely, whether or not the version is known, whether the verack has been received, and whether the handshake is done. Then, because it's all happening asynchronously, later code has to be vigilant about checking that those events have happened.

All of this complexity can entirely be avoided by simply requiring a successful synchronous handshake to take place prior to starting async I/O.

With that in mind, this significantly reworks the way the handshake is handled so that it happens via a separate blocking method and removes async handlers which are no longer required as a result.

The changes have been split into a series of commits to ease the review. Each commit fully compiles, passes all tests, and describes the changes in detail.

The following is a high level overview of the changes:

  • Introduce programmatically detectable errors consistent with other code throughout the repository
  • Accept connection in constructors instead of a separate AssociateConnection
  • Move the handshake code to a separate blocking method named Handshake that accepts a callback to invoke with the
    received version message
    • The new method returns an error that callers can use to reliably detect a failed handshake
    • The callback can return an error to cause the handshake to fail and pass the error along to the caller
  • Make the initial handshake block until both the version and verack message are received
  • Introduce delayed processing for up to 3 messages sent between the version and verack message on old protocol versions
  • Any further received version or verack messages in the async I/O handlers are now unconditionally an error
  • Removes the OnVersion and OnVerAck async listeners that no longer apply
  • Removes the VersionKnown, VerAckReceived, and HandshakeDone methods and associated internal fields that no longer apply
  • Updates the calling server code to thread the overall process context down to the Handshake and Run methods
  • Adds several additional tests for correctness
  • Updates the example to clearly show the new semantics
  • Includes extra documentation to elucidate the exact requirements for establishing a new peer as well as exactly which properties the caller can and can't rely on during the handshake
  • Other docs and test cleanup to help make them a little more modern

@davecgh davecgh added this to the 2.2.0 milestone Mar 3, 2026
@davecgh davecgh force-pushed the peer_sync_handshake branch 3 times, most recently from 147855d to 6bdd445 Compare March 4, 2026 01:36
@davecgh davecgh force-pushed the peer_sync_handshake branch 2 times, most recently from 488cb21 to 1daea09 Compare March 6, 2026 18:29
return nil, 0, err
}
return hash, 234439, nil
// Repeat, but in the other direction so the outbound peer has the error.
Member

What's the benefit of repeating? Isn't this just testing the same code path twice?

Member Author

@davecgh davecgh Mar 10, 2026

It is calling the same funcs, but in a different order. That isn't at all obvious, especially at this point in the sequence of commits since the handshake is all happening asynchronously. I think maybe it is more obvious in a later commit when it switches over to the synchronous handshake.

The outbound peer always goes first. So, in the case of the first sequence, the inbound peer is already established and blocking on readRemoteVersionMsg until the outbound peer sends the version message, at which point the handshake process proceeds and the inbound peer ends up failing to produce its own local version message.

In the other direction, the peer fails when attempting to go first (via writeLocalVersionMsg).

So, in other words, it forces them both to fail for slightly different reasons depending on which one is going first.

@davecgh davecgh force-pushed the peer_sync_handshake branch 2 times, most recently from 26d472b to d9eeafe Compare March 10, 2026 01:49
Member Author

@davecgh davecgh left a comment

Thanks for the thorough review. I've addressed most of the feedback. I'll address the rest a bit later.

@davecgh davecgh force-pushed the peer_sync_handshake branch 3 times, most recently from 8fed4dd to 4bd63a7 Compare March 10, 2026 05:18
Member Author

davecgh commented Mar 14, 2026

I was doing a bit more testing with this and I noticed that a lot of the existing peers are unfortunately not following the handshake protocol correctly 100% of the time, so I'm probably going to have to do some protocol version gating. This is a perfect example of why this change is ultimately such a desirable one:

2026-03-14 05:26:33.419 [DBG] CMGR: Connected to <peeripv4>:9108 (id: 341)
2026-03-14 05:26:33.419 [DBG] PEER: Sending version (agent /dcrwire:1.0.0/dcrd:2.2.0(pre)/, pver 12, block 1062261) to <peeripv4>:9108 (outbound)
...
2026-03-14 05:26:33.649 [DBG] PEER: Received version (agent /dcrwire:1.0.0/dcrd:2.1.2/, pver 11, block 1062262) from <peeripv4>:9108 (outbound)
2026-03-14 05:26:33.649 [DBG] PEER: Negotiated protocol version 11 for peer <peeripv4>:9108 (outbound)
2026-03-14 05:26:33.650 [DBG] PEER: Received getheaders (locator 536c8fffe3ab2fa8c89a70ebfe0ba84f7798e90a1d0bd86b2039223b2f446e62, stop 0000000000000000000000000000000000000000000000000000000000000000) from <peeripv4>:9108 (outbound)
2026-03-14 05:26:33.651 [DBG] CMGR: Disconnected from <peeripv4>:9108 (id: 341)
...
2026-03-14 05:26:33.651 [DBG] SRVR: Failed handshake for outbound peer <peeripv4>:9108: the verack message must follow the version message and precede all others

@davecgh davecgh force-pushed the peer_sync_handshake branch from 4bd63a7 to 6d68148 Compare March 14, 2026 12:29
Member Author

davecgh commented Mar 14, 2026

> I was doing a bit more testing with this and I noticed that a lot of the existing peers are unfortunately not following the handshake protocol correctly 100% of the time, so I'm probably going to have to do some protocol version gating. This is a perfect example of why this change is ultimately such a desirable one:

This is now resolved. I reworked the logic for older protocol versions to allow for up to 3 additional messages that are not the verack during the handshake, but rather than processing them normally, it stores them and delays their processing until the handshake completes and the async I/O is started.

I added an additional commit, "peer: Refactor inbound message processing.", that refactors the primary inbound message processing that checks requirements, updates state, and invokes any configured message handlers into a separate method to support that.

I chose 3 based on observing a bunch of peers and examining the possible messages that can be sent during the handshake. In practice, they are getheaders and getinitstate. Thus, 3 provides one additional message as a buffer.

@davecgh davecgh force-pushed the peer_sync_handshake branch 2 times, most recently from e666d97 to 477d3be Compare March 14, 2026 15:50
davecgh added 7 commits March 17, 2026 01:07
This corrects the version in README.md to the most recent released
version and brings the documentation in doc.go to more modern standards.

This does some basic test cleanup and modernizes some of the peer tests
as follows:

- Consolidates the mock peer config used throughout the tests
- Consolidates and simplifies the mock pipe creation
- Marks peer state tests as a helper
- Uses t.Fatalf where appropriate
- Removes additional newlines in failure strings

The majority of the tests in TestOutboundPeer are not actually testing
anything because nothing is checked.  This moves the one thing that is
being tested into a separate test func and removes the rest since it is
already tested elsewhere.

This refactors the primary inbound message processing that checks
requirements, updates state, and invokes any configured message handlers
into a separate method.

This is primarily being done to support an upcoming change that will
need to make use of the same logic before the main read loop.

Due to legacy reasons that no longer apply, connections are currently
associated with a peer after the constructors have been called via
AssociateConnection.

This modifies the code to instead accept the connections in the inbound
and outbound constructors and exports the Start method in its place.

Ultimately, the goal is to split the handshake into a separate method
and convert the lifecycle over to use contexts.

The current design where the handshake happens asynchronously when the
async I/O is started is less than ideal and is quite brittle.  It also
significantly complicates everything as evidenced by several minor bugs
over the years that have resulted from faulty assumptions which directly
stem from its asynchronous nature.

For an example of some of the complexity it causes, it means that a
bunch of additional flags are required that solely relate to the
handshake.  Namely, whether or not the version is known, whether the
verack has been received, and whether the handshake is done.  Then,
because it's all happening asynchronously, later code has to be vigilant
about checking that those events have happened.

All of this complexity can entirely be avoided by simply requiring a
successful synchronous handshake to take place prior to starting async
I/O.

With that in mind, this significantly reworks the way the handshake is
handled so that it happens via a separate blocking method and removes
async handlers which are no longer required as a result.

The following is a high level overview of the changes:

- Introduce programmatically detectable errors consistent with other
  code throughout the repository
- Move the handshake code to a separate blocking method named Handshake
  that accepts a callback to invoke with the received version message
  - The new method returns an error that callers can use to reliably
    detect a failed handshake
  - The callback can return an error to cause the handshake to fail
    and pass the error along to the caller
- Make the initial handshake block until both the version and verack
  message are received
- Introduce delayed processing for up to 3 messages sent between the
  version and verack message on old protocol versions
- Any further received version or verack messages in the async I/O
  handlers are now unconditionally an error
- Removes the OnVersion and OnVerAck async listeners that no longer apply
- Updates the calling server code to thread the overall process context
  down to the Handshake and Run methods
- Adds several additional tests for correctness
- Updates the example to clearly show the new semantics
- Includes extra documentation to elucidate the exact requirements for
  establishing a new peer as well as exactly which properties the caller
  can and can't rely on during the handshake

Now that the handshake is required to take place prior to starting
async i/o processing, the version and verack messages are guaranteed to
have been seen for a successful handshake.

Given that, this removes the related fields and methods since they are
no longer needed.
@davecgh davecgh force-pushed the peer_sync_handshake branch from 477d3be to 690479f Compare March 17, 2026 06:07