Simplify hasHealthyEndpoint logic and fix edge cases with reused IPs#2908
Simplify hasHealthyEndpoint logic and fix edge cases with reused IPs#2908oribon merged 2 commits intometallb:mainfrom
Conversation
20f507f to
cac19a5
Compare
|
I've pushed a second commit to fix the CI failures. The |
|
sorry for the late reply! I haven't gone over the exact changes, but ccing @fedepaol to also give this a look as he has done most (or all) of the work regarding how we treat epslices |
speaker/bgp_controller.go
Outdated
| // Only set true if nothing else has expressed an | ||
| // opinion. This means that false will take precedence | ||
| // if there's any unready ports for a given endpoint. | ||
| if epslices.EndpointCanServe(ep.Conditions) { |
There was a problem hiding this comment.
this is hard to follow compared to returning early. The semantic here is, if at least one endpoint with this address is ready, then let's consider it ready.
So, I'd flip the logic to
if if _, ok := ready[addr]; ok { // continue, in case of multiple endpoint slices with the same address, one is enough
continue
}
if epslices.EndpointCanServe() {
ready[addr] = true
}
ready[addr] = falseThis is way more explicit and easy to understand imo
There was a problem hiding this comment.
Thanks for the feedback! I've refactored hasHealthyEndpoint to correctly handle reused IPs as suggested.
|
I'd add a unit test for this scenario. |
I also added a new unit test |
c173d0a to
eb29b7b
Compare
There was a problem hiding this comment.
Pull request overview
Refactors the BGP controller’s endpoint health evaluation to correctly handle EndpointSlice scenarios where multiple endpoints share the same IP (e.g., during KubeVirt VM migrations), and updates/extends tests to reflect the corrected behavior.
Changes:
- Updates
hasHealthyEndpointto treat an IP as healthy if any endpoint for that IP can serve (so ready endpoints “win” over unready ones for the same address). - Fixes the expected advertisement behavior in
TestBGPSpeakerEPSlicesfor the reused-IP/unready+ready mix case. - Adds a focused unit test suite for
hasHealthyEndpointcovering duplicate-IP combinations.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
speaker/bgp_controller.go |
Adjusts endpoint health evaluation to avoid false negatives when the same endpoint IP appears with mixed readiness. |
speaker/bgp_controller_test.go |
Updates an existing EPSlices test expectation and adds a dedicated hasHealthyEndpoint unit test matrix. |
Comments suppressed due to low confidence (1)
speaker/bgp_controller.go:161
hasHealthyEndpointreturns a boolean (“any healthy endpoint exists”), so the per-addressreadymap plus the final loop are redundant and add extra allocation/complexity. Consider simplifying to an early return on the first endpoint that can serve after node filtering (or accumulate into a single boolean), which preserves the reused-IP behavior.
func hasHealthyEndpoint(eps []discovery.EndpointSlice, filterNode func(*string) bool) bool {
ready := map[string]bool{}
for _, slice := range eps {
for _, ep := range slice.Endpoints {
node := ep.NodeName
if filterNode(node) {
continue
}
for _, addr := range ep.Addresses {
if ready[addr] {
continue
}
ready[addr] = epslices.EndpointCanServe(ep.Conditions)
}
}
}
for _, r := range ready {
if r {
return true
}
}
return false
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
speaker/bgp_controller.go
Outdated
| if ready[addr] { | ||
| continue | ||
| } | ||
| ready[addr] = epslices.EndpointCanServe(ep.Conditions) |
There was a problem hiding this comment.
The filterNode predicate is used as a skip function (if filterNode(node) { continue }), but the naming/commenting makes it read like it selects nodes to include. Consider renaming the parameter (e.g. skipNode/filterOutNode) or inverting the predicate for clarity so future edits don’t accidentally flip the logic.
There was a problem hiding this comment.
This is existing code and outside the scope of this PR, but I can fix it if maintainers prefer.
|
"Hello @oribon! I've updated PR (fixed merge conflicts). It’s been open for a while, just wanted to make sure it hasn't fallen off the radar. Would appreciate a review when possible!" |
|
@tarabrind can you please squash the commits? |
|
@tarabrind can you please squash the commits? |
eb29b7b to
55bd809
Compare
Done! I've squashed the commits and force-pushed the changes. |
|
|
||
| // hasHealthyEndpoint return true if this node has at least one healthy endpoint. | ||
| // It only checks nodes matching the given filterNode function. | ||
| func hasHealthyEndpoint(eps []discovery.EndpointSlice, filterNode func(*string) bool) bool { |
There was a problem hiding this comment.
nit: sorry if I'm missing something, but can this function be simplified further to:
func hasHealthyEndpoint(eps []discovery.EndpointSlice, filterNode func(*string) bool) bool {
for _, slice := range eps {
for _, ep := range slice.Endpoints {
node := ep.NodeName
if filterNode(node) {
continue
}
if epslices.EndpointCanServe(ep.Conditions) {
return true
}
}
}
return false
}not sure I understand why the ready map is needed with this new behavior
There was a problem hiding this comment.
not sure I understand why the ready map is needed with this new behavior
Thanks for the suggestion! That makes perfect sense.
I used a version very similar to your code snippet. The only minor change I made was adding the check len(ep.Addresses) > 0:
if len(ep.Addresses) > 0 && epslices.EndpointCanServe(ep.Conditions) {
return true
}In the previous implementation, the for _, addr := range ep.Addresses loop implicitly skipped all endpoints that had not yet been assigned IP addresses. By adding len(ep.Addresses) > 0, we ensure that we maintain exactly the same behavior and do not accidentally treat an endpoint without IP addresses as a valid target.
I squashed the commits.
Signed-off-by: Denis Tarabrin <denis.tarabrin@flant.com>
55bd809 to
22f0810
Compare
|
lgtm, thanks! |
Is this a BUG FIX or a FEATURE ?:
/kind bug
What this PR does / why we need it:
In environments like KubeVirt, virtual machines migrating between nodes can lead to multiple endpoints sharing the same IP address. For example, during migration, an old pod might stay in a
Completedstate on the source node while a new one isRunningon the destination node, both sharing the same IP.The previous implementation of
hasHealthyEndpointinbgp_controller.gowas problematic because it tracked readiness via an intermediate map keyed by the endpoint's IP address (ready := map[string]bool{}). This led to redundant logic (readiness belongs to the Endpoint, not the IP) and potential fragility when IPs are reused across different pod lifecycles.Real-world Scenario:
Here is an example of such a state from a live cluster:
Pods:
EndpointSlice entries:
This PR refactors
hasHealthyEndpointto use a simplified map-based approach that groups readiness by IP address. This ensures that:TestBGPSpeakerEPSlices("Endpoint on our node has some unready ports") that was asserting an incorrect behavior for the KubeVirt scenario (it expected no announcement if ANY endpoint for an IP was unready).Special notes for your reviewer:
While this refactoring aligns the BGP controller's intent with the Layer2 controller's
activeEndpointExists, I intentionally kept a simplified map-based approach instead of a pure early return. This was done to:TestBGPSpeakerEPSlices), which expects per-IP readiness evaluation.Release note: