Skip to content

scripts: fix flaky backwards compat test graph sync#10664

Open
ellemouton wants to merge 1 commit intolightningnetwork:masterfrom
ellemouton:fix-bw-compat-graph-sync
Open

scripts: fix flaky backwards compat test graph sync#10664
ellemouton wants to merge 1 commit intolightningnetwork:masterfrom
ellemouton:fix-bw-compat-graph-sync

Conversation

@ellemouton
Copy link
Copy Markdown
Collaborator

Summary

Fix an intermittent NO_ROUTE failure in the backwards compatibility test
caused by a race condition in wait_graph_sync.

The function previously only checked that the expected number of channels
appeared in getnetworkinfo. However, a channel can be visible in the graph
before both channel_update messages have arrived — meaning node1_policy
or node2_policy may still be null. When this happens, the pathfinder
cannot construct a complete route and payments fail with NO_ROUTE.

This updates wait_graph_sync to also call describegraph and verify that
every edge has both direction policies populated before declaring the graph
fully synced.

Changes

  • scripts/bw-compatibility-test/network.sh: After confirming the channel
    count matches, additionally check that no edge has a null policy via
    describegraph | jq. The loop continues polling until all policies are
    present.

Test plan

  • make backwards-compat-test passes locally
  • The fix addresses a CI-observed flake where send_payment alice dave
    failed with FAILURE_REASON_NO_ROUTE before any node upgrade occurred

The wait_graph_sync function previously only checked that the expected
number of channels appeared in getnetworkinfo. However, a channel can
be visible in the graph before both channel_update messages have
arrived, meaning node1_policy or node2_policy may still be null. When
this happens the pathfinder cannot construct a route, causing
intermittent NO_ROUTE failures in CI.

Update wait_graph_sync to also call describegraph and verify that every
edge has both direction policies populated before declaring the graph
fully synced.
@gemini-code-assist
Copy link
Copy Markdown

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a persistent flakiness in the backwards compatibility test suite by refining the network synchronization logic. The change ensures that the test environment's Lightning Network graph is truly ready for routing by verifying the completeness of channel policies, thereby eliminating race conditions that previously led to NO_ROUTE errors and improving test reliability.

Highlights

  • Flaky Test Fix: Resolved an intermittent NO_ROUTE failure in the backwards compatibility test, which was caused by a race condition during graph synchronization.
  • Improved Graph Synchronization: The wait_graph_sync function now ensures that not only the expected number of channels are present, but also that all channel policies (node1_policy and node2_policy) are fully populated before declaring the graph synced. This prevents issues where channels are visible but not fully routable.
  • Policy Verification: Introduced a check using describegraph and jq to verify that no edge has a null policy, indicating that channel_update messages have been fully processed.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the wait_graph_sync function in network.sh to ensure that not only the expected number of channels are present, but also that all channels have both direction policies populated, preventing NO_ROUTE failures. The updated logic now queries describegraph to check for missing policies. A review comment points out a potential race condition where describegraph might return an empty object or null edges, causing the loop to break prematurely, and suggests adding a check for the total number of edges.

Comment on lines +199 to 204
missing=$($node describegraph | jq '[.edges[] | select(.node1_policy == null or .node2_policy == null)] | length')

if [[ "$missing" -eq 0 ]]; then
echo "👀 $node sees all $num_chans channels with full policies!"
break
fi
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

There's a potential race condition here. If describegraph returns an empty object ({}) or {"edges": null} because it hasn't updated yet, the jq command will result in missing being 0. This would cause the loop to break prematurely, even though the graph is not fully synced.

To make this more robust, we should also verify that the number of edges in the graph matches the expected number of channels before breaking the loop.

Suggested change
missing=$($node describegraph | jq '[.edges[] | select(.node1_policy == null or .node2_policy == null)] | length')
if [[ "$missing" -eq 0 ]]; then
echo "👀 $node sees all $num_chans channels with full policies!"
break
fi
graph_json=$($node describegraph)
num_edges=$(echo "$graph_json" | jq '.edges | length // 0')
missing=$(echo "$graph_json" | jq '[.edges[] | select(.node1_policy == null or .node2_policy == null)] | length')
if [[ $num_edges -eq $num_chans && $missing -eq 0 ]]; then
echo "👀 $node sees all $num_chans channels with full policies!"
break
fi

@ziggie1984
Copy link
Copy Markdown
Collaborator

I am not sure this will fix the problem but I analysed of on the logs setss and the problem was the following:

 Short explanation

  This looks like a gossip sync race on the serving peer.

  Bob had already received both channel_updates for channel 124244814135296, but when Alice queried Bob for that channel, only one update was queryable from Bob’s graph state. Alice
  received that one update, then immediately advanced its gossip filter to a newer timestamp. When Bob later finished processing the missing update, its timestamp was already older than
  Alice’s new filter, so Bob never sent it again.

  Minimal diagram

  Charlie/Dave -> Bob:    Bob receives both channel_updates
  Alice -> Bob:           asks for gossip for channel 124244814135296
  Bob -> Alice:           returns channel_announcement + only 1 channel_update
  Alice:                  sets gossip filter to a newer time
  Bob:                    finishes processing the missing channel_update
  Later sync:             missing update is older than Alice's filter
  Result:                 Alice never gets it

  Timeline

  t1  Bob receives update A and update B
  t2  Alice queries Bob
  t3  Bob can serve only update A
  t4  Alice moves gossip filter forward
  t5  Bob finishes update B
  t6  update B is now behind Alice's filter

  Effect

  Alice graph for SCID 124244814135296:
  - one direction present
  - one direction missing

  => nil policy on one side
  => route construction fails

  Key point

  The race is not primarily “Alice received it but didn’t persist it”.
  It is:

  Bob received it
  but Bob could not serve it yet
  and Alice advanced the filter before Bob could replay it

@lightninglabs-deploy
Copy link
Copy Markdown
Collaborator

@ellemouton, remember to re-request review from reviewers when ready

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants