scripts: fix flaky backwards compat test graph sync #10664
ellemouton wants to merge 1 commit into lightningnetwork:master
The wait_graph_sync function previously only checked that the expected number of channels appeared in getnetworkinfo. However, a channel can be visible in the graph before both channel_update messages have arrived, meaning node1_policy or node2_policy may still be null. When this happens the pathfinder cannot construct a route, causing intermittent NO_ROUTE failures in CI. Update wait_graph_sync to also call describegraph and verify that every edge has both direction policies populated before declaring the graph fully synced.
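To make the failure mode concrete, here is a minimal sketch of how a half-announced edge could be spotted in describegraph output. It assumes jq is installed and that the JSON is shaped like lncli describegraph output (an edges array whose entries carry channel_id, node1_policy and node2_policy). The helper name incomplete_edges is hypothetical and is not part of network.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of network.sh): reads describegraph-style
# JSON on stdin and prints the channel_id of every edge that is still
# missing one of its two direction policies.
incomplete_edges() {
    # `.edges[]?` yields nothing (instead of erroring) when edges is absent.
    jq -r '.edges[]?
           | select(.node1_policy == null or .node2_policy == null)
           | .channel_id'
}
```

Piping a node's describegraph output into incomplete_edges while a test is stuck would show which channels are still waiting for a channel_update.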
Summary of Changes (Gemini Code Assist)
This pull request addresses a persistent flakiness in the backwards compatibility test suite by refining the network synchronization logic. The change ensures that the test environment's Lightning Network graph is truly ready for routing by verifying the completeness of channel policies, thereby eliminating race conditions that previously led to intermittent test failures.
Code Review
This pull request updates the wait_graph_sync function in network.sh to ensure that not only the expected number of channels are present, but also that all channels have both direction policies populated, preventing NO_ROUTE failures. The updated logic now queries describegraph to check for missing policies. A review comment points out a potential race condition where describegraph might return an empty object or null edges, causing the loop to break prematurely, and suggests adding a check for the total number of edges.
    missing=$($node describegraph | jq '[.edges[] | select(.node1_policy == null or .node2_policy == null)] | length')

    if [[ "$missing" -eq 0 ]]; then
        echo "👀 $node sees all $num_chans channels with full policies!"
        break
    fi
There's a potential race condition here. If describegraph returns an empty object ({}) or {"edges": null} because it hasn't updated yet, the jq command will result in missing being 0. This would cause the loop to break prematurely, even though the graph is not fully synced.
To make this more robust, we should also verify that the number of edges in the graph matches the expected number of channels before breaking the loop.
    - missing=$($node describegraph | jq '[.edges[] | select(.node1_policy == null or .node2_policy == null)] | length')
    - if [[ "$missing" -eq 0 ]]; then
    -     echo "👀 $node sees all $num_chans channels with full policies!"
    -     break
    - fi
    + graph_json=$($node describegraph)
    + num_edges=$(echo "$graph_json" | jq '.edges | length // 0')
    + missing=$(echo "$graph_json" | jq '[.edges[] | select(.node1_policy == null or .node2_policy == null)] | length')
    + if [[ $num_edges -eq $num_chans && $missing -eq 0 ]]; then
    +     echo "👀 $node sees all $num_chans channels with full policies!"
    +     break
    + fi
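One detail is worth checking in isolation. As far as I can tell, jq raises an error when .edges[] iterates over null rather than printing 0. That leaves the shell variable empty, and bash treats an empty string as 0 in an -eq comparison, so the early exit still happens. Below is a minimal sketch of a null-safe variant, assuming jq is installed. graph_ready is a hypothetical helper for illustration, not the function in network.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds only when the describegraph-style JSON in $1
# has exactly $2 edges and none of them lacks a direction policy.
graph_ready() {
    local graph_json=$1 want=$2 num_edges missing

    # `(.edges // [])` turns a missing or null edge list into an empty array,
    # so both filters print 0 instead of erroring out.
    num_edges=$(jq '(.edges // []) | length' <<<"$graph_json") || return 1
    missing=$(jq '[(.edges // [])[]
                   | select(.node1_policy == null or .node2_policy == null)]
                  | length' <<<"$graph_json") || return 1

    [[ "$num_edges" -eq "$want" && "$missing" -eq 0 ]]
}
```

With this shape, an empty or not-yet-populated graph fails the edge-count comparison instead of slipping through the missing-policy check.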
I am not sure this will fix the problem, but I analysed one of the log sets and the problem was the following:

@ellemouton, remember to re-request review from reviewers when ready
Summary
Fix an intermittent NO_ROUTE failure in the backwards compatibility test caused by a race condition in wait_graph_sync.

The function previously only checked that the expected number of channels appeared in getnetworkinfo. However, a channel can be visible in the graph before both channel_update messages have arrived, meaning node1_policy or node2_policy may still be null. When this happens, the pathfinder cannot construct a complete route and payments fail with NO_ROUTE.

This updates wait_graph_sync to also call describegraph and verify that every edge has both direction policies populated before declaring the graph fully synced.
Changes
scripts/bw-compatibility-test/network.sh: After confirming the channel count matches, additionally check that no edge has a null policy via describegraph | jq. The loop continues polling until all policies are present.
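The excerpts in this thread do not show whether the polling loop is bounded. If it is not, one way to keep a stuck sync from hanging CI is to wrap the check in a retry helper with an attempt budget. This is only a sketch: wait_until and its parameters are hypothetical and do not come from network.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper: re-run a command until it succeeds or the attempt
# budget is spent. Returns 0 on success, 1 if it never succeeded.
#   $1 = maximum number of attempts
#   $2 = seconds to sleep between attempts
#   remaining arguments = the command (or shell function) to run
wait_until() {
    local max_attempts=$1 delay=$2
    shift 2

    local attempt
    for ((attempt = 1; attempt <= max_attempts; attempt++)); do
        if "$@"; then
            return 0
        fi
        sleep "$delay"
    done
    return 1
}
```

A call such as wait_until 60 1 some_check would give the check up to 60 attempts one second apart, then return non-zero so the test fails with a clear sync timeout rather than a later NO_ROUTE.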
Test plan
make backwards-compat-test passes locally
send_payment alice dave failed with FAILURE_REASON_NO_ROUTE before any node upgrade occurred