scripts: fix flaky backwards compat test graph sync by ellemouton · Pull Request #10664 · lightningnetwork/lnd

ellemouton · 2026-03-25T13:12:49Z

Summary

Fix an intermittent NO_ROUTE failure in the backwards compatibility test
caused by a race condition in wait_graph_sync.

The function previously only checked that the expected number of channels
appeared in getnetworkinfo. However, a channel can be visible in the graph
before both channel_update messages have arrived — meaning node1_policy
or node2_policy may still be null. When this happens, the pathfinder
cannot construct a complete route and payments fail with NO_ROUTE.

This updates wait_graph_sync to also call describegraph and verify that
every edge has both direction policies populated before declaring the graph
fully synced.

Changes

scripts/bw-compatibility-test/network.sh: After confirming the channel
count matches, additionally check that no edge has a null policy via
describegraph | jq. The loop continues polling until all policies are
present.

Test plan

make backwards-compat-test passes locally
The fix addresses a CI-observed flake where send_payment alice dave
failed with FAILURE_REASON_NO_ROUTE before any node upgrade occurred

The wait_graph_sync function previously only checked that the expected number of channels appeared in getnetworkinfo. However, a channel can be visible in the graph before both channel_update messages have arrived, meaning node1_policy or node2_policy may still be null. When this happens the pathfinder cannot construct a route, causing intermittent NO_ROUTE failures in CI. Update wait_graph_sync to also call describegraph and verify that every edge has both direction policies populated before declaring the graph fully synced.

gemini-code-assist · 2026-03-25T13:14:09Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a persistent flakiness in the backwards compatibility test suite by refining the network synchronization logic. The change ensures that the test environment's Lightning Network graph is truly ready for routing by verifying the completeness of channel policies, thereby eliminating race conditions that previously led to NO_ROUTE errors and improving test reliability.

Highlights

Flaky Test Fix: Resolved an intermittent NO_ROUTE failure in the backwards compatibility test, which was caused by a race condition during graph synchronization.
Improved Graph Synchronization: The wait_graph_sync function now ensures that not only the expected number of channels are present, but also that all channel policies (node1_policy and node2_policy) are fully populated before declaring the graph synced. This prevents issues where channels are visible but not fully routable.
Policy Verification: Introduced a check using describegraph and jq to verify that no edge has a null policy, indicating that channel_update messages have been fully processed.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request updates the wait_graph_sync function in network.sh to ensure that not only the expected number of channels are present, but also that all channels have both direction policies populated, preventing NO_ROUTE failures. The updated logic now queries describegraph to check for missing policies. A review comment points out a potential race condition where describegraph might return an empty object or null edges, causing the loop to break prematurely, and suggests adding a check for the total number of edges.

gemini-code-assist · 2026-03-25T13:18:17Z

scripts/bw-compatibility-test/network.sh

+      missing=$($node describegraph | jq '[.edges[] | select(.node1_policy == null or .node2_policy == null)] | length')
+
+      if [[ "$missing" -eq 0 ]]; then
+        echo "👀 $node sees all $num_chans channels with full policies!"
+        break
      fi


There's a potential race condition here. If describegraph returns an empty object ({}) or {"edges": null} because it hasn't updated yet, the jq command will result in missing being 0. This would cause the loop to break prematurely, even though the graph is not fully synced.

To make this more robust, we should also verify that the number of edges in the graph matches the expected number of channels before breaking the loop.

Suggested change

missing=$($node describegraph | jq '[.edges[] | select(.node1_policy == null or .node2_policy == null)] | length')

if [[ "$missing" -eq 0 ]]; then

echo "👀 $node sees all $num_chans channels with full policies!"

break

fi

graph_json=$($node describegraph)

num_edges=$(echo "$graph_json" | jq '.edges | length // 0')

missing=$(echo "$graph_json" | jq '[.edges[] | select(.node1_policy == null or .node2_policy == null)] | length')

if [[ $num_edges -eq $num_chans && $missing -eq 0 ]]; then

echo "👀 $node sees all $num_chans channels with full policies!"

break

fi

ziggie1984 · 2026-03-25T14:01:01Z

I am not sure this will fix the problem but I analysed of on the logs setss and the problem was the following:

 Short explanation

  This looks like a gossip sync race on the serving peer.

  Bob had already received both channel_updates for channel 124244814135296, but when Alice queried Bob for that channel, only one update was queryable from Bob’s graph state. Alice
  received that one update, then immediately advanced its gossip filter to a newer timestamp. When Bob later finished processing the missing update, its timestamp was already older than
  Alice’s new filter, so Bob never sent it again.

  Minimal diagram

  Charlie/Dave -> Bob:    Bob receives both channel_updates
  Alice -> Bob:           asks for gossip for channel 124244814135296
  Bob -> Alice:           returns channel_announcement + only 1 channel_update
  Alice:                  sets gossip filter to a newer time
  Bob:                    finishes processing the missing channel_update
  Later sync:             missing update is older than Alice's filter
  Result:                 Alice never gets it

  Timeline

  t1  Bob receives update A and update B
  t2  Alice queries Bob
  t3  Bob can serve only update A
  t4  Alice moves gossip filter forward
  t5  Bob finishes update B
  t6  update B is now behind Alice's filter

  Effect

  Alice graph for SCID 124244814135296:
  - one direction present
  - one direction missing

  => nil policy on one side
  => route construction fails

  Key point

  The race is not primarily “Alice received it but didn’t persist it”.
  It is:

  Bob received it
  but Bob could not serve it yet
  and Alice advanced the filter before Bob could replay it

lightninglabs-deploy · 2026-04-01T14:11:01Z

@ellemouton, remember to re-request review from reviewers when ready

gemini-code-assist bot reviewed Mar 25, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

scripts: fix flaky backwards compat test graph sync#10664

scripts: fix flaky backwards compat test graph sync#10664
ellemouton wants to merge 1 commit intolightningnetwork:masterfrom
ellemouton:fix-bw-compat-graph-sync

ellemouton commented Mar 25, 2026

Uh oh!

gemini-code-assist bot commented Mar 25, 2026

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Mar 25, 2026

Uh oh!

ziggie1984 commented Mar 25, 2026

Uh oh!

lightninglabs-deploy commented Apr 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

ellemouton commented Mar 25, 2026

Summary

Changes

Test plan

Uh oh!

gemini-code-assist bot commented Mar 25, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

ziggie1984 commented Mar 25, 2026

Uh oh!

lightninglabs-deploy commented Apr 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants