Skip to content

Stabilize diskless no-drop replication test#3511

Open
sarthakaggarwal97 wants to merge 9 commits intovalkey-io:unstablefrom
sarthakaggarwal97:daily-repl-rdb-child-timeout-20260414
Open

Stabilize diskless no-drop replication test#3511
sarthakaggarwal97 wants to merge 9 commits intovalkey-io:unstablefrom
sarthakaggarwal97:daily-repl-rdb-child-timeout-20260414

Conversation

@sarthakaggarwal97
Copy link
Copy Markdown
Contributor

@sarthakaggarwal97 sarthakaggarwal97 commented Apr 15, 2026

This deflakes all variants of diskless replicas drop during rdb pipe.

The main issue turned out to be that the test was too sensitive to timing and log ordering under TLS, not that the core behavior was wrong. This keeps the same five subcases (no, slow, fast, all, timeout) but makes them much less CI-fragile.

CI passes 200 times: https://github.com/sarthakaggarwal97/valkey/actions/runs/24547258515

@codecov
Copy link
Copy Markdown

codecov bot commented Apr 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.44%. Comparing base (4a42c95) to head (9a86d16).

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #3511      +/-   ##
============================================
+ Coverage     76.40%   76.44%   +0.04%     
============================================
  Files           159      159              
  Lines         79851    79851              
============================================
+ Hits          61008    61044      +36     
+ Misses        18843    18807      -36     

see 19 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@Nikhil-Manglore
Copy link
Copy Markdown
Member

Nikhil-Manglore commented Apr 15, 2026

Did this test fail recently, I know viktor attempted to fix it in this PR: #3461

Edit: I just saw the comments on the PR, looks like it did fail again recently.

Copy link
Copy Markdown
Contributor

@zuiderkwast zuiderkwast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good. Only the "no replicas drop" case is covered, so the other cases can still have timing issues, such as "fast" and "slow" cases? I see there are some special cases for the other cases, for example for "timeout" there is pause_process as well. Do you have a full picture?

Comment thread tests/integration/replication.tcl Outdated
Comment thread tests/integration/replication.tcl Outdated
Comment thread tests/integration/replication.tcl Outdated
@sarthakaggarwal97 sarthakaggarwal97 force-pushed the daily-repl-rdb-child-timeout-20260414 branch from 14afd0b to 5a24b23 Compare April 16, 2026 16:45
@sarthakaggarwal97
Copy link
Copy Markdown
Contributor Author

sarthakaggarwal97 commented Apr 16, 2026

@zuiderkwast 100 runs for all the variants passed: https://github.com/sarthakaggarwal97/valkey/actions/runs/24521848934

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
@sarthakaggarwal97
Copy link
Copy Markdown
Contributor Author

sarthakaggarwal97 commented Apr 17, 2026

@zuiderkwast I think it still deflakes the tests a lot, but I am afraid out of 500, I still see 1-5 flaky runs.

Signed-off-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>
@sarthakaggarwal97 sarthakaggarwal97 force-pushed the daily-repl-rdb-child-timeout-20260414 branch from 56ae8dc to 1e467b0 Compare April 17, 2026 04:04
Signed-off-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>
@sarthakaggarwal97
Copy link
Copy Markdown
Contributor Author

This version is quite stable. Not seeing failures anymore.

@madolson madolson added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Apr 17, 2026
The 'slow' subcase fallback was searching for a log message matching
'*Connection with replica client id * lost.*' but the actual server log
format is 'Connection with replica <host>:<port> lost.' — there is no
'client id' in the message. Fix the glob to '*Connection with replica * lost.*'.

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
On fast ARM CI runners the RDB transfer completes before both replicas
are killed, so the primary logs '2 replicas still up' instead of
'last replica dropped' or '1 replicas still up'. Add a nested catch
fallback to accept all three possible outcomes.

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Copy link
Copy Markdown
Contributor

@zuiderkwast zuiderkwast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good! Sorry for the delay.

set replicas_alive [lreplace $replicas_alive 1 1]
}
if {$all_drop == "all" || $all_drop == "slow"} {
exec kill [srv -1 pid]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You added a variable set slow_replica_pid [srv -1 pid]. If you do that, then we should use the variable everywhere with the function to avoid confusion. We should also add a variable set fast_replica_pid [srv 0 pid] and use it for killing the fast replica, etc.

Or just keep using -1 (follow the old pattern) which is less readable but a smaller change.

For backporting, a smaller change is better? We can do either. (We can also just merge it as-is.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants