Stabilize diskless no-drop replication test#3511
Stabilize diskless no-drop replication test#3511sarthakaggarwal97 wants to merge 9 commits intovalkey-io:unstablefrom
Conversation
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## unstable #3511 +/- ##
============================================
+ Coverage 76.40% 76.44% +0.04%
============================================
Files 159 159
Lines 79851 79851
============================================
+ Hits 61008 61044 +36
+ Misses 18843 18807 -36 🚀 New features to boost your workflow:
|
|
Did this test fail recently, I know viktor attempted to fix it in this PR: #3461 Edit: I just saw the comments on the PR, looks like it did fail again recently. |
zuiderkwast
left a comment
There was a problem hiding this comment.
Looks pretty good. Only the "no replicas drop" case is covered, so the other cases can still have timing issues, such as "fast" and "slow" cases? I see there are some special cases for the other cases, for example for "timeout" there is pause_process as well. Do you have a full picture?
14afd0b to
5a24b23
Compare
|
@zuiderkwast 100 runs for all the variants passed: https://github.com/sarthakaggarwal97/valkey/actions/runs/24521848934 |
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
5a24b23 to
17730b5
Compare
|
@zuiderkwast I think it still deflakes the tests a lot, but I am afraid out of 500, I still see 1-5 flaky runs. |
Signed-off-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>
56ae8dc to
1e467b0
Compare
Signed-off-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>
|
This version is quite stable. Not seeing failures anymore. |
The 'slow' subcase fallback was searching for a log message matching '*Connection with replica client id * lost.*' but the actual server log format is 'Connection with replica <host>:<port> lost.' — there is no 'client id' in the message. Fix the glob to '*Connection with replica * lost.*'. Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
On fast ARM CI runners the RDB transfer completes before both replicas are killed, so the primary logs '2 replicas still up' instead of 'last replica dropped' or '1 replicas still up'. Add a nested catch fallback to accept all three possible outcomes. Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
zuiderkwast
left a comment
There was a problem hiding this comment.
Very good! Sorry for the delay.
| set replicas_alive [lreplace $replicas_alive 1 1] | ||
| } | ||
| if {$all_drop == "all" || $all_drop == "slow"} { | ||
| exec kill [srv -1 pid] |
There was a problem hiding this comment.
You added a variable set slow_replica_pid [srv -1 pid]. If you do that, then we should use the variable everywhere with the function to avoid confusion. We should also add a variable set fast_replica_pid [srv 0 pid] and use it for killing the fast replica, etc.
Or just keep using -1 (follow the old pattern) which is less readable but a smaller change.
For backporting, a smaller change is better? We can do either. (We can also just merge it as-is.)
This deflakes all variants of
diskless replicas drop during rdb pipe.The main issue turned out to be that the test was too sensitive to timing and log ordering under TLS, not that the core behavior was wrong. This keeps the same five subcases (no, slow, fast, all, timeout) but makes them much less CI-fragile.
CI passes 200 times: https://github.com/sarthakaggarwal97/valkey/actions/runs/24547258515