Stabilize diskless no-drop replication test by sarthakaggarwal97 · Pull Request #3511 · valkey-io/valkey

sarthakaggarwal97 · 2026-04-15T04:16:16Z

This deflakes all variants of the "diskless replicas drop during rdb pipe" test.

The main issue turned out to be that the test was too sensitive to timing and log ordering under TLS, not that the core behavior was wrong. This keeps the same five subcases (no, slow, fast, all, timeout) but makes them much less CI-fragile.

CI passes 200 times: https://github.com/sarthakaggarwal97/valkey/actions/runs/24547258515

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>

codecov · 2026-04-15T05:57:32Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.44%. Comparing base (4a42c95) to head (9a86d16).

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #3511      +/-   ##
============================================
+ Coverage     76.40%   76.44%   +0.04%     
============================================
  Files           159      159              
  Lines         79851    79851              
============================================
+ Hits          61008    61044      +36     
+ Misses        18843    18807      -36

see 19 files with indirect coverage changes

Nikhil-Manglore · 2026-04-15T06:04:56Z

Did this test fail recently? I know Viktor attempted to fix it in #3461.

Edit: I just saw the comments on the PR, looks like it did fail again recently.

zuiderkwast

Looks pretty good. Only the "no replicas drop" case is covered, so can the other cases, such as "fast" and "slow", still have timing issues? I see the other cases have some special handling; for example, "timeout" also uses pause_process. Do you have a full picture?

sarthakaggarwal97 · 2026-04-16T16:50:04Z

@zuiderkwast 100 runs for all the variants passed: https://github.com/sarthakaggarwal97/valkey/actions/runs/24521848934

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>

sarthakaggarwal97 · 2026-04-17T00:40:37Z

@zuiderkwast I think it still deflakes the tests a lot, but I'm afraid that out of 500 runs, I still see 1-5 flaky ones.

Signed-off-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>

sarthakaggarwal97 · 2026-04-17T04:39:12Z

This version is quite stable. Not seeing failures anymore.

The 'slow' subcase fallback was searching for a log message matching '*Connection with replica client id * lost.*', but the actual server log format is 'Connection with replica <host>:<port> lost.' There is no 'client id' in the message. Fix the glob to '*Connection with replica * lost.*'.

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
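To illustrate the glob mismatch described in this commit message, here is a minimal standalone sketch using Tcl's built-in string match. The host and port in the sample line are made up, and the variable names are only for illustration; the two patterns are the ones quoted above.

```tcl
# Sample log line in the format quoted in the commit message.
# The host and port are invented for this example.
set line "Connection with replica 127.0.0.1:21112 lost."

# Old pattern: requires the literal text "client id", which the line lacks.
set old_glob "*Connection with replica client id * lost.*"
# Fixed pattern: accepts anything between "replica" and "lost.".
set new_glob "*Connection with replica * lost.*"

puts "old glob matches: [string match $old_glob $line]"
puts "new glob matches: [string match $new_glob $line]"
```

The leading and trailing * let the pattern match regardless of any prefix or suffix the server adds to the log line.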

On fast ARM CI runners the RDB transfer completes before both replicas are killed, so the primary logs '2 replicas still up' instead of 'last replica dropped' or '1 replicas still up'. Add a nested catch fallback to accept all three possible outcomes.

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
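The nested catch fallback could look roughly like the sketch below. This is not the code from the PR: expect_log_line and accepted_outcome are hypothetical helpers that search an in-memory list instead of waiting on a real server log, and the sample log line is invented around the message fragment quoted above.

```tcl
# Hypothetical stand-in for a log-wait helper: returns the first line
# matching the glob, or raises an error if no line matches.
proc expect_log_line {log pattern} {
    foreach line $log {
        if {[string match $pattern $line]} { return $line }
    }
    error "no log line matches $pattern"
}

# Try the three accepted outcomes in order. Only the innermost call is
# left uncaught, so the lookup fails only if none of the three is found.
proc accepted_outcome {log} {
    if {[catch {expect_log_line $log "*last replica dropped*"} res]} {
        if {[catch {expect_log_line $log "*1 replicas still up*"} res]} {
            set res [expect_log_line $log "*2 replicas still up*"]
        }
    }
    return $res
}

# Invented sample: the transfer finished before either replica was killed.
set log [list "sample: 2 replicas still up"]
puts [accepted_outcome $log]
```

Leaving the last lookup outside any catch keeps the failure visible: if the log shows none of the three outcomes, the error still propagates to the caller.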

zuiderkwast

Very good! Sorry for the delay.

zuiderkwast · 2026-04-16T19:47:58Z

                        set replicas_alive [lreplace $replicas_alive 1 1]
                    }
                    if {$all_drop == "all" || $all_drop == "slow"} {
                        exec kill [srv -1 pid]


You added a variable: set slow_replica_pid [srv -1 pid]. If you do that, then we should use the variable everywhere in place of the function call to avoid confusion. We should also add a variable, set fast_replica_pid [srv 0 pid], and use it for killing the fast replica, etc.

Or just keep using -1 (follow the old pattern) which is less readable but a smaller change.

For backporting, a smaller change is better? We can do either. (We can also just merge it as-is.)

sarthakaggarwal97 and others added 2 commits April 14, 2026 19:18

Stabilize diskless no-drop replication test

562a8d9

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>

Merge branch 'valkey-io:unstable' into daily-repl-rdb-child-timeout-20260414

298cd02

sarthakaggarwal97 mentioned this pull request Apr 15, 2026

[Flaky Tests] Avoid re-triggering io-thread activation #3509

Merged

github-actions bot assigned sarthakaggarwal97 Apr 15, 2026

sarthakaggarwal97 requested a review from zuiderkwast April 15, 2026 16:34

zuiderkwast reviewed Apr 16, 2026

Comment thread tests/integration/replication.tcl Outdated

Comment thread tests/integration/replication.tcl Outdated

Comment thread tests/integration/replication.tcl Outdated

sarthakaggarwal97 force-pushed the daily-repl-rdb-child-timeout-20260414 branch from 14afd0b to 5a24b23 Compare April 16, 2026 16:45

Stabilize diskless no-drop replication test

17730b5

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>

sarthakaggarwal97 force-pushed the daily-repl-rdb-child-timeout-20260414 branch from 5a24b23 to 17730b5 Compare April 16, 2026 18:29

asagege mentioned this pull request Apr 16, 2026

Test repl fix to see if test-ubuntu-latest-cmake-tls passes (forkless) #3523

Open

Deflake diskless replication pipe test

1e467b0

Signed-off-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>

sarthakaggarwal97 force-pushed the daily-repl-rdb-child-timeout-20260414 branch from 56ae8dc to 1e467b0 Compare April 17, 2026 04:04

Give diskless no-drop replicas more reconnect time

7a374a6

Signed-off-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>

Merge branch 'unstable' into daily-repl-rdb-child-timeout-20260414

791f7fe

madolson added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Apr 17, 2026

sarthakaggarwal97 added 3 commits April 17, 2026 17:41

Merge branch 'unstable' into daily-repl-rdb-child-timeout-20260414

9a86d16

zuiderkwast approved these changes Apr 21, 2026

sarthakaggarwal97 wants to merge 9 commits into valkey-io:unstable from sarthakaggarwal97:daily-repl-rdb-child-timeout-20260414

4 participants
