Fix some flaky tests by zuiderkwast · Pull Request #3430 · valkey-io/valkey

zuiderkwast · 2026-04-01T15:27:10Z

Fixing multiple flaky tests.

slave buffer are counted correctly in tests/unit/maxmemory.tcl
Memory efficiency with values in range * in tests/unit/memefficiency.tcl

These tests send large numbers of pipelined commands using deferring clients without reading replies, causing the server's client output buffer to grow. On slow CI runners, this leads to TCP backpressure and I/O errors that crash the test runner. Fix: Use CLIENT REPLY OFF to suppress reply generation, matching the pattern from commit 87d2330.

Sub-replica reports zero repl offset and rank, and fails to win election in tests/unit/cluster/replica-migration.tcl
New non-empty replica reports zero repl offset and rank, and fails to win election in tests/unit/cluster/replica-migration.tcl

In the replica-migration tests, a MOVED errors results in an Tcl exception. After failover, wait_for_condition blocks issue GET commands to cluster nodes that may not have fully updated their slot routing. An unhandled MOVED exception crashes the test runner. Fix: Wrap the condition in catch so MOVED errors are retried. Also wrap debug prints in the else clause. Fixes the following tests:

Replica can update the config epoch when trigger the failover - automatic in tests/unit/cluster/failover2.tcl

Increase wait timeout for failover expiry. The test waits 10 seconds for "Failover attempt expired", but the default cluster-node-timeout in start_cluster is 3000ms, making
auth_timeout 6 seconds plus ~3 seconds for failure detection — barely fitting in 10 seconds and failing on slow CI runners. Fix: Increase wait from 1000×10ms to 1200×50ms (60 seconds).

dual-channel-replication lazyfree test

The test looks up the replica's main-channel connection id after writing 50MB of data. On slow CI runners, the replica connection may have been disconnected by the output buffer soft limit (64MB/60s) before the lookup, causing get_client_id_by_last_cmd to return empty. Two changes:

Move the connection id lookup before the write loop, while the sync is known to be in progress.
Reduce writes from 50 x 1MB to 10 x 1MB. The test only needs enough data to exceed the lazyfree threshold (64 blocks ~= 1MB). 10MB is sufficient and avoids approaching the output buffer limit.

zuiderkwast · 2026-04-01T18:43:37Z

Nope. Tests still failing.

codecov · 2026-04-01T23:38:06Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.55%. Comparing base (8bb8d91) to head (0eecef6).
⚠️ Report is 1 commits behind head on unstable.

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #3430      +/-   ##
============================================
- Coverage     76.82%   76.55%   -0.28%     
============================================
  Files           159      159              
  Lines         79704    79704              
============================================
- Hits          61232    61015     -217     
- Misses        18472    18689     +217

see 26 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

The slave buffer test sends 1M pipelined commands without reading replies, causing the server's client output buffer to grow. On slow CI runners this can lead to TCP backpressure and I/O errors. Use CLIENT REPLY OFF so the server doesn't generate replies, avoiding the output buffer buildup entirely. This matches the pattern used in commit 87d2330 for similar tests in memefficiency.tcl. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

…ldup The test_memory_efficiency proc sends 10K pipelined SET commands with values up to 16KB without reading replies. This causes the server's client output buffer to grow, leading to TCP backpressure and connection timeouts on slow CI runners. Use CLIENT REPLY OFF to suppress reply generation, matching the pattern used in the slave buffer test fix. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

The wait_for_condition checking key consistency in test_migrated_replica and test_sub_replica can throw an unhandled MOVED exception when cluster nodes haven't fully updated their slot routing after failover. This crashes the entire test run, taking down all parallel tests. Wrap the condition in catch so MOVED errors are treated as not ready yet and retried. Also wrap the debug prints in the else clause in catch since they can throw the same error. This turns a test runner crash into a proper test failure if the condition is never met. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

Same fix as for test_sub_replica: wrap the key consistency check in catch so MOVED errors during cluster convergence are retried instead of crashing the test runner. Also wrap debug prints in catch. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

The test waits for 'Failover attempt expired' with 1000*10ms = 10s, but the default cluster-node-timeout in start_cluster is 3000ms, so auth_timeout is 6s plus ~3s for failure detection, totaling ~9s. This is barely enough and fails on slow CI runners. Increase to 1200*50ms = 60s to provide sufficient margin. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

dvkashapov · 2026-04-02T06:28:44Z

+                    if {$k % 10000 == 0} {$rd_master flush}
                }
-                for {set k 0} {$k < $cmd_count} {incr k} {
-                    $rd_master read
-                }
+                $rd_master client reply on
+                $rd_master flush
+                $rd_master read ;# read the +OK from CLIENT REPLY ON


Is flush correct here, shouldn't we do client_reply_off_wait_for_server $rd_master like in 87d2330?

client_reply_off_wait_for_server waits for the server to consume and execute the commands.

I'm not sure it's necessary here. If the replies build up in the $rd_master's incoming TCP buffer, $rd_master will not be able to handle TCP-layer metadata traffic, but now I guess the kernel can handle those TCP sequence numbers and ACK messages better. I'm not 100% sure though.

dvkashapov

Overall LGTM (just one question above), thank you!

The test looks up the replica's main-channel connection id after writing 50MB of data. On slow CI runners, the replica connection may have been disconnected by the output buffer soft limit (64MB/60s) before the lookup, causing get_client_id_by_last_cmd to return empty. Two changes: - Move the connection id lookup before the write loop, while the sync is known to be in progress. - Reduce writes from 50 x 1MB to 10 x 1MB. The test only needs enough data to exceed the lazyfree threshold (64 blocks ~= 1MB). 10MB is sufficient and avoids approaching the output buffer limit. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

…t-timeout

Fixing multiple flaky tests. slave buffer are counted correctly in tests/unit/maxmemory.tcl Memory efficiency with values in range * in tests/unit/memefficiency.tcl These tests send large numbers of pipelined commands using deferring clients without reading replies, causing the server's client output buffer to grow. On slow CI runners, this leads to TCP backpressure and I/O errors that crash the test runner. Fix: Use CLIENT REPLY OFF to suppress reply generation, matching the pattern from commit 87d2330. --- Sub-replica reports zero repl offset and rank, and fails to win election in tests/unit/cluster/replica-migration.tcl New non-empty replica reports zero repl offset and rank, and fails to win election in tests/unit/cluster/replica-migration.tcl In the replica-migration tests, a MOVED errors results in an Tcl exception. After failover, wait_for_condition blocks issue GET commands to cluster nodes that may not have fully updated their slot routing. An unhandled MOVED exception crashes the test runner. Fix: Wrap the condition in catch so MOVED errors are retried. Also wrap debug prints in the else clause. Fixes the following tests: --- Replica can update the config epoch when trigger the failover - automatic in tests/unit/cluster/failover2.tcl Increase wait timeout for failover expiry. The test waits 10 seconds for "Failover attempt expired", but the default cluster-node-timeout in start_cluster is 3000ms, making auth_timeout 6 seconds plus ~3 seconds for failure detection — barely fitting in 10 seconds and failing on slow CI runners. Fix: Increase wait from 1000×10ms to 1200×50ms (60 seconds). --- dual-channel-replication lazyfree test The test looks up the replica's main-channel connection id after writing 50MB of data. On slow CI runners, the replica connection may have been disconnected by the output buffer soft limit (64MB/60s) before the lookup, causing get_client_id_by_last_cmd to return empty. Two changes: 1. Move the connection id lookup before the write loop, while the sync is known to be in progress. 2. Reduce writes from 50 x 1MB to 10 x 1MB. The test only needs enough data to exceed the lazyfree threshold (64 blocks ~= 1MB). 10MB is sufficient and avoids approaching the output buffer limit. --------- Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

Some test cases write thoughsands of commands in a pipeline and afterwards read the replies. This can lead to TCP ACK being dropped and the connection broken. CLIENT REPLY OFF prevents this. "Main db not affected when fail to diskless load" in cluster/diskless-load-swapdb has been observed to be flaky. The others are just defensive but they follow the same pattern. Similar fixes in the past: #3430, #2483. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

sarthakaggarwal97 · 2026-04-15T22:06:07Z

I think we need to backport this PR to stabilize weekly tests. Adding it to 9.0, 8.1, 8.0 and 7.2. It should help with the weekly tests that have been failing.

Cherry-pick of f3b6470 from unstable. Skipped dual-channel-replication.tcl (lazyfree test doesn't exist on 8.0). Resolved memefficiency.tcl conflict: kept $rd close after new CLIENT REPLY pattern. Fixes test_slave_buffers and test_memory_efficiency flaky failures by using CLIENT REPLY OFF to prevent TCP backpressure. Signed-off-by: Sarthak Aggarwal <sarthakaggarwal97@gmail.com>

Partial cherry-pick of f3b6470 from unstable. Only maxmemory.tcl and memefficiency.tcl apply to 7.2. Skipped: dual-channel-replication.tcl, failover2.tcl, replica-migration.tcl (tests/features don't exist on 7.2). Adapted: valkey_deferring_client -> redis_deferring_client. Kept $rd close in memefficiency.tcl (present on 7.2). Fixes test_slave_buffers and test_memory_efficiency flaky failures by using CLIENT REPLY OFF to prevent TCP backpressure. Signed-off-by: Sarthak Aggarwal <sarthakaggarwal97@gmail.com>

Cherry-pick of f3b6470 from unstable. Skipped dual-channel-replication.tcl (lazyfree test doesn't exist on 8.1). Fixes test_slave_buffers and test_memory_efficiency flaky failures by using CLIENT REPLY OFF to prevent TCP backpressure. Signed-off-by: Sarthak Aggarwal <sarthakaggarwal97@gmail.com>

Cherry-pick of f3b6470 from unstable. Skipped dual-channel-replication.tcl (lazyfree test doesn't exist on 9.0). Fixes test_slave_buffers and test_memory_efficiency flaky failures by using CLIENT REPLY OFF to prevent TCP backpressure. Signed-off-by: Sarthak Aggarwal <sarthakaggarwal97@gmail.com>

Cherry-pick of f3b6470 from unstable. Skipped dual-channel-replication.tcl (lazyfree test doesn't exist on 9.0). Fixes test_slave_buffers and test_memory_efficiency flaky failures by using CLIENT REPLY OFF to prevent TCP backpressure. Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>

Cherry-pick of f3b6470 from unstable. Skipped dual-channel-replication.tcl (lazyfree test doesn't exist on 8.1). Fixes test_slave_buffers and test_memory_efficiency flaky failures by using CLIENT REPLY OFF to prevent TCP backpressure. Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>

Cherry-pick of f3b6470 from unstable. Skipped dual-channel-replication.tcl (lazyfree test doesn't exist on 8.0). Resolved memefficiency.tcl conflict: kept $rd close after new CLIENT REPLY pattern. Fixes test_slave_buffers and test_memory_efficiency flaky failures by using CLIENT REPLY OFF to prevent TCP backpressure. Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>

Partial cherry-pick of f3b6470 from unstable. Only maxmemory.tcl and memefficiency.tcl apply to 7.2. Skipped: dual-channel-replication.tcl, failover2.tcl, replica-migration.tcl (tests/features don't exist on 7.2). Adapted: valkey_deferring_client -> redis_deferring_client. Kept $rd close in memefficiency.tcl (present on 7.2). Fixes test_slave_buffers and test_memory_efficiency flaky failures by using CLIENT REPLY OFF to prevent TCP backpressure. Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>

Cherry-pick of f3b6470 from unstable. Skipped dual-channel-replication.tcl (lazyfree test doesn't exist on 9.0). Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>

Cherry-pick of f3b6470 from unstable. Skipped dual-channel-replication.tcl (lazyfree test doesn't exist on 8.1). Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>

Fixing multiple flaky tests. slave buffer are counted correctly in tests/unit/maxmemory.tcl Memory efficiency with values in range * in tests/unit/memefficiency.tcl These tests send large numbers of pipelined commands using deferring clients without reading replies, causing the server's client output buffer to grow. On slow CI runners, this leads to TCP backpressure and I/O errors that crash the test runner. Fix: Use CLIENT REPLY OFF to suppress reply generation, matching the pattern from commit 87d2330. --- Sub-replica reports zero repl offset and rank, and fails to win election in tests/unit/cluster/replica-migration.tcl New non-empty replica reports zero repl offset and rank, and fails to win election in tests/unit/cluster/replica-migration.tcl In the replica-migration tests, a MOVED errors results in an Tcl exception. After failover, wait_for_condition blocks issue GET commands to cluster nodes that may not have fully updated their slot routing. An unhandled MOVED exception crashes the test runner. Fix: Wrap the condition in catch so MOVED errors are retried. Also wrap debug prints in the else clause. Fixes the following tests: --- Replica can update the config epoch when trigger the failover - automatic in tests/unit/cluster/failover2.tcl Increase wait timeout for failover expiry. The test waits 10 seconds for "Failover attempt expired", but the default cluster-node-timeout in start_cluster is 3000ms, making auth_timeout 6 seconds plus ~3 seconds for failure detection — barely fitting in 10 seconds and failing on slow CI runners. Fix: Increase wait from 1000×10ms to 1200×50ms (60 seconds). --- dual-channel-replication lazyfree test The test looks up the replica's main-channel connection id after writing 50MB of data. On slow CI runners, the replica connection may have been disconnected by the output buffer soft limit (64MB/60s) before the lookup, causing get_client_id_by_last_cmd to return empty. Two changes: 1. Move the connection id lookup before the write loop, while the sync is known to be in progress. 2. Reduce writes from 50 x 1MB to 10 x 1MB. The test only needs enough data to exceed the lazyfree threshold (64 blocks ~= 1MB). 10MB is sufficient and avoids approaching the output buffer limit. --------- Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

Some test cases write thoughsands of commands in a pipeline and afterwards read the replies. This can lead to TCP ACK being dropped and the connection broken. CLIENT REPLY OFF prevents this. "Main db not affected when fail to diskless load" in cluster/diskless-load-swapdb has been observed to be flaky. The others are just defensive but they follow the same pattern. Similar fixes in the past: valkey-io#3430, valkey-io#2483. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

Cherry-pick of f3b6470 from unstable. Skipped dual-channel-replication.tcl (lazyfree test doesn't exist on 8.0). Resolved memefficiency.tcl: kept $rd close after CLIENT REPLY pattern. Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>

Partial cherry-pick of f3b6470 from unstable. Only maxmemory.tcl and memefficiency.tcl apply to 7.2. Adapted: valkey_deferring_client -> redis_deferring_client. Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>

asagege · 2026-04-16T23:00:23Z

@zuiderkwast Hey Viktor, another failing test from #3523 for this flaky one:
Sub-replica reports zero repl offset and rank, and fails to win election - sigstop in tests/unit/cluster/replica-migration.tcl
https://github.com/valkey-io/valkey/actions/runs/24524140300/job/71689646619?pr=3523#step:4:9434

Then I re-ran it and it passed, but it did fail sometimes though.

Use CLIENT REPLY OFF pattern in the defrag eval scripts test which pipelines 50,000 script loads + 50,000 set commands. Without this, the TCP send buffer fills up causing 'I/O error reading reply'. Same CLIENT REPLY OFF pattern as valkey-io#3430 and valkey-io#3452. Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>

github-actions Bot assigned zuiderkwast Apr 1, 2026

zuiderkwast added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Apr 1, 2026

zuiderkwast changed the title ~~Reduce command count in replica buffer test and add test keep-alive~~ Try to fix flaky tests Apr 1, 2026

zuiderkwast added 5 commits April 2, 2026 02:47

zuiderkwast force-pushed the fix-maxmemory-test-timeout branch from d288224 to 587566f Compare April 2, 2026 00:48

dvkashapov reviewed Apr 2, 2026

View reviewed changes

dvkashapov approved these changes Apr 2, 2026

View reviewed changes

zuiderkwast added 2 commits April 2, 2026 09:03

Merge remote-tracking branch 'valkey/unstable' into fix-maxmemory-tes…

0eecef6

…t-timeout

zuiderkwast requested a review from enjoy-binbin April 2, 2026 08:20

enjoy-binbin approved these changes Apr 2, 2026

View reviewed changes

zuiderkwast marked this pull request as ready for review April 2, 2026 08:45

zuiderkwast changed the title ~~Try to fix flaky tests~~ Fix some flaky tests Apr 2, 2026

zuiderkwast added the test-failure An issue indicating a test failure label Apr 2, 2026

zuiderkwast merged commit f3b6470 into valkey-io:unstable Apr 2, 2026
63 of 66 checks passed

zuiderkwast deleted the fix-maxmemory-test-timeout branch April 2, 2026 10:01

This was referenced Apr 7, 2026

Fix slot-migration-max-failover-repl-bytes unable to accept -1 #3443

Merged

Fix some flaky tests using CLIENT REPLY OFF #3452

Merged

sarthakaggarwal97 mentioned this pull request Apr 14, 2026

Merge unstable into 9.1 #3507

Closed

sarthakaggarwal97 added this to Valkey 7.2 and Valkey 8.0 Apr 15, 2026

sarthakaggarwal97 added this to Valkey 9.0 and Valkey 8.1 Apr 15, 2026

github-project-automation Bot moved this to To be backported in Valkey 8.0 Apr 15, 2026

github-project-automation Bot moved this to To be backported in Valkey 9.0 Apr 15, 2026

github-project-automation Bot moved this to To be backported in Valkey 7.2 Apr 15, 2026

github-project-automation Bot moved this to To be backported in Valkey 8.1 Apr 15, 2026

zuiderkwast moved this from To be backported to Done in Valkey 9.0 Apr 16, 2026

zuiderkwast moved this from To be backported to Done in Valkey 8.1 Apr 16, 2026

sarthakaggarwal97 mentioned this pull request Apr 16, 2026

Backport Unstable to 9.1 for RC2 #3519

Open

This was referenced Apr 16, 2026

Backport Test Stability Fixes to 7.2 #3524

Closed

Backport test stability fixes to 8.0 #3525

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix some flaky tests#3430

Fix some flaky tests#3430
zuiderkwast merged 7 commits intovalkey-io:unstablefrom
zuiderkwast:fix-maxmemory-test-timeout

zuiderkwast commented Apr 1, 2026 •

edited

Loading

Uh oh!

zuiderkwast commented Apr 1, 2026

Uh oh!

codecov Bot commented Apr 1, 2026 •

edited

Loading

Uh oh!

dvkashapov Apr 2, 2026

Uh oh!

zuiderkwast Apr 2, 2026

Uh oh!

dvkashapov left a comment •

edited

Loading

Uh oh!

Uh oh!

sarthakaggarwal97 commented Apr 15, 2026 •

edited

Loading

Uh oh!

asagege commented Apr 16, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

zuiderkwast commented Apr 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

zuiderkwast commented Apr 1, 2026

Uh oh!

codecov Bot commented Apr 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

dvkashapov Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

zuiderkwast Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

dvkashapov left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

sarthakaggarwal97 commented Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

asagege commented Apr 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

zuiderkwast commented Apr 1, 2026 •

edited

Loading

codecov Bot commented Apr 1, 2026 •

edited

Loading

dvkashapov left a comment •

edited

Loading

sarthakaggarwal97 commented Apr 15, 2026 •

edited

Loading

asagege commented Apr 16, 2026 •

edited

Loading