Skip to content

Fix some flaky tests#3430

Merged
zuiderkwast merged 7 commits intovalkey-io:unstablefrom
zuiderkwast:fix-maxmemory-test-timeout
Apr 2, 2026
Merged

Fix some flaky tests#3430
zuiderkwast merged 7 commits intovalkey-io:unstablefrom
zuiderkwast:fix-maxmemory-test-timeout

Conversation

@zuiderkwast
Copy link
Copy Markdown
Contributor

@zuiderkwast zuiderkwast commented Apr 1, 2026

Fixing multiple flaky tests.

slave buffer are counted correctly in tests/unit/maxmemory.tcl
Memory efficiency with values in range * in tests/unit/memefficiency.tcl

These tests send large numbers of pipelined commands using deferring clients without reading replies, causing the server's client output buffer to grow. On slow CI runners, this leads to TCP backpressure and I/O errors that crash the test runner. Fix: Use CLIENT REPLY OFF to suppress reply generation, matching the pattern from commit 87d2330.


Sub-replica reports zero repl offset and rank, and fails to win election in tests/unit/cluster/replica-migration.tcl
New non-empty replica reports zero repl offset and rank, and fails to win election in tests/unit/cluster/replica-migration.tcl

In the replica-migration tests, a MOVED errors results in an Tcl exception. After failover, wait_for_condition blocks issue GET commands to cluster nodes that may not have fully updated their slot routing. An unhandled MOVED exception crashes the test runner. Fix: Wrap the condition in catch so MOVED errors are retried. Also wrap debug prints in the else clause. Fixes the following tests:


Replica can update the config epoch when trigger the failover - automatic in tests/unit/cluster/failover2.tcl

Increase wait timeout for failover expiry. The test waits 10 seconds for "Failover attempt expired", but the default cluster-node-timeout in start_cluster is 3000ms, making
auth_timeout 6 seconds plus ~3 seconds for failure detection — barely fitting in 10 seconds and failing on slow CI runners. Fix: Increase wait from 1000×10ms to 1200×50ms (60 seconds).


dual-channel-replication lazyfree test

The test looks up the replica's main-channel connection id after writing 50MB of data. On slow CI runners, the replica connection may have been disconnected by the output buffer soft limit (64MB/60s) before the lookup, causing get_client_id_by_last_cmd to return empty. Two changes:

  1. Move the connection id lookup before the write loop, while the sync is known to be in progress.
  2. Reduce writes from 50 x 1MB to 10 x 1MB. The test only needs enough data to exceed the lazyfree threshold (64 blocks ~= 1MB). 10MB is sufficient and avoids approaching the output buffer limit.

@zuiderkwast zuiderkwast added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Apr 1, 2026
@zuiderkwast
Copy link
Copy Markdown
Contributor Author

Nope. Tests still failing.

@zuiderkwast zuiderkwast changed the title Reduce command count in replica buffer test and add test keep-alive Try to fix flaky tests Apr 1, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 1, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.55%. Comparing base (8bb8d91) to head (0eecef6).
⚠️ Report is 1 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #3430      +/-   ##
============================================
- Coverage     76.82%   76.55%   -0.28%     
============================================
  Files           159      159              
  Lines         79704    79704              
============================================
- Hits          61232    61015     -217     
- Misses        18472    18689     +217     

see 26 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

The slave buffer test sends 1M pipelined commands without reading
replies, causing the server's client output buffer to grow. On slow
CI runners this can lead to TCP backpressure and I/O errors.

Use CLIENT REPLY OFF so the server doesn't generate replies, avoiding
the output buffer buildup entirely. This matches the pattern used in
commit 87d2330 for similar tests in memefficiency.tcl.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
…ldup

The test_memory_efficiency proc sends 10K pipelined SET commands with
values up to 16KB without reading replies. This causes the server's
client output buffer to grow, leading to TCP backpressure and
connection timeouts on slow CI runners.

Use CLIENT REPLY OFF to suppress reply generation, matching the
pattern used in the slave buffer test fix.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
The wait_for_condition checking key consistency in test_migrated_replica
and test_sub_replica can throw an unhandled MOVED exception when cluster
nodes haven't fully updated their slot routing after failover. This
crashes the entire test run, taking down all parallel tests.

Wrap the condition in catch so MOVED errors are treated as not ready
yet and retried. Also wrap the debug prints in the else clause in
catch since they can throw the same error. This turns a test runner
crash into a proper test failure if the condition is never met.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Same fix as for test_sub_replica: wrap the key consistency check in
catch so MOVED errors during cluster convergence are retried instead
of crashing the test runner. Also wrap debug prints in catch.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
The test waits for 'Failover attempt expired' with 1000*10ms = 10s,
but the default cluster-node-timeout in start_cluster is 3000ms, so
auth_timeout is 6s plus ~3s for failure detection, totaling ~9s. This
is barely enough and fails on slow CI runners.

Increase to 1200*50ms = 60s to provide sufficient margin.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
@zuiderkwast zuiderkwast force-pushed the fix-maxmemory-test-timeout branch from d288224 to 587566f Compare April 2, 2026 00:48
Comment thread tests/unit/maxmemory.tcl
Comment on lines +343 to +347
if {$k % 10000 == 0} {$rd_master flush}
}
for {set k 0} {$k < $cmd_count} {incr k} {
$rd_master read
}
$rd_master client reply on
$rd_master flush
$rd_master read ;# read the +OK from CLIENT REPLY ON
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is flush correct here, shouldn't we do client_reply_off_wait_for_server $rd_master like in 87d2330?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

client_reply_off_wait_for_server waits for the server to consume and execute the commands.

I'm not sure it's necessary here. If the replies build up in the $rd_master's incoming TCP buffer, $rd_master will not be able to handle TCP-layer metadata traffic, but now I guess the kernel can handle those TCP sequence numbers and ACK messages better. I'm not 100% sure though.

Copy link
Copy Markdown
Member

@dvkashapov dvkashapov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM (just one question above), thank you!

The test looks up the replica's main-channel connection id after
writing 50MB of data. On slow CI runners, the replica connection may
have been disconnected by the output buffer soft limit (64MB/60s)
before the lookup, causing get_client_id_by_last_cmd to return empty.

Two changes:
- Move the connection id lookup before the write loop, while the
  sync is known to be in progress.
- Reduce writes from 50 x 1MB to 10 x 1MB. The test only needs
  enough data to exceed the lazyfree threshold (64 blocks ~= 1MB).
  10MB is sufficient and avoids approaching the output buffer limit.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
@zuiderkwast zuiderkwast requested a review from enjoy-binbin April 2, 2026 08:20
@zuiderkwast zuiderkwast marked this pull request as ready for review April 2, 2026 08:45
@zuiderkwast zuiderkwast changed the title Try to fix flaky tests Fix some flaky tests Apr 2, 2026
@zuiderkwast zuiderkwast added the test-failure An issue indicating a test failure label Apr 2, 2026
@zuiderkwast zuiderkwast merged commit f3b6470 into valkey-io:unstable Apr 2, 2026
63 of 66 checks passed
@zuiderkwast zuiderkwast deleted the fix-maxmemory-test-timeout branch April 2, 2026 10:01
Nikhil-Manglore pushed a commit to Nikhil-Manglore/valkey that referenced this pull request Apr 7, 2026
Fixing multiple flaky tests.

    slave buffer are counted correctly in tests/unit/maxmemory.tcl
    Memory efficiency with values in range * in tests/unit/memefficiency.tcl

These tests send large numbers of pipelined commands using deferring
clients without reading replies, causing the server's client output
buffer to grow. On slow CI runners, this leads to TCP backpressure and
I/O errors that crash the test runner. Fix: Use CLIENT REPLY OFF to
suppress reply generation, matching the pattern from commit 87d2330.

---

    Sub-replica reports zero repl offset and rank, and fails to win election
    in tests/unit/cluster/replica-migration.tcl
    New non-empty replica reports zero repl offset and rank, and fails to
    win election in tests/unit/cluster/replica-migration.tcl

In the replica-migration tests, a MOVED errors results in an Tcl
exception. After failover, wait_for_condition blocks issue GET commands
to cluster nodes that may not have fully updated their slot routing. An
unhandled MOVED exception crashes the test runner. Fix: Wrap the
condition in catch so MOVED errors are retried. Also wrap debug prints
in the else clause. Fixes the following tests:

---

    Replica can update the config epoch when trigger the failover -
    automatic in tests/unit/cluster/failover2.tcl

Increase wait timeout for failover expiry. The test waits 10 seconds for
"Failover attempt expired", but the default cluster-node-timeout in
start_cluster is 3000ms, making auth_timeout 6 seconds plus ~3
seconds for failure detection — barely fitting in 10 seconds and failing
on slow CI runners. Fix: Increase wait from 1000×10ms to 1200×50ms
(60 seconds).

---

    dual-channel-replication lazyfree test

The test looks up the replica's main-channel connection id after writing
50MB of data. On slow CI runners, the replica connection may have been
disconnected by the output buffer soft limit (64MB/60s) before the
lookup, causing get_client_id_by_last_cmd to return empty. Two changes:

1. Move the connection id lookup before the write loop, while the sync
   is known to be in progress.
2. Reduce writes from 50 x 1MB to 10 x 1MB. The test only needs enough
   data to exceed the lazyfree threshold (64 blocks ~= 1MB). 10MB is
   sufficient and avoids approaching the output buffer limit.

---------

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
enjoy-binbin pushed a commit that referenced this pull request Apr 8, 2026
Some test cases write thoughsands of commands in a pipeline and
afterwards read the replies. This can lead to TCP ACK being dropped and
the connection broken. CLIENT REPLY OFF prevents this.

"Main db not affected when fail to diskless load" in
cluster/diskless-load-swapdb has been observed to be flaky.

The others are just defensive but they follow the same pattern.

Similar fixes in the past: #3430, #2483.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
@sarthakaggarwal97
Copy link
Copy Markdown
Contributor

sarthakaggarwal97 commented Apr 15, 2026

I think we need to backport this PR to stabilize weekly tests. Adding it to 9.0, 8.1, 8.0 and 7.2. It should help with the weekly tests that have been failing.

@github-project-automation github-project-automation Bot moved this to To be backported in Valkey 8.0 Apr 15, 2026
@github-project-automation github-project-automation Bot moved this to To be backported in Valkey 9.0 Apr 15, 2026
@github-project-automation github-project-automation Bot moved this to To be backported in Valkey 7.2 Apr 15, 2026
@github-project-automation github-project-automation Bot moved this to To be backported in Valkey 8.1 Apr 15, 2026
sarthakaggarwal97 added a commit to sarthakaggarwal97/valkey that referenced this pull request Apr 15, 2026
Cherry-pick of f3b6470 from unstable.
Skipped dual-channel-replication.tcl (lazyfree test doesn't exist on 8.0).
Resolved memefficiency.tcl conflict: kept $rd close after new CLIENT REPLY pattern.

Fixes test_slave_buffers and test_memory_efficiency flaky failures
by using CLIENT REPLY OFF to prevent TCP backpressure.

Signed-off-by: Sarthak Aggarwal <sarthakaggarwal97@gmail.com>
sarthakaggarwal97 added a commit to sarthakaggarwal97/valkey that referenced this pull request Apr 15, 2026
Partial cherry-pick of f3b6470 from unstable.
Only maxmemory.tcl and memefficiency.tcl apply to 7.2.
Skipped: dual-channel-replication.tcl, failover2.tcl, replica-migration.tcl
(tests/features don't exist on 7.2).
Adapted: valkey_deferring_client -> redis_deferring_client.
Kept $rd close in memefficiency.tcl (present on 7.2).

Fixes test_slave_buffers and test_memory_efficiency flaky failures
by using CLIENT REPLY OFF to prevent TCP backpressure.

Signed-off-by: Sarthak Aggarwal <sarthakaggarwal97@gmail.com>
sarthakaggarwal97 added a commit to sarthakaggarwal97/valkey that referenced this pull request Apr 15, 2026
Cherry-pick of f3b6470 from unstable.
Skipped dual-channel-replication.tcl (lazyfree test doesn't exist on 8.1).

Fixes test_slave_buffers and test_memory_efficiency flaky failures
by using CLIENT REPLY OFF to prevent TCP backpressure.

Signed-off-by: Sarthak Aggarwal <sarthakaggarwal97@gmail.com>
sarthakaggarwal97 added a commit to sarthakaggarwal97/valkey that referenced this pull request Apr 15, 2026
Cherry-pick of f3b6470 from unstable.
Skipped dual-channel-replication.tcl (lazyfree test doesn't exist on 9.0).

Fixes test_slave_buffers and test_memory_efficiency flaky failures
by using CLIENT REPLY OFF to prevent TCP backpressure.

Signed-off-by: Sarthak Aggarwal <sarthakaggarwal97@gmail.com>
sarthakaggarwal97 added a commit to sarthakaggarwal97/valkey that referenced this pull request Apr 15, 2026
Cherry-pick of f3b6470 from unstable.
Skipped dual-channel-replication.tcl (lazyfree test doesn't exist on 9.0).

Fixes test_slave_buffers and test_memory_efficiency flaky failures
by using CLIENT REPLY OFF to prevent TCP backpressure.

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
sarthakaggarwal97 added a commit to sarthakaggarwal97/valkey that referenced this pull request Apr 15, 2026
Cherry-pick of f3b6470 from unstable.
Skipped dual-channel-replication.tcl (lazyfree test doesn't exist on 8.1).

Fixes test_slave_buffers and test_memory_efficiency flaky failures
by using CLIENT REPLY OFF to prevent TCP backpressure.

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
sarthakaggarwal97 added a commit to sarthakaggarwal97/valkey that referenced this pull request Apr 15, 2026
Cherry-pick of f3b6470 from unstable.
Skipped dual-channel-replication.tcl (lazyfree test doesn't exist on 8.0).
Resolved memefficiency.tcl conflict: kept $rd close after new CLIENT REPLY pattern.

Fixes test_slave_buffers and test_memory_efficiency flaky failures
by using CLIENT REPLY OFF to prevent TCP backpressure.

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
sarthakaggarwal97 added a commit to sarthakaggarwal97/valkey that referenced this pull request Apr 15, 2026
Partial cherry-pick of f3b6470 from unstable.
Only maxmemory.tcl and memefficiency.tcl apply to 7.2.
Skipped: dual-channel-replication.tcl, failover2.tcl, replica-migration.tcl
(tests/features don't exist on 7.2).
Adapted: valkey_deferring_client -> redis_deferring_client.
Kept $rd close in memefficiency.tcl (present on 7.2).

Fixes test_slave_buffers and test_memory_efficiency flaky failures
by using CLIENT REPLY OFF to prevent TCP backpressure.

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
zuiderkwast pushed a commit that referenced this pull request Apr 16, 2026
Cherry-pick of f3b6470 from unstable.
Skipped dual-channel-replication.tcl (lazyfree test doesn't exist on
9.0).

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
zuiderkwast pushed a commit that referenced this pull request Apr 16, 2026
Cherry-pick of f3b6470 from unstable.
Skipped dual-channel-replication.tcl (lazyfree test doesn't exist on
8.1).

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
@zuiderkwast zuiderkwast moved this from To be backported to Done in Valkey 9.0 Apr 16, 2026
@zuiderkwast zuiderkwast moved this from To be backported to Done in Valkey 8.1 Apr 16, 2026
sarthakaggarwal97 pushed a commit to sarthakaggarwal97/valkey that referenced this pull request Apr 16, 2026
Fixing multiple flaky tests.

    slave buffer are counted correctly in tests/unit/maxmemory.tcl
    Memory efficiency with values in range * in tests/unit/memefficiency.tcl

These tests send large numbers of pipelined commands using deferring
clients without reading replies, causing the server's client output
buffer to grow. On slow CI runners, this leads to TCP backpressure and
I/O errors that crash the test runner. Fix: Use CLIENT REPLY OFF to
suppress reply generation, matching the pattern from commit 87d2330.

---

    Sub-replica reports zero repl offset and rank, and fails to win election
    in tests/unit/cluster/replica-migration.tcl
    New non-empty replica reports zero repl offset and rank, and fails to
    win election in tests/unit/cluster/replica-migration.tcl

In the replica-migration tests, a MOVED errors results in an Tcl
exception. After failover, wait_for_condition blocks issue GET commands
to cluster nodes that may not have fully updated their slot routing. An
unhandled MOVED exception crashes the test runner. Fix: Wrap the
condition in catch so MOVED errors are retried. Also wrap debug prints
in the else clause. Fixes the following tests:

---

    Replica can update the config epoch when trigger the failover -
    automatic in tests/unit/cluster/failover2.tcl

Increase wait timeout for failover expiry. The test waits 10 seconds for
"Failover attempt expired", but the default cluster-node-timeout in
start_cluster is 3000ms, making auth_timeout 6 seconds plus ~3
seconds for failure detection — barely fitting in 10 seconds and failing
on slow CI runners. Fix: Increase wait from 1000×10ms to 1200×50ms
(60 seconds).

---

    dual-channel-replication lazyfree test

The test looks up the replica's main-channel connection id after writing
50MB of data. On slow CI runners, the replica connection may have been
disconnected by the output buffer soft limit (64MB/60s) before the
lookup, causing get_client_id_by_last_cmd to return empty. Two changes:

1. Move the connection id lookup before the write loop, while the sync
   is known to be in progress.
2. Reduce writes from 50 x 1MB to 10 x 1MB. The test only needs enough
   data to exceed the lazyfree threshold (64 blocks ~= 1MB). 10MB is
   sufficient and avoids approaching the output buffer limit.

---------

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
sarthakaggarwal97 pushed a commit to sarthakaggarwal97/valkey that referenced this pull request Apr 16, 2026
Some test cases write thoughsands of commands in a pipeline and
afterwards read the replies. This can lead to TCP ACK being dropped and
the connection broken. CLIENT REPLY OFF prevents this.

"Main db not affected when fail to diskless load" in
cluster/diskless-load-swapdb has been observed to be flaky.

The others are just defensive but they follow the same pattern.

Similar fixes in the past: valkey-io#3430, valkey-io#2483.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
sarthakaggarwal97 added a commit to sarthakaggarwal97/valkey that referenced this pull request Apr 16, 2026
Cherry-pick of f3b6470 from unstable.
Skipped dual-channel-replication.tcl (lazyfree test doesn't exist on 8.0).
Resolved memefficiency.tcl: kept $rd close after CLIENT REPLY pattern.

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
sarthakaggarwal97 added a commit to sarthakaggarwal97/valkey that referenced this pull request Apr 16, 2026
Partial cherry-pick of f3b6470 from unstable.
Only maxmemory.tcl and memefficiency.tcl apply to 7.2.
Adapted: valkey_deferring_client -> redis_deferring_client.

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
@asagege
Copy link
Copy Markdown
Contributor

asagege commented Apr 16, 2026

@zuiderkwast Hey Viktor, another failing test from #3523 for this flaky one:
Sub-replica reports zero repl offset and rank, and fails to win election - sigstop in tests/unit/cluster/replica-migration.tcl
https://github.com/valkey-io/valkey/actions/runs/24524140300/job/71689646619?pr=3523#step:4:9434

Then I re-ran it and it passed, but it did fail sometimes though.

sarthakaggarwal97 added a commit to sarthakaggarwal97/valkey that referenced this pull request Apr 16, 2026
Use CLIENT REPLY OFF pattern in the defrag eval scripts test which
pipelines 50,000 script loads + 50,000 set commands. Without this,
the TCP send buffer fills up causing 'I/O error reading reply'.

Same CLIENT REPLY OFF pattern as valkey-io#3430 and valkey-io#3452.

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) test-failure An issue indicating a test failure

Projects

Status: To be backported
Status: To be backported
Status: Done
Status: Done

Development

Successfully merging this pull request may close these issues.

5 participants