Skip faster-failover test under TLS #3444
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## unstable #3444 +/- ##
============================================
- Coverage 76.52% 76.46% -0.07%
============================================
Files 157 157
Lines 79025 79025
============================================
- Hits 60477 60425 -52
- Misses 18548 18600 +52 🚀 New features to boost your workflow:
|
|
@nmvk I believe that test was stabilized in unstable, do we need that tag now? |
|
@dvkashapov #3424 increased the tests internal timeout, but the issue here is different. The test is CPU-heavy (12 nodes × 5 iterations with TLS) and causes unrelated tests to fail from IO starvation. This cmake-tls run for #3391 failed after #3424 was merged. |
|
What if we skip it under TLS tests? Because I don't really want to skip this test in default CI.. |
12 nodes × 5 iterations this test on TLS causes too many TLS handshakes can be CPU intensive. Making this as tls:skip. Signed-off-by: nmvk <r@nmvk.com>
|
Thank you @dvkashapov for the suggestion, |
|
In the last Daily, none of the TLS jobs failed: https://github.com/valkey-io/valkey/actions/runs/24057919763 What differs between |
|
@zuiderkwast I did see gh api "repos/valkey-io/valkey/actions/workflows/ci.yml/runs?created=2026-03-07..2026-04-07&per_page=100" \
--paginate \
--jq '.workflow_runs[] | select(.conclusion == "success" or .conclusion == "failure") | .id' \
| xargs -P 20 -I {} gh api "repos/valkey-io/valkey/actions/runs/{}/jobs" \
--jq '.jobs[] | select(.name == "test-ubuntu-latest-cmake-tls") | [.started_at, .conclusion] | @tsv' \
| awk -F'\t' '
$1 < "2026-03-26T17:08" {b[$2]++}
$1 >= "2026-03-26T17:08" {a[$2]++}
END {
printf "Before: %d / %d (%.1f%%)\n", b["failure"], b["failure"]+b["success"], 100*b["failure"]/(b["failure"]+b["success"])
printf "After: %d / %d (%.1f%%)\n", a["failure"], a["failure"]+a["success"], 100*a["failure"]/(a["failure"]+a["success"])
}'
and got: cmake-tls failure rates went from ~19% to ~62% hence I believe the TLS handshakes in this test (12 nodes × 5 iterations) are heavy enough to cause IO contention with other tests on the same runner. |
|
Yeah, I agree this test is flaky. 😄 It was failing very often before #2811 (because the wrong replica won the election) but that fix improved it. Also as you mentioned I have tried to make it better in #3424 with longer wait for psync, but that was a different issue. Sure, we can skip it for TLS too, but I would like to understand the root cause. Are you sure it's the TLS handshakes? 12 nodes × 5 iterations doesn't seem that much, compared to all the client connections in the TLS tests. For I/O starvation, I think tests that loop 10000 times and send commands is heavier. I've tried to fix them like this: #3452 |
sarthakaggarwal97
left a comment
There was a problem hiding this comment.
I think it is fine to skip this test in TLS. Thanks!
|
@zuiderkwast My original thought was based on the assumption of correlation. However upon digging a little more, I lean to believe this test suite is always on edge with 4 vCPU and 16 workers, oversubscription is high and TLS as a whole makes this more tricky. I likely think addition of this test changed the order of test run perhaps leading to failures (I don't have any proof :( ). |
The faster-failover.tcl test starts 12 cluster nodes 5 times with TLS, which is CPU intensive due to the volume of TLS handshakes. Running alongside other tests on CI runners causes random I/O timeouts in unrelated tests. Tagging it as `tls:skip` Signed-off-by: nmvk <r@nmvk.com>
test-ubuntu-latest-cmake-tlshas been consistently failing in https://github.com/valkey-io/valkey/pull/3391/commits and on some other PRs intermittently.The faster-failover.tcl test starts 12 cluster nodes 5 times with TLS, which is CPU intensive due to the volume of TLS handshakes. Running alongside other tests on CI runners causes random I/O timeouts in unrelated tests.
Tagging it as
tls:skip