Skip to content

Skip faster-failover test under TLS #3444

Merged
zuiderkwast merged 1 commit intovalkey-io:unstablefrom
nmvk:cmakes-failure
Apr 8, 2026
Merged

Skip faster-failover test under TLS #3444
zuiderkwast merged 1 commit intovalkey-io:unstablefrom
nmvk:cmakes-failure

Conversation

@nmvk
Copy link
Copy Markdown
Contributor

@nmvk nmvk commented Apr 4, 2026

test-ubuntu-latest-cmake-tls has been consistently failing in https://github.com/valkey-io/valkey/pull/3391/commits and on some other PRs intermittently.

The faster-failover.tcl test starts 12 cluster nodes 5 times with TLS, which is CPU intensive due to the volume of TLS handshakes. Running alongside other tests on CI runners causes random I/O timeouts in unrelated tests.

Tagging it as tls:skip

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.46%. Comparing base (e8ca6c0) to head (563c7d6).
⚠️ Report is 7 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #3444      +/-   ##
============================================
- Coverage     76.52%   76.46%   -0.07%     
============================================
  Files           157      157              
  Lines         79025    79025              
============================================
- Hits          60477    60425      -52     
- Misses        18548    18600      +52     

see 24 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@dvkashapov
Copy link
Copy Markdown
Member

@nmvk I believe that test was stabilized in unstable, do we need that tag now?

@nmvk
Copy link
Copy Markdown
Contributor Author

nmvk commented Apr 6, 2026

@dvkashapov #3424 increased the tests internal timeout, but the issue here is different. The test is CPU-heavy (12 nodes × 5 iterations with TLS) and causes unrelated tests to fail from IO starvation. This cmake-tls run for #3391 failed after #3424 was merged.

@dvkashapov
Copy link
Copy Markdown
Member

What if we skip it under TLS tests? Because I don't really want to skip this test in default CI..

12 nodes × 5 iterations this test on TLS causes too many
TLS handshakes can be CPU intensive.

Making this as tls:skip.

Signed-off-by: nmvk <r@nmvk.com>
@nmvk nmvk force-pushed the cmakes-failure branch from e98974a to 563c7d6 Compare April 7, 2026 05:02
@nmvk nmvk changed the title Tag faster-failover test as slow Skip faster-failover test under TLS Apr 7, 2026
@nmvk
Copy link
Copy Markdown
Contributor Author

nmvk commented Apr 7, 2026

Thank you @dvkashapov for the suggestion, tls:skip seems more apt here.

Copy link
Copy Markdown
Member

@Nikhil-Manglore Nikhil-Manglore left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@zuiderkwast
Copy link
Copy Markdown
Contributor

zuiderkwast commented Apr 7, 2026

In the last Daily, none of the TLS jobs failed:

https://github.com/valkey-io/valkey/actions/runs/24057919763

What differs between test-ubuntu-latest-cmake-tls and the other TLS test jobs?

@nmvk
Copy link
Copy Markdown
Contributor Author

nmvk commented Apr 7, 2026

@zuiderkwast I did see test-ubuntu-latest-cmake-tls pass in many PRs, failures seemed intermittent rather than consistent, my observation is that failures are now much more than before.

gh api "repos/valkey-io/valkey/actions/workflows/ci.yml/runs?created=2026-03-07..2026-04-07&per_page=100" \
  --paginate \
  --jq '.workflow_runs[] | select(.conclusion == "success" or .conclusion == "failure") | .id' \
| xargs -P 20 -I {} gh api "repos/valkey-io/valkey/actions/runs/{}/jobs" \
    --jq '.jobs[] | select(.name == "test-ubuntu-latest-cmake-tls") | [.started_at, .conclusion] | @tsv' \
| awk -F'\t' '
    $1 <  "2026-03-26T17:08" {b[$2]++}
    $1 >= "2026-03-26T17:08" {a[$2]++}
    END {
      printf "Before: %d / %d (%.1f%%)\n", b["failure"], b["failure"]+b["success"], 100*b["failure"]/(b["failure"]+b["success"])
      printf "After:  %d / %d (%.1f%%)\n", a["failure"], a["failure"]+a["success"], 100*a["failure"]/(a["failure"]+a["success"])
    }'

and got:

Before: 50 / 264 (18.9%)
After:  92 / 149 (61.7%)

cmake-tls failure rates went from ~19% to ~62% hence I believe the TLS handshakes in this test (12 nodes × 5 iterations) are heavy enough to cause IO contention with other tests on the same runner.

@zuiderkwast
Copy link
Copy Markdown
Contributor

Yeah, I agree this test is flaky. 😄

It was failing very often before #2811 (because the wrong replica won the election) but that fix improved it. Also as you mentioned I have tried to make it better in #3424 with longer wait for psync, but that was a different issue.

Sure, we can skip it for TLS too, but I would like to understand the root cause. Are you sure it's the TLS handshakes? 12 nodes × 5 iterations doesn't seem that much, compared to all the client connections in the TLS tests.

For I/O starvation, I think tests that loop 10000 times and send commands is heavier. I've tried to fix them like this: #3452

@zuiderkwast zuiderkwast added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Apr 7, 2026
Copy link
Copy Markdown
Contributor

@sarthakaggarwal97 sarthakaggarwal97 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is fine to skip this test in TLS. Thanks!

@nmvk
Copy link
Copy Markdown
Contributor Author

nmvk commented Apr 8, 2026

@zuiderkwast My original thought was based on the assumption of correlation. However upon digging a little more, I lean to believe this test suite is always on edge with 4 vCPU and 16 workers, oversubscription is high and TLS as a whole makes this more tricky.

I likely think addition of this test changed the order of test run perhaps leading to failures (I don't have any proof :( ). tls:skip still seems apt here though might not solve the failure.

@zuiderkwast zuiderkwast merged commit e6bbd4e into valkey-io:unstable Apr 8, 2026
98 of 100 checks passed
@github-project-automation github-project-automation Bot moved this to To be backported in Valkey 9.1 Apr 8, 2026
sarthakaggarwal97 pushed a commit to sarthakaggarwal97/valkey that referenced this pull request Apr 16, 2026
The faster-failover.tcl test starts 12 cluster nodes 5 times with TLS,
which is CPU intensive due to the volume of TLS handshakes. Running
alongside other tests on CI runners causes random I/O timeouts in
unrelated tests.

Tagging it as `tls:skip`

Signed-off-by: nmvk <r@nmvk.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP)

Projects

Status: To be backported

Development

Successfully merging this pull request may close these issues.

5 participants