Skip to content

Fix TLS infinite busy loop when write/read handlers are removed#3510

Open
yairgott wants to merge 5 commits intovalkey-io:unstablefrom
yairgott:tls_want_wait_fix
Open

Fix TLS infinite busy loop when write/read handlers are removed#3510
yairgott wants to merge 5 commits intovalkey-io:unstablefrom
yairgott:tls_want_wait_fix

Conversation

@yairgott
Copy link
Copy Markdown
Contributor

@yairgott yairgott commented Apr 15, 2026

In tls.c, if OpenSSL hits SSL_ERROR_WANT_WRITE while trying to trigger a read operation, the internal layer tags conn->flags with TLS_CONN_FLAG_READ_WANT_WRITE.

If the application layer subsequently removes the read handler (i.e., sets read_handler to NULL), the original framework left TLS_CONN_FLAG_READ_WANT_WRITE intact. Thus:

updateSSLEvent sees TLS_CONN_FLAG_READ_WANT_WRITE and keeps AE_WRITABLE active on the file descriptor forever.
When the kernel fires the writable event, tlsEventHandler checks call_read, which passes because TLS_CONN_FLAG_READ_WANT_WRITE is set.
It tries to invoke the read_handler (which is now NULL), returns safely, and loops indefinitely inside aeProcessEvents consuming 100% CPU!
By guaranteeing these internal tags reset whenever high-level handlers clear, the application correctly drops AE_WRITABLE/AE_READABLE state!

Added a reproduction test in src/unit/test_tls.cpp.

Signed-off-by: Yair Gottdenker <yairg@google.com>
@yairgott yairgott marked this pull request as draft April 15, 2026 02:23
Yair Gottdenker added 2 commits April 15, 2026 02:30
Signed-off-by: Yair Gottdenker <yairg@google.com>
…ls_want_wait_fix

Signed-off-by: Yair Gottdenker <yairg@google.com>
@yairgott yairgott marked this pull request as ready for review April 15, 2026 02:53
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.62%. Comparing base (baa5a48) to head (f27301b).
⚠️ Report is 9 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #3510      +/-   ##
============================================
+ Coverage     76.48%   76.62%   +0.14%     
============================================
  Files           159      159              
  Lines         79815    79927     +112     
============================================
+ Hits          61043    61244     +201     
+ Misses        18772    18683      -89     
Files with missing lines Coverage Δ
src/tls.c 17.64% <ø> (ø)

... and 31 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Comment thread src/tls.c Outdated
static int connTLSSetWriteHandler(connection *conn, ConnectionCallbackFunc func, int barrier) {
tls_connection *tls_conn = (tls_connection *)conn;
conn->write_handler = func;
if (!func) tls_conn->flags &= ~TLS_CONN_FLAG_WRITE_WANT_READ;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upon protectClient and unprotectClient flag state is lost, This can result in incorrect wiring in https://github.com/valkey-io/valkey/blob/unstable/src/tls.c#L1239

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, can you elaborate on this? FWIW, unless the flag is unset when the callback is set to NULL, the main thread enters a busy wait loop.

I think the TLS code code requires some changes:

  1. It would be more robust if conn->write_handler/conn->read_handler are set only in updateSSLEvent based on the conn->flags otherwise there is risk of a brain split between the connection flags and the callbacks.
  2. OpenSSL might want to both read and write. Today the code support want to either read or write.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, note that there is currently an inconsistency between TLS and non-TLS connections regarding epoll registration. For non-TLS connections, a NULL handler leads to removal from epoll. See connSocketSetWriteHandler. This change more closely aligns the two.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/valkey-io/valkey/blob/unstable/src/tls.c#L1239

int need_read = conn->c.read_handler || (conn->flags & TLS_CONN_FLAG_WRITE_WANT_READ);

Upon clearing TLS_CONN_FLAG_WRITE_WANT_READ updateSSLEvents cannot rewire AE_WRITABLE/AE_READABLE as the WANT state is lost.

My approach would be to move this check from handler to updateSSLEvents.

int need_read = conn->c.read_handler ||                                                                                                                                
                  (conn->c.write_handler &&                                                                                                                              
                   (conn->flags & TLS_CONN_FLAG_WRITE_WANT_READ));                                                                                                       
                                                                                                                                                                         
  int need_write = conn->c.write_handler ||                                                                                                                              
                   (conn->c.read_handler &&                                                                                                                              
                    (conn->flags & TLS_CONN_FLAG_READ_WANT_WRITE)); 

This would fix the loop without rewiring issue as WANT state is not cleared?

OpenSSL might want to both read and write. Today the code support want to either read or write.

Agreed on this code handles it already.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Your advise is very reasonable and I've incorporated it along with another change. PTAL.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Yair Gottdenker added 2 commits April 20, 2026 22:11
Signed-off-by: Yair Gottdenker <yairg@google.com>
Signed-off-by: Yair Gottdenker <yairg@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants