fix: reset _authentication_state to none #30409

Open
AldoFusterTurpin wants to merge 2 commits into redpanda-data:dev from
AldoFusterTurpin:aldo/_authentication_state-to-none

@AldoFusterTurpin AldoFusterTurpin commented May 7, 2026

I am not a C++ developer (good memories from college 😆 ), but I was investigating this issue I faced and then I discovered this...

When the schema registry client tries to connect to a broker, it calls connect_with_retries() → connect() → do_connect(), which opens a TCP+TLS connection and stores it in _fd. Then maybe_authenticate() is called and throws sasl_authentication_failed.

State after auth failure:

_fd = non-null (TLS socket open, TCP alive) → is_valid() = true
_authentication_state = in_progress (never reset to none)

Every subsequent retry (every ~20s):

  • maybe_initialize_connection() checks: is_valid()==true AND needs_authentication()==false (state is in_progress, not none)
  • Returns early. The broker is stuck: it thinks it has a valid connection that doesn't need authentication, but authentication never succeeded.

The TCP connection eventually drops (keepalive timeout). Then is_valid() = false, connect_with_retries() proceeds, allocates a new connection, auth fails again, and the cycle repeats. The broker never successfully authenticates.
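
The stuck state described above can be reproduced in a small standalone model. This is only a sketch: model_broker, its fields, and attempts_after_retries() are hypothetical names, and needs_authentication() is assumed to be true only when the state is none, as the description implies. The real remote_broker likely checks more than this.

```cpp
#include <stdexcept>

// Hypothetical, simplified stand-in for the broker's connection state.
enum class auth_state { none, in_progress, authenticated };

struct model_broker {
    bool fd_open = false;                 // stands in for _fd being non-null
    auth_state state = auth_state::none;  // stands in for _authentication_state
    bool credentials_ok = false;          // wrong credentials by default
    bool reset_on_failure = false;        // toggles the proposed fix
    int auth_attempts = 0;                // how often authentication really ran

    bool is_valid() const { return fd_open; }

    // Assumption: authentication is only considered "needed" in state none.
    bool needs_authentication() const { return state == auth_state::none; }

    void maybe_authenticate() {
        state = auth_state::in_progress;
        ++auth_attempts;
        if (!credentials_ok) {
            if (reset_on_failure) {
                state = auth_state::none;  // the one-line fix
            }
            throw std::runtime_error("sasl_authentication_failed");
        }
        state = auth_state::authenticated;
    }

    void maybe_initialize_connection() {
        if (is_valid() && !needs_authentication()) {
            return;  // the early return that hides the failed authentication
        }
        fd_open = true;  // stands in for do_connect()
        maybe_authenticate();
    }
};

// Runs `retries` connection attempts with bad credentials and reports how
// many of them actually reached maybe_authenticate().
int attempts_after_retries(bool with_fix, int retries) {
    model_broker b;
    b.reset_on_failure = with_fix;
    for (int i = 0; i < retries; ++i) {
        try {
            b.maybe_initialize_connection();
        } catch (const std::runtime_error&) {
            // a retry loop would log and wait here
        }
    }
    return b.auth_attempts;
}
```

In this model, three retries without the reset reach maybe_authenticate() only once, because the second and third attempts take the early return. With the reset, all three attempts authenticate again.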

What is missing: the catch block in maybe_authenticate() needs to reset _authentication_state to none so that the next retry actually attempts to reconnect and re-authenticate:

    } catch (...) {
        _authentication_state = auth_state::none;  // allow retry to reconnect and re-authenticate
        vlog(_logger.warn, "Authentication error - {}", std::current_exception());
        throw;
    }

TL;DR: with this change, the next call to maybe_initialize_connection() sees needs_authentication()==true and retries the full reconnect+authenticate sequence instead of returning early.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v26.1.x
  • v25.3.x
  • v25.2.x

Release Notes

Bug Fixes

  • Fix Kafka client broker getting stuck in in_progress authentication state after a SASL authentication failure, preventing subsequent retries from re-attempting authentication.

Copilot AI review requested due to automatic review settings May 7, 2026 16:10
CLAassistant commented May 7, 2026

CLA assistant check
All committers have signed the CLA.

Contributor

Copilot AI left a comment

Pull request overview

Fixes a Kafka client broker-connection edge case where a SASL authentication failure could leave remote_broker stuck in an in_progress authentication state, preventing subsequent retries from re-attempting authentication.

Changes:

  • Reset _authentication_state back to auth_state::none when maybe_authenticate() throws, enabling future authentication attempts on retry.

Comment thread src/v/kafka/client/broker.cc
@WillemKauf WillemKauf requested review from dotnwat and mmaslankaprv May 7, 2026 21:06
@dotnwat dotnwat requested review from oleiman and pgellert May 7, 2026 23:19
Member

dotnwat commented May 7, 2026

Thanks!

@AldoFusterTurpin AldoFusterTurpin force-pushed the aldo/_authentication_state-to-none branch 3 times, most recently from ea5dcfd to f73b1e2 Compare May 9, 2026 10:11
So the next call to maybe_initialize_connection()
sees needs_authentication()==true and retries the full
reconnect+authenticate sequence instead of returning early
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

Comment thread src/v/kafka/client/test/reconnect.cc
Comment thread src/v/kafka/client/test/reconnect.cc Outdated
Comment thread src/v/kafka/client/test/reconnect.cc Outdated
Comment thread src/v/kafka/client/test/reconnect.cc Outdated
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment thread src/v/kafka/client/test/reconnect.cc
Comment thread src/v/kafka/client/test/reconnect.cc
@AldoFusterTurpin AldoFusterTurpin force-pushed the aldo/_authentication_state-to-none branch 2 times, most recently from b32eb45 to 0a3ddf8 Compare May 9, 2026 15:14
1. Connect successfully first so the cluster is running and brokers exist
2. Swap in wrong credentials and force a reconnect via restart(): this causes the existing broker instance to reconnect and hit maybe_authenticate() with bad credentials
3. After the expected failure, restore correct credentials
4. Dispatch again. This is the step that exercises the fix: the same broker instance calls maybe_initialize_connection() -> maybe_authenticate(). With the fix, _authentication_state is none, so it retries; without the fix, it is in_progress, so it skips auth silently and fails
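
The four steps above can be walked through against a toy model. This is a sketch only, not the code in reconnect.cc: fake_broker, dispatch(), and recovers_after_bad_credentials() are hypothetical names, and restart() is assumed to drop the connection and clear the authentication state.

```cpp
enum class auth_state { none, in_progress, authenticated };

// Hypothetical stand-in for a broker; not the real remote_broker class.
struct fake_broker {
    bool connected = false;
    auth_state state = auth_state::none;
    bool good_credentials = true;
    bool with_fix = false;

    // Assumption: restart() drops the connection and forgets prior auth.
    void restart() {
        connected = false;
        state = auth_state::none;
    }

    // Models maybe_initialize_connection() followed by a request.
    // Returns true only if the request ran on an authenticated connection.
    bool dispatch() {
        const bool skip = connected && state != auth_state::none;
        if (!skip) {
            connected = true;
            state = auth_state::in_progress;
            if (!good_credentials) {
                if (with_fix) {
                    state = auth_state::none;  // reset so a retry re-authenticates
                }
                return false;  // stands in for sasl_authentication_failed
            }
            state = auth_state::authenticated;
        }
        return state == auth_state::authenticated;
    }
};

bool recovers_after_bad_credentials(bool with_fix) {
    fake_broker b;
    b.with_fix = with_fix;

    if (!b.dispatch()) return false;  // 1. connect successfully first

    b.good_credentials = false;       // 2. wrong credentials, forced reconnect
    b.restart();
    if (b.dispatch()) return false;   //    the failure is expected here

    b.good_credentials = true;        // 3. restore correct credentials
    return b.dispatch();              // 4. dispatch again on the same broker
}
```

In this model, recovers_after_bad_credentials(true) returns true and recovers_after_bad_credentials(false) returns false, matching the behaviour step 4 describes with and without the fix.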
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

4 participants