Fix connect attempt retries#13102

Open
masaori335 wants to merge 1 commit into apache:master from masaori335:asf-master-fix-connect-attempt-retry

Conversation

@masaori335
Contributor

@masaori335 masaori335 commented Apr 20, 2026

Summary

We have three configs for connect attempt retries. However, prior to this change, none of them was implemented correctly. This PR makes the retry path honor the HostDBInfo state across the board.

  • connect_attempts_max_retries
  • connect_attempts_max_retries_down_server
  • connect_attempts_rr_retries

Changes

  • New helper HttpTransact::origin_server_connect_attempts_max_retries(State*) returns the retry limit based on the active HostDBInfo::State:

    • UP → connect_attempts_max_retries
    • SUSPECT → connect_attempts_max_retries_down_server
    • DOWN → 0
  • HttpTransact::handle_response_from_server is refactored and retry logic is fixed.

  • Fix round-robin logic in ResolveInfo::select_next_rr(now, fail_window)

  • HttpSM::mark_host_failure now uses origin_server_connect_attempts_max_retries(s) + 1 as the DOWN threshold (was incorrectly using connect_attempts_rr_retries).

  • HttpSM::do_hostdb_update_if_necessary records failure time via ts_clock::now() instead of client_request_time so multiple connect attempts in one transaction get distinct timestamps.
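The state-to-limit mapping described in the changes above can be sketched roughly as follows. This is a simplified stand-in, not the PR's actual implementation: `HostState` and `RetryConfig` are hypothetical placeholders for ATS's `HostDBInfo::State` and the `proxy.config.http.connect_attempts_*` records.

```cpp
#include <cassert>

// Hedged sketch of the state-aware retry-limit selection. HostState and
// RetryConfig are simplified stand-ins for HostDBInfo::State and the
// connect_attempts_* configuration records.
enum class HostState { UP, SUSPECT, DOWN };

struct RetryConfig {
  int max_retries;             // connect_attempts_max_retries
  int max_retries_down_server; // connect_attempts_max_retries_down_server
};

int retry_limit_for(HostState state, const RetryConfig &conf)
{
  switch (state) {
  case HostState::UP:
    return conf.max_retries;
  case HostState::SUSPECT:
    return conf.max_retries_down_server;
  case HostState::DOWN:
  default:
    return 0; // a DOWN host gets no retries until the fail window expires
  }
}
```

The point of the helper is that the retry budget follows the host's observed health rather than a single global knob.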

Tests

  1. New connect_attempts_single_max_retries.replay.yaml exercises connect_attempts_max_retries and connect_attempts_max_retries_down_server for a single DNS record (the non-round-robin case).
  2. Updated connect_attempts_rr_retries.replay.yaml and gold files to reflect the corrected RR + SUSPECT behavior.
  3. Updated connect_attempts_rr_max_retries gold to match the new state-aware fail-count threshold.

Dependency

This PR is in draft until the PRs below are merged.

@masaori335 masaori335 added this to the 11.0.0 milestone Apr 20, 2026
@masaori335 masaori335 self-assigned this Apr 20, 2026
@masaori335 masaori335 removed this from ATS v10.2.x Apr 20, 2026
@masaori335 masaori335 force-pushed the asf-master-fix-connect-attempt-retry branch 2 times, most recently from d9fcfa6 to c033826 Compare April 22, 2026 04:48
@masaori335 masaori335 force-pushed the asf-master-fix-connect-attempt-retry branch from c033826 to ab4427d Compare April 23, 2026 01:19
@masaori335 masaori335 marked this pull request as ready for review April 23, 2026 01:20
@masaori335
Contributor Author

Now this PR is ready. I removed the dependency on #13083; we can do it independently.

@masaori335 masaori335 added the Bug label Apr 23, 2026
@bryancall bryancall requested review from bneradt and Copilot April 27, 2026 22:27
Contributor

Copilot AI left a comment


Pull request overview

This PR fixes origin connect-attempt retry behavior so it consistently honors the active HostDBInfo health state (UP / SUSPECT / DOWN), and updates AuTest replay coverage to validate the corrected single-host and round-robin behavior.

Changes:

  • Add HttpTransact::origin_server_connect_attempts_max_retries(State*) and refactor server connect failure handling to use state-aware retry limits.
  • Fix round-robin selection to skip DOWN targets while allowing SUSPECT targets (ResolveInfo::select_next_rr(now, fail_window)), and update HostDB failure timestamping.
  • Add/adjust gold tests to cover single-record and RR retry behavior and updated failure-count thresholds.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
src/proxy/http/HttpTransact.cc Adds state-aware retry helper; refactors retry path and RR switching logic.
include/proxy/http/HttpTransact.h Declares new helper and updates retry helper signature.
src/proxy/http/HttpSM.cc Updates HostDB failure timestamping (ts_clock::now()), adjusts DOWN threshold logic.
src/proxy/http/HttpConfig.cc Adds a config Warning related to RR + down-server retry configuration.
src/iocore/hostdb/HostDB.cc Updates RR selection to be time/window aware and skip DOWN targets.
include/iocore/hostdb/HostDBProcessor.h Updates ResolveInfo::select_next_rr signature to accept (now, fail_window).
tests/gold_tests/dns/replay/connect_attempts_single_max_retries.replay.yaml New replay to cover single-DNS-record retry/down-server behavior.
tests/gold_tests/dns/replay/connect_attempts_rr_retries.replay.yaml Updates RR retry scenario to match corrected behavior.
tests/gold_tests/dns/gold/connect_attempts_single_max_retries_error_log.gold New gold output for the single-record replay.
tests/gold_tests/dns/gold/connect_attempts_rr_retries_error_log.gold Updates gold output to reflect corrected RR behavior.
tests/gold_tests/dns/gold/connect_attempts_rr_max_retries_error_log.gold Updates gold output for new fail-count threshold.
tests/gold_tests/dns/connect_attempts.test.py Registers the new replay test.
tests/gold_tests/autest-site/verifier_client.test.ext Adds poll_timeout support to verifier-client process config for replays.

Comment on lines +586 to +593
switch (active->state(ts_clock::now(), s->txn_conf->down_server_timeout)) {
case HostDBInfo::State::UP:
return s->txn_conf->connect_attempts_max_retries;
case HostDBInfo::State::DOWN:
return 0;
case HostDBInfo::State::SUSPECT:
return s->txn_conf->connect_attempts_max_retries_down_server;
default:

Copilot AI Apr 27, 2026


origin_server_connect_attempts_max_retries() returns uint8_t, but the underlying configs (txn_conf->connect_attempts_max_retries*) are MgmtInt. Implicit conversion can truncate/wrap values > 255, silently changing retry behavior. Consider clamping to [0, UINT8_MAX] (and optionally emitting a Warning when clamped), or changing the helper to return a wider type and clamping only at the HostDBInfo API boundary.
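The clamping suggested in this comment could look roughly like the sketch below. It assumes `MgmtInt` is a wide signed integer (as in ATS) and is only an illustration of the suggested fix, not the project's code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hedged sketch: narrowing a wide signed config value to uint8_t without a
// clamp would silently wrap values > 255 (e.g. 300 -> 44).
using MgmtInt = int64_t;

uint8_t clamp_retries(MgmtInt configured)
{
  // Clamp into [0, 255] before the narrowing cast so out-of-range
  // configuration values saturate instead of wrapping.
  return static_cast<uint8_t>(std::clamp<MgmtInt>(configured, 0, UINT8_MAX));
}
```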

Comment on lines +4010 to +4017
unsigned max_connect_retries = s->txn_conf->connect_attempts_max_retries;
TxnDbg(dbg_ctl_http_trans, "max_connect_retries: %d s->current.retry_attempts: %d", max_connect_retries,
s->current.retry_attempts.get());

if (is_request_retryable(s) && s->current.retry_attempts.get() < max_connect_retries &&
!HttpTransact::is_response_valid(s, &s->hdr_info.server_response)) {
// If this is a round robin DNS entry & we've tried the configured
// number of times, we should try another node
if (ResolveInfo::OS_Addr::TRY_CLIENT == s->dns_info.os_addr_style) {
// attempt was based on client supplied server address. Try again using HostDB.
// Allow DNS attempt
s->dns_info.resolved_p = false;
// See if we can get data from HostDB for this.
s->dns_info.os_addr_style = ResolveInfo::OS_Addr::TRY_HOSTDB;
// Force host resolution to have the same family as the client.
// Because this is a transparent connection, we can't switch address
// families - that is locked in by the client source address.
ats_force_order_by_family(s->current.server->dst_addr.family(), s->my_txn_conf().host_res_data.order);
return CallOSDNSLookup(s);
} else if (ResolveInfo::OS_Addr::USE_API == s->dns_info.os_addr_style && !s->api_server_addr_set_retried) {
// Plugin set the server address via TSHttpTxnServerAddrSet(). Clear resolution
// state to allow the OS_DNS hook to be called again, giving the plugin a chance
// to set a different server address for retry (issue #12611).
// Only retry once to avoid infinite loops if the plugin keeps setting failing addresses.
s->api_server_addr_set_retried = true;
s->dns_info.resolved_p = false;
s->dns_info.os_addr_style = ResolveInfo::OS_Addr::TRY_DEFAULT;
// Clear the server request so it can be rebuilt for the new destination
s->hdr_info.server_request.destroy();
TxnDbg(dbg_ctl_http_trans, "Retrying with plugin-set address, returning to OS_DNS hook");
return CallOSDNSLookup(s);
} else {
if ((s->txn_conf->connect_attempts_rr_retries > 0) &&
((s->current.retry_attempts.get() + 1) % s->txn_conf->connect_attempts_rr_retries == 0)) {
s->dns_info.select_next_rr();
}
retry_server_connection_not_open(s, s->current.state, max_connect_retries);
TxnDbg(dbg_ctl_http_trans, "Error. Retrying...");
s->next_action = how_to_open_connection(s);
}
} else {
// Bail out if the request is not retryable, the global retry cap is reached, or we already have a usable response.
if (!is_request_retryable(s) || s->current.retry_attempts.get() >= max_connect_retries ||
HttpTransact::is_response_valid(s, &s->hdr_info.server_response)) {
TxnDbg(dbg_ctl_http_trans, "Error. No more retries. %d/%d", s->current.retry_attempts.get(), max_connect_retries);

Copilot AI Apr 27, 2026


The retry cap here is initialized from connect_attempts_max_retries unconditionally, but the PR description says retry limits should be state-aware (UP/SUSPECT/DOWN). Using the UP config for the early bailout check can allow extra retries (or plugin/client-address re-resolution retries) beyond the intended per-host cap when the active HostDBInfo is SUSPECT/DOWN. Consider initializing max_connect_retries using origin_server_connect_attempts_max_retries(s) (with a deliberate fallback for non-HostDB targets, if needed).

Comment on lines +4059 to +4073
if (!s->dns_info.select_next_rr(ts_clock::now(), s->txn_conf->down_server_timeout)) {
TxnDbg(dbg_ctl_http_trans, "No round-robin targets available. Giving up");
SET_VIA_STRING(VIA_DETAIL_SERVER_CONNECT, VIA_DETAIL_SERVER_FAILURE);
handle_server_connection_not_open(s);
break;
}

// select_next_rr() only updates dns_info.active; change the dst_addr too.
s->dns_info.addr.assign(s->dns_info.active->data.ip);
s->server_info.dst_addr.assign(s->dns_info.active->data.ip, s->server_info.dst_addr.network_order_port());
if (dbg_ctl_http_trans.on()) {
ip_port_text_buffer addrbuf;
TxnDbg(dbg_ctl_http_trans, "switched to next round-robin upstream addr=%s",
ats_ip_nptop(&s->server_info.dst_addr.sa, addrbuf, sizeof(addrbuf)));
}

Copilot AI Apr 27, 2026


When connect_attempts_rr_retries triggers a round-robin switch, this treats select_next_rr()==false as “no targets available” and gives up. But select_next_rr() can also return false simply because there is no alternate non-DOWN target (e.g., other RR members are DOWN) even though retrying the current target is still allowed. This can prematurely stop retries. Consider continuing with the current active target when select_next_rr() returns false, and only giving up when the overall retry budget is exhausted / there are truly no viable targets.

Suggested change
if (!s->dns_info.select_next_rr(ts_clock::now(), s->txn_conf->down_server_timeout)) {
TxnDbg(dbg_ctl_http_trans, "No round-robin targets available. Giving up");
SET_VIA_STRING(VIA_DETAIL_SERVER_CONNECT, VIA_DETAIL_SERVER_FAILURE);
handle_server_connection_not_open(s);
break;
}
// select_next_rr() only updates dns_info.active; change the dst_addr too.
s->dns_info.addr.assign(s->dns_info.active->data.ip);
s->server_info.dst_addr.assign(s->dns_info.active->data.ip, s->server_info.dst_addr.network_order_port());
if (dbg_ctl_http_trans.on()) {
ip_port_text_buffer addrbuf;
TxnDbg(dbg_ctl_http_trans, "switched to next round-robin upstream addr=%s",
ats_ip_nptop(&s->server_info.dst_addr.sa, addrbuf, sizeof(addrbuf)));
}
if (s->dns_info.select_next_rr(ts_clock::now(), s->txn_conf->down_server_timeout)) {
// select_next_rr() only updates dns_info.active; change the dst_addr too.
s->dns_info.addr.assign(s->dns_info.active->data.ip);
s->server_info.dst_addr.assign(s->dns_info.active->data.ip, s->server_info.dst_addr.network_order_port());
if (dbg_ctl_http_trans.on()) {
ip_port_text_buffer addrbuf;
TxnDbg(dbg_ctl_http_trans, "switched to next round-robin upstream addr=%s",
ats_ip_nptop(&s->server_info.dst_addr.sa, addrbuf, sizeof(addrbuf)));
}
} else {
TxnDbg(dbg_ctl_http_trans, "No alternate round-robin target available, retrying current upstream");
}

Comment thread src/proxy/http/HttpSM.cc
Comment on lines +5980 to +5986
uint8_t max_connect_retries = HttpTransact::origin_server_connect_attempts_max_retries(&t_state);
ts_seconds fail_window = t_state.txn_conf->down_server_timeout;

// Mark the host DOWN only after every attempt has failed. `max_connect_retries` counts only "retries", so the total attempt
// budget is `max_connect_retries + 1` (the initial connect plus each retry).
auto [down, fail_count] = info->active->increment_fail_count(time_down, max_connect_retries + 1, fail_window);


Copilot AI Apr 27, 2026


max_connect_retries is a uint8_t and is used as max_connect_retries + 1 for the attempt budget. If max_connect_retries is 255 (or has already wrapped due to narrowing), + 1 overflows back to 0, which would make the down-threshold computation incorrect. Consider computing the attempt budget in a wider integer type and saturating/clamping to UINT8_MAX before passing it to increment_fail_count().
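The widen-then-saturate approach suggested here can be sketched as follows; this is an illustration of the suggested fix, not the PR's code.

```cpp
#include <cassert>
#include <cstdint>

// Hedged sketch: compute `max_connect_retries + 1` in a wider type so that
// the attempt budget cannot wrap to 0 when max_connect_retries == 255.
uint8_t attempt_budget(uint8_t max_connect_retries)
{
  unsigned budget = static_cast<unsigned>(max_connect_retries) + 1; // widen first
  return budget > UINT8_MAX ? UINT8_MAX : static_cast<uint8_t>(budget);
}
```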

Comment on lines +1352 to +1353
Warning("connect_attempts_max_retries_down_server=0 with round-robin enabled skips probing recovering (SUSPECT) origins; "
"set connect_attempts_max_retries_down_server >= 1 is recommended");

Copilot AI Apr 27, 2026


This Warning message is misleading: connect_attempts_max_retries_down_server=0 doesn’t inherently “skip probing” SUSPECT origins; it mainly reduces the retry budget (potentially to just a single probe attempt). Also the phrasing is grammatically off (“set … is recommended”) and it would be clearer to reference the full record name (proxy.config.http.connect_attempts_max_retries_down_server). Please update the message (and/or the logic) so it accurately reflects the actual behavior.

Suggested change
Warning("connect_attempts_max_retries_down_server=0 with round-robin enabled skips probing recovering (SUSPECT) origins; "
"set connect_attempts_max_retries_down_server >= 1 is recommended");
Warning("proxy.config.http.connect_attempts_max_retries_down_server=0 with round-robin enabled leaves no retry budget "
"for down or recovering (SUSPECT) origins beyond the initial attempt; setting "
"proxy.config.http.connect_attempts_max_retries_down_server >= 1 is recommended");

Comment on lines +60 to +61
# First request: ATS attempts 3 times (1 initial + 2 reties) to connect to 0.0.0.1.
# The server is marked as down and ATS returns 502.

Copilot AI Apr 27, 2026


Typo in comment: “reties” → “retries”.

Comment on lines +99 to +100
# ATS attempts 2 times (1 initial + 1 reties) to connect to 0.0.0.1.
# The server is marked as down again and ATS returns 502.

Copilot AI Apr 27, 2026


Typo in comment: “reties” → “retries”.

proxy-response:
status: 502

# Forth request: within down_server.cache_time=3s so the server is still marked down.

Copilot AI Apr 27, 2026


Typo in comment: “Forth request” → “Fourth request”.

Suggested change
# Forth request: within down_server.cache_time=3s so the server is still marked down.
# Fourth request: within down_server.cache_time=3s so the server is still marked down.

