
fix: stop retrying 429s caused by Fabric capacity limits#2539

Merged
BrendanWalsh merged 3 commits into master from brwals/fix-capacity-429-no-retry
Apr 3, 2026

Conversation

@BrendanWalsh
Collaborator

Problem

Fabric returns HTTP 429 with body code "CapacityLimitExceeded" when serverless capacity is exhausted. Unlike transient OpenAI rate limits, these are not retryable — the capacity is genuinely full and retrying just causes infinite retry loops and hangs.

The current sendWithRetries in HTTPClients.scala treats all 429s identically, entering an unlimited exponential backoff retry loop regardless of the error type.

Fix

On receiving a 429, the response body is now inspected for the CapacityLimitExceeded code:

  • Capacity-exceeded 429: returned immediately to the caller (no retry)
  • Rate-limit 429: retried with exponential backoff as before (no behavior change)

The response entity is buffered via BufferedHttpEntity before inspection so callers can still read the body after sendWithRetries returns.
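In outline, the decision made on each 429 can be sketched as follows (a simplified sketch, not the actual HTTPClients.scala code; the object and method names are illustrative, and the real implementation reads the body from a buffered Apache HttpClient entity rather than taking it as a String):

```scala
// Simplified sketch of the new 429 classification (illustrative names).
// The real sendWithRetries reads the body from a BufferedHttpEntity;
// here the body is taken as a plain String for clarity.
object RetryClassifier {
  // A 429 whose body reports CapacityLimitExceeded is terminal:
  // Fabric capacity is genuinely full, so backing off cannot help.
  def isNonRetryable429(statusCode: Int, body: String): Boolean =
    statusCode == 429 && body.contains("CapacityLimitExceeded")
}
```

Under this sketch, a capacity-exceeded 429 short-circuits the backoff loop, while any other status (or a 429 with a different body) falls through to the existing retry path.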

Tests

3 new regression tests in VerifySendWithRetries:

  • 429 with CapacityLimitExceeded body is not retried: single request, 429 returned immediately
  • 429 with CapacityLimitExceeded ignores Retry-After header: capacity error takes precedence over Retry-After
  • 429 with non-capacity error body still retries normally: non-capacity 429 bodies do not break existing retry behavior

All 12 tests pass (9 existing + 3 new).

Copilot AI review requested due to automatic review settings April 3, 2026 03:20
@github-actions

github-actions bot commented Apr 3, 2026

Hey @BrendanWalsh 👋!
Thank you so much for contributing to our repository 🙌.
Someone from SynapseML Team will be reviewing this pull request soon.

We use semantic commit messages to streamline the release process.
Before your pull request can be merged, you should make sure your first commit and PR title start with a semantic prefix.
This helps us to create release messages and credit you for your hard work!

Examples of commit messages with semantic prefixes:

  • fix: Fix LightGBM crashes with empty partitions
  • feat: Make HTTP on Spark back-offs configurable
  • docs: Update Spark Serving usage
  • build: Add codecov support
  • perf: improve LightGBM memory usage
  • refactor: make python code generation rely on classes
  • style: Remove nulls from CNTKModel
  • test: Add test coverage for CNTKModel

To test your commit locally, please follow our guide on building from source.
Check out the developer guide for additional guidance on testing your change.

@github-actions

github-actions bot commented Apr 3, 2026

Dependency Review

✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found.

Snapshot Warnings

⚠️: No snapshots were found for the head SHA 62a486e.
Ensure that dependencies are being submitted on PR branches and consider enabling retry-on-snapshot-warnings. See the documentation for more information and troubleshooting advice.

Scanned Files

None

Contributor

Copilot AI left a comment


Pull request overview

This PR updates SynapseML’s HTTP retry logic to avoid infinite retry loops when Fabric serverless capacity is exhausted by treating 429 responses with CapacityLimitExceeded as non-retryable, while preserving existing retry behavior for other 429 scenarios.

Changes:

  • Inspect 429 response bodies for CapacityLimitExceeded and return the 429 immediately (no retry).
  • Buffer 429 response entities so the body can be inspected without preventing callers from reading it later.
  • Add regression tests covering capacity-exceeded vs non-capacity 429 behavior (including Retry-After precedence).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

  • core/src/main/scala/com/microsoft/azure/synapse/ml/io/http/HTTPClients.scala: adds 429 body inspection + buffering to short-circuit retries for Fabric capacity-limit errors.
  • core/src/test/scala/com/microsoft/azure/synapse/ml/io/split1/VerifySendWithRetries.scala: adds tests validating non-retry behavior for capacity-limit 429s and unchanged retries for other 429s.

Comment on lines +108 to +115
// Buffer the response entity so the body can be inspected
// and still returned to the caller if we don't retry
if (response.getEntity != null) {
  response.setEntity(new BufferedHttpEntity(response.getEntity))
}
val bodyStr = Option(response.getEntity)
  .flatMap(e => Try(IOUtils.toString(e.getContent, "UTF-8")).toOption)
  .getOrElse("")

Copilot AI Apr 3, 2026


BufferedHttpEntity will buffer the entire 429 response payload in memory. If a service returns a large body (or misbehaves), this can create avoidable memory pressure during retries. Consider guarding this with a size cap (e.g., based on Content-Length) or reading only a bounded prefix sufficient to detect the error code, and skip buffering/inspection when the entity is too large.

Suggested change

- // Buffer the response entity so the body can be inspected
- // and still returned to the caller if we don't retry
- if (response.getEntity != null) {
-   response.setEntity(new BufferedHttpEntity(response.getEntity))
- }
- val bodyStr = Option(response.getEntity)
-   .flatMap(e => Try(IOUtils.toString(e.getContent, "UTF-8")).toOption)
-   .getOrElse("")
+ val max429InspectionBytes = 1024 * 1024L
+ val bodyStr = Option(response.getEntity).flatMap { entity =>
+   val contentLength = entity.getContentLength
+   if (contentLength >= 0 && contentLength <= max429InspectionBytes) {
+     // Buffer only small, known-size entities so the body can be
+     // inspected and still returned to the caller if we don't retry.
+     response.setEntity(new BufferedHttpEntity(entity))
+     Option(response.getEntity)
+       .flatMap(e => Try(IOUtils.toString(e.getContent, "UTF-8")).toOption)
+   } else {
+     logWarning(
+       s"Skipping 429 response body inspection for ${request.getURI} " +
+       s"because Content-Length=$contentLength exceeds $max429InspectionBytes bytes " +
+       s"or is unknown")
+     None
+   }
+ }.getOrElse("")

Comment on lines +113 to +115
val bodyStr = Option(response.getEntity)
  .flatMap(e => Try(IOUtils.toString(e.getContent, "UTF-8")).toOption)
  .getOrElse("")

Copilot AI Apr 3, 2026


The 429 body inspection reads from e.getContent but does not close the InputStream. Even with a BufferedHttpEntity (likely a ByteArrayInputStream), leaving streams unclosed is a resource-leak pattern. Wrap the read in a construct that always closes the stream (e.g., Using.resource or explicit try/finally).

Comment on lines +116 to +121
if (bodyStr.contains("CapacityLimitExceeded")) {
  // Fabric capacity-exceeded 429s are NOT transient rate limits —
  // retrying will not help and causes hangs
  logWarning(s"Capacity limit exceeded (non-retryable 429) on ${request.getURI}: $bodyStr")
  true
} else {

Copilot AI Apr 3, 2026


bodyStr.contains("CapacityLimitExceeded") is a brittle detector and could produce false positives/negatives (e.g., the substring appearing in a message, different JSON shapes, casing). Since the codebase already uses JSON parsing (e.g., spray-json in core), consider parsing the response JSON and checking error.code == "CapacityLimitExceeded" explicitly before deciding not to retry.
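A dependency-free sketch of the structured check being suggested here (the review comment proposes spray-json, which the codebase already uses; this stand-in extracts the error code with a regex instead so the shape of the check is visible without the library, and the `{"error":{"code":...}}` payload shape is an assumption):

```scala
object ErrorCodeCheck {
  // Stand-in for real JSON parsing (the review suggests spray-json):
  // pull out the value of a "code" field rather than substring-matching
  // the whole body, so "CapacityLimitExceeded" appearing in a message
  // field no longer counts as a capacity error.
  private val CodePattern = """"code"\s*:\s*"([^"]+)"""".r

  def errorCode(body: String): Option[String] =
    CodePattern.findFirstMatchIn(body).map(_.group(1))

  def isCapacityExceeded(body: String): Boolean =
    errorCode(body).contains("CapacityLimitExceeded")
}
```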

if (bodyStr.contains("CapacityLimitExceeded")) {
  // Fabric capacity-exceeded 429s are NOT transient rate limits —
  // retrying will not help and causes hangs
  logWarning(s"Capacity limit exceeded (non-retryable 429) on ${request.getURI}: $bodyStr")

Copilot AI Apr 3, 2026


This logs the full 429 response body ($bodyStr) at warning level. Error bodies can contain sensitive identifiers or be unexpectedly large, which can lead to data exposure and log bloat. Prefer logging only the extracted error code (and possibly a truncated message/first N chars) rather than the entire payload.

Suggested change

- logWarning(s"Capacity limit exceeded (non-retryable 429) on ${request.getURI}: $bodyStr")
+ logWarning(s"Capacity limit exceeded (non-retryable 429, code=CapacityLimitExceeded) " +
+   s"on ${request.getURI}")
Comment on lines +292 to +302
val start = System.currentTimeMillis()
val response = HandlingUtils.sendWithRetries(
  client, request, Array(100, 100, 100))
val elapsed = System.currentTimeMillis() - start
val code = response.getStatusLine.getStatusCode
response.close()
client.close()

assert(code === 429, "Capacity-exceeded 429 should be returned immediately, not retried")
assert(requestCount.get() === 1, "Should not retry on CapacityLimitExceeded")
assert(elapsed < 1000, s"Should return immediately, took ${elapsed}ms")

Copilot AI Apr 3, 2026


The elapsed-time assertions (e.g., < 1000ms) can be flaky under CI contention/GC pauses since they depend on wall-clock timing rather than behavior. Consider removing the timing check (the requestCount assertion already validates no retry) or using a more forgiving bound to reduce intermittent failures.

- Guard BufferedHttpEntity with Content-Length cap (1MB) to avoid
  buffering unexpectedly large 429 response payloads
- Remove full response body from log warning to prevent data exposure
- Remove tight elapsed-time assertion from capacity test (requestCount
  already validates no-retry); widen Retry-After test bound to 4s

The Content-Length guard was skipping body inspection when
getContentLength() returns -1 (chunked/unknown encoding). This meant
capacity-exceeded 429s sent without Content-Length would still retry
forever. Changed guard to only skip when Content-Length is known AND
exceeds the 1MB cap — unknown-length responses are now inspected.

Added regression test for chunked capacity-exceeded 429.
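The corrected guard described in this commit can be sketched as a single predicate (a sketch with illustrative names; the 1 MB cap matches the earlier commit):

```scala
object InspectionGuard {
  val Max429InspectionBytes: Long = 1024L * 1024L

  // Inspect when the length is unknown (-1, e.g. chunked transfer
  // encoding) OR known and within the cap; skip only when it is known
  // to exceed the cap. The earlier guard also skipped unknown lengths,
  // which let chunked capacity-exceeded 429s retry forever.
  def shouldInspect(contentLength: Long): Boolean =
    contentLength < 0 || contentLength <= Max429InspectionBytes
}
```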
@BrendanWalsh
Collaborator Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@codecov-commenter

codecov-commenter commented Apr 3, 2026

Codecov Report

❌ Patch coverage is 71.42857% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.66%. Comparing base (0a2b0b4) to head (62a486e).

Files with missing lines                                  Patch %   Lines
...crosoft/azure/synapse/ml/io/http/HTTPClients.scala     71.42%    4 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##           master    #2539   +/-   ##
=======================================
  Coverage   84.66%   84.66%           
=======================================
  Files         335      335           
  Lines       17747    17753    +6     
  Branches     1595     1615   +20     
=======================================
+ Hits        15025    15031    +6     
  Misses       2722     2722           

☔ View full report in Codecov by Sentry.

@BrendanWalsh BrendanWalsh merged commit ddb25d3 into master Apr 3, 2026
78 checks passed
@BrendanWalsh BrendanWalsh deleted the brwals/fix-capacity-429-no-retry branch April 3, 2026 19:12
