Optimize queue configuration and implement rate limiting for API #640

Merged
lionello merged 5 commits into main from kevin/rate-limit-mastra
May 7, 2026

Conversation

@KevyVo
Contributor

@KevyVo KevyVo commented May 7, 2026

This pull request introduces robust rate limit handling and retry logic for all AI/LLM and embedding calls, improves reliability under quota constraints, and adjusts the demo seed data and worker configuration for better quota management.

It seems the Azure API returns its hard limits in the 429 response headers, which I took into account when making these changes. Since these changes only affect Azure, should we enable this logic only for that cloud?

Here is the data behind these settings:

Hard limits Azure returned                                                                                                                                                                                     
                                                                                                                                                                                                                 
  These all came from the 429 response headers in the worker logs (model deployment chat-default):                                                                                                              
   
  ┌────────────────────────────────────┬──────────────┬────────────────────────────────────────────────────────┐                                                                                                 
  │               Header               │    Value     │                        Meaning                         │
  ├────────────────────────────────────┼──────────────┼────────────────────────────────────────────────────────┤
  │ x-ratelimit-key                    │ chat-default │ Bucket name — worker + chat agent share this one       │
  ├────────────────────────────────────┼──────────────┼────────────────────────────────────────────────────────┤
  │ x-ratelimit-limit-requests         │ 6            │ Max requests per request-window                        │                                                                                                 
  ├────────────────────────────────────┼──────────────┼────────────────────────────────────────────────────────┤
  │ x-ratelimit-renewalperiod-requests │ 10 (sec)     │ Length of the request-window                           │                                                                                                 
  ├────────────────────────────────────┼──────────────┼────────────────────────────────────────────────────────┤                                                                                                 
  │ x-ratelimit-limit-tokens           │ 1000         │ Max tokens per token-window                            │
  ├────────────────────────────────────┼──────────────┼────────────────────────────────────────────────────────┤                                                                                                 
  │ x-ratelimit-renewalperiod-tokens   │ 60 (sec)     │ Length of the token-window                             │
  ├────────────────────────────────────┼──────────────┼────────────────────────────────────────────────────────┤                                                                                                 
  │ x-ratelimit-remaining-requests     │ -1 to 0      │ Requests left in current window (negative = overdrawn) │
  ├────────────────────────────────────┼──────────────┼────────────────────────────────────────────────────────┤                                                                                                 
  │ x-ratelimit-remaining-tokens       │ 297–369      │ Tokens left in current window                          │
  ├────────────────────────────────────┼──────────────┼────────────────────────────────────────────────────────┤                                                                                                 
  │ x-ratelimit-reset-requests         │ 60–65 (sec)  │ Time until full request-bucket refill                  │
  ├────────────────────────────────────┼──────────────┼────────────────────────────────────────────────────────┤                                                                                                 
  │ x-ratelimit-reset-tokens           │ 37–42 (sec)  │ Time until full token-bucket refill                    │
  ├────────────────────────────────────┼──────────────┼────────────────────────────────────────────────────────┤                                                                                                 
  │ retry-after                        │ 30 (sec)     │ RFC standard, conservative                             │
  ├────────────────────────────────────┼──────────────┼────────────────────────────────────────────────────────┤                                                                                                 
  │ retry-after-ms                     │ 542–5881     │ Azure precise — when the next single slot opens        │
  └────────────────────────────────────┴──────────────┴────────────────────────────────────────────────────────┘                                                                                                 
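
  A minimal sketch of how a wait time could be chosen from these headers. The helper name `retryDelayMs`, the `HeaderMap` type, and the 1.5s fallback are assumptions for illustration, not code from this PR, and only the delta-seconds form of `retry-after` is handled (not the HTTP-date form):

```typescript
// Hypothetical helper (not from the PR): pick a wait time from 429 response headers.
type HeaderMap = Record<string, string | undefined>;

// Parse a header value as a non-negative number; undefined if missing or malformed.
function parseNonNegative(raw: string | undefined): number | undefined {
  if (raw === undefined || raw.trim() === "") return undefined;
  const n = Number(raw);
  return Number.isFinite(n) && n >= 0 ? n : undefined;
}

function retryDelayMs(headers: HeaderMap, fallbackMs = 1_500): number {
  // Prefer the millisecond-precision hint: it says when the next slot opens.
  const preciseMs = parseNonNegative(headers["retry-after-ms"]);
  if (preciseMs !== undefined) return preciseMs;

  // Otherwise fall back to the coarser standard header, given in seconds.
  const seconds = parseNonNegative(headers["retry-after"]);
  if (seconds !== undefined) return seconds * 1_000;

  // No usable hint: use the caller-supplied default (assumed value).
  return fallbackMs;
}
```

  With the values observed above, `retry-after-ms: 542` wins over `retry-after: 30`, so the caller waits about half a second instead of 30 seconds.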
                  
  Derivations                                                                                                                                                                                                    
                  
  Hard ceilings                                                                                                                                                                                                  
   
  ┌───────────────────────────────────┬────────────────┬──────────────────────────────┐                                                                                                                          
  │                                   │     Value      │         Calculation          │
  ├───────────────────────────────────┼────────────────┼──────────────────────────────┤
  │ Request rate ceiling              │ 0.6 req/s      │ 6 req ÷ 10s                  │
  ├───────────────────────────────────┼────────────────┼──────────────────────────────┤
  │ Token rate ceiling                │ 16.7 tok/s     │ 1000 tok ÷ 60s               │                                                                                                                          
  ├───────────────────────────────────┼────────────────┼──────────────────────────────┤                                                                                                                          
  │ Sustained call rate (token-bound) │ ~3.3 calls/min │ 1000 tok/min ÷ ~300 tok/call │                                                                                                                          
  ├───────────────────────────────────┼────────────────┼──────────────────────────────┤                                                                                                                          
  │ Burst call rate (request-bound)   │ 6 calls / 10s  │ request bucket capacity      │
  └───────────────────────────────────┴────────────────┴──────────────────────────────┘                                                                                                                          
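
  The same arithmetic as a small sketch, in case the limits change. The `Quota` shape and `deriveCeilings` name are made up for illustration; the ~300 tokens per call is the estimate used in the table above, not a measured constant:

```typescript
// Hypothetical shape for the limits read from the 429 headers.
interface Quota {
  requests: number;         // x-ratelimit-limit-requests
  requestWindowSec: number; // x-ratelimit-renewalperiod-requests
  tokens: number;           // x-ratelimit-limit-tokens
  tokenWindowSec: number;   // x-ratelimit-renewalperiod-tokens
}

function deriveCeilings(q: Quota, tokensPerCall: number) {
  return {
    // Request rate ceiling: requests allowed per second.
    requestsPerSec: q.requests / q.requestWindowSec,
    // Token rate ceiling: tokens allowed per second.
    tokensPerSec: q.tokens / q.tokenWindowSec,
    // Sustained call rate when tokens are the binding limit, per minute.
    sustainedCallsPerMin: (q.tokens / tokensPerCall) * (60 / q.tokenWindowSec),
  };
}

const ceilings = deriveCeilings(
  { requests: 6, requestWindowSec: 10, tokens: 1000, tokenWindowSec: 60 },
  300, // assumed average tokens per call
);
console.log(ceilings);
```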
                  
  Why request-bound, not token-bound, in your logs                                                                                                                                                               
                  
  Looking at the 429s: remaining-requests: -1 while remaining-tokens: 297+. The request bucket emptied first. So burst behavior is gated by request count, but sustained throughput is gated by tokens.          
                  
  Splitting the bucket                                                                                                                                                                                           
                  
  ┌────────────┬─────────────┬───────────────────────────────────────┐                                                                                                                                           
  │    User    │ Allocation  │               Reasoning               │
  ├────────────┼─────────────┼───────────────────────────────────────┤                                                                                                                                           
  │ Chat agent │ 2 req / 10s │ Reserved headroom for interactive use │
  ├────────────┼─────────────┼───────────────────────────────────────┤
  │ Worker     │ 4 req / 10s │ Remaining capacity                    │                                                                                                                                           
  ├────────────┼─────────────┼───────────────────────────────────────┤                                                                                                                                           
  │ Total      │ 6 req / 10s │ Matches Azure ceiling                 │                                                                                                                                           
  └────────────┴─────────────┴───────────────────────────────────────┘                                                                                                                                           
                  
  Worker config math                                                                                                                                                                                             
                  
  ┌───────────────────────────────┬────────┬──────────────────────────────────────────────────────────┐
  │            Setting            │ Value  │                           Math                           │
  ├───────────────────────────────┼────────┼──────────────────────────────────────────────────────────┤
  │ WORKER_RATE_LIMIT_MAX         │ 4      │ Worker's share of the bucket                             │
  ├───────────────────────────────┼────────┼──────────────────────────────────────────────────────────┤
  │ WORKER_RATE_LIMIT_DURATION_MS │ 10_000 │ Matches Azure's request-window                           │                                                                                                          
  ├───────────────────────────────┼────────┼──────────────────────────────────────────────────────────┤                                                                                                          
  │ WORKER_CONCURRENCY            │ 2      │ Less than limiter so no burst; >1 so retries don't block │                                                                                                          
  ├───────────────────────────────┼────────┼──────────────────────────────────────────────────────────┤                                                                                                          
  │ Time to process 10 items      │ ~25s   │ (10 items ÷ 4 starts) × 10s = 25s, plus a small tail     │
  └───────────────────────────────┴────────┴──────────────────────────────────────────────────────────┘                                                                                                          
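
  A sketch of how these settings might be read and sanity-checked. It assumes the three settings are environment variables with the defaults from the table; `workerTuningFromEnv` and `estimateDrainMs` are hypothetical names. The returned object uses the `concurrency` and `limiter: { max, duration }` fields that BullMQ's `Worker` options accept, but nothing here imports BullMQ:

```typescript
// Hypothetical tuning object; field names mirror BullMQ Worker options.
interface WorkerTuning {
  concurrency: number;
  limiter: { max: number; duration: number };
}

// Parse a positive integer, falling back when the value is missing or invalid.
function positiveInt(raw: string | undefined, fallback: number): number {
  const n = Number.parseInt(raw ?? "", 10);
  return Number.isInteger(n) && n > 0 ? n : fallback;
}

function workerTuningFromEnv(env: Record<string, string | undefined>): WorkerTuning {
  return {
    // Below the limiter max so jobs do not start in one burst.
    concurrency: positiveInt(env.WORKER_CONCURRENCY, 2),
    limiter: {
      // Worker's share of the 6 req / 10s bucket.
      max: positiveInt(env.WORKER_RATE_LIMIT_MAX, 4),
      // Matches the request window reported by Azure.
      duration: positiveInt(env.WORKER_RATE_LIMIT_DURATION_MS, 10_000),
    },
  };
}

// Rough time to start `items` jobs when the limiter is the bottleneck.
function estimateDrainMs(items: number, tuning: WorkerTuning): number {
  return (items / tuning.limiter.max) * tuning.limiter.duration;
}

const tuning = workerTuningFromEnv({});
console.log(tuning, estimateDrainMs(10, tuning));
```

  The estimate ignores job run time, which is where the "small tail" in the table comes from.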
                  
  Retry math                                                                                                                                                                                                     
                  
  ┌──────────────────────────────┬──────────┬────────────────────────────────────────────────────────────┬─────────────────────────────────────────────────────────────────────────┐                             
  │            Layer             │ Attempts │                          Backoff                           │                                   Why                                   │
  ├──────────────────────────────┼──────────┼────────────────────────────────────────────────────────────┼─────────────────────────────────────────────────────────────────────────┤
  │ In-call (withRateLimitRetry) │ 6        │ retry-after-ms (preferred) → exponential 1.5s→30s + jitter │ Catches transient 429s invisibly; Azure's retry-after-ms is usually <6s │
  ├──────────────────────────────┼──────────┼────────────────────────────────────────────────────────────┼─────────────────────────────────────────────────────────────────────────┤
  │ BullMQ job                   │ 6        │ Exponential, base 5s                                       │ If in-call retry exhausts, the whole job retries from scratch           │                             
  ├──────────────────────────────┼──────────┼────────────────────────────────────────────────────────────┼─────────────────────────────────────────────────────────────────────────┤                             
  │ Chat stream                  │ 4        │ retry-after-ms → exponential 2s→30s                        │ Restarts the agent stream when 429 hits before any output               │                             
  └──────────────────────────────┴──────────┴────────────────────────────────────────────────────────────┴─────────────────────────────────────────────────────────────────────────┘                             
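
  To make the in-call layer concrete, here is a rough sketch of a retry wrapper with the behaviour described in the first row. It is not the `withRateLimitRetry` from this PR: the error shape (`status`, `headers`), the 25% jitter, and reading "6" as six total attempts are all assumptions for illustration:

```typescript
interface RetryOptions {
  maxAttempts?: number;                  // assumed: total attempts, including the first
  baseMs?: number;                       // first exponential delay
  capMs?: number;                        // upper bound on any computed delay
  sleep?: (ms: number) => Promise<void>; // injectable for tests
  random?: () => number;                 // injectable jitter source in [0, 1)
}

// Assumed error shape; real SDK errors may expose these fields differently.
interface RateLimitError {
  status?: number;
  headers?: Record<string, string | undefined>;
}

// Exponential backoff (base, 2x base, 4x base, ...) plus up to 25% jitter, capped.
function backoffMs(attempt: number, baseMs: number, capMs: number, random: () => number): number {
  const exp = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.min(capMs, exp + random() * exp * 0.25);
}

async function withRateLimitRetrySketch<T>(call: () => Promise<T>, opts: RetryOptions = {}): Promise<T> {
  const {
    maxAttempts = 6,
    baseMs = 1_500,
    capMs = 30_000,
    sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)),
    random = Math.random,
  } = opts;

  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      const e = err as RateLimitError | undefined;
      // Only 429s are retried, and only while attempts remain.
      if (e?.status !== 429 || attempt >= maxAttempts - 1) throw err;

      // Prefer the server's precise hint; otherwise back off exponentially.
      const raw = e.headers?.["retry-after-ms"];
      const hinted = raw === undefined || raw.trim() === "" ? NaN : Number(raw);
      const waitMs = Number.isFinite(hinted) && hinted >= 0
        ? hinted
        : backoffMs(attempt, baseMs, capMs, random);
      await sleep(waitMs);
    }
  }
}
```

  Once this loop gives up, the error propagates to the job, which is where the BullMQ layer in the second row would take over.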
                  
  Worst-case latency a user could see                                                                                                                                                                            
                  
  For a single classify-item job that hits sustained throttling:                                                                                                                                                 
  - 6 in-call retries × ~5s avg = ~30s before BullMQ takes over
  - 6 BullMQ retries × growing exponential 5s→160s = ~5min more                                                                                                                                                  
  - Hard ceiling: ~5.5 min before a job permanently fails      
                                                                                                                                                                                                                 
  In practice almost everything resolves on the first or second in-call retry within a few seconds.     
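
  The worst-case figure can be reproduced with a few lines. This sketch takes the bullets above at face value (six job-level delays doubling from 5s, plus ~30s of in-call retries before the first job failure); it does not check how BullMQ itself counts attempts, and the function name is made up:

```typescript
// Delays that double from `baseSec`, e.g. 5, 10, 20, ... for `count` retries.
function exponentialDelaysSec(baseSec: number, count: number): number[] {
  return Array.from({ length: count }, (_, i) => baseSec * 2 ** i);
}

const inCallSec = 6 * 5;                          // 6 in-call retries at ~5s average
const jobDelaysSec = exponentialDelaysSec(5, 6);  // 5s growing to 160s
const jobBackoffSec = jobDelaysSec.reduce((sum, d) => sum + d, 0);
const worstCaseSec = inCallSec + jobBackoffSec;

console.log(jobDelaysSec, jobBackoffSec, worstCaseSec);
```

  This gives 315s of job-level backoff and 345s in total, the same ballpark as the ~5.5 min quoted above. Any in-call retries that run again inside each job retry would add to it.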

Samples Checklist

✅ All good!

lionello and others added 5 commits May 1, 2026 01:40
Updates the system to generate 5 tasks and 5 events instead of 10.

This change simplifies batch processing, reduces resource consumption, and improves performance by handling smaller input sizes. Adjustments include schema validations, prompt instructions, fallback data slicing, and related messaging updates.
Reduces the default number of items in a seed run from 20 to 10, optimizing resource usage during initialization.

Introduces a new utility function to update the total item count for a seed run, enabling dynamic adjustments and improving flexibility in managing seed runs.
Introduces a rate limiter to cap the worker's throughput at 4
jobs per 10 seconds, ensuring compliance with Azure quota
limits for chat API usage. Adjusts default concurrency from 8
to 2 to align with the updated rate-limiting strategy.

This change prevents excessive API requests that could result
in rate-limiting errors (429) and ensures smoother operation
for both background jobs and interactive chat.
@KevyVo KevyVo requested a review from lionello May 7, 2026 11:00
@KevyVo KevyVo self-assigned this May 7, 2026
@KevyVo KevyVo had a problem deploying to deploy-changed-samples May 7, 2026 11:00 — with GitHub Actions Failure
@KevyVo KevyVo requested a review from raphaeltm May 7, 2026 16:46
Comment thread samples/mastra-extended/app/src/lib/ai.ts
@lionello lionello merged commit 645defe into main May 7, 2026
5 of 6 checks passed
2 participants