Optimize queue configuration and implement rate limiting for API by KevyVo · Pull Request #640 · DefangLabs/samples

KevyVo · 2026-05-07T11:00:22Z

This pull request introduces robust rate limit handling and retry logic for all AI/LLM and embedding calls, improves reliability under quota constraints, and adjusts the demo seed data and worker configuration for better quota management.

It seems the Azure API returns its hard limits in its response, which I took into account when making these changes. Since these changes only affect Azure, should we enable this logic only for that cloud?

Here is the data behind these settings:

Hard limits Azure returned                                                                                                                                                                                     
                                                                                                                                                                                                                 
  These all came from the 429 response headers in the worker logs (model deployment chat-default):                                                                                                              
   
  ┌────────────────────────────────────┬──────────────┬────────────────────────────────────────────────────────┐                                                                                                 
  │               Header               │    Value     │                        Meaning                         │
  ├────────────────────────────────────┼──────────────┼────────────────────────────────────────────────────────┤
  │ x-ratelimit-key                    │ chat-default │ Bucket name — worker + chat agent share this one       │
  ├────────────────────────────────────┼──────────────┼────────────────────────────────────────────────────────┤
  │ x-ratelimit-limit-requests         │ 6            │ Max requests per request-window                        │                                                                                                 
  ├────────────────────────────────────┼──────────────┼────────────────────────────────────────────────────────┤
  │ x-ratelimit-renewalperiod-requests │ 10 (sec)     │ Length of the request-window                           │                                                                                                 
  ├────────────────────────────────────┼──────────────┼────────────────────────────────────────────────────────┤                                                                                                 
  │ x-ratelimit-limit-tokens           │ 1000         │ Max tokens per token-window                            │
  ├────────────────────────────────────┼──────────────┼────────────────────────────────────────────────────────┤                                                                                                 
  │ x-ratelimit-renewalperiod-tokens   │ 60 (sec)     │ Length of the token-window                             │
  ├────────────────────────────────────┼──────────────┼────────────────────────────────────────────────────────┤                                                                                                 
  │ x-ratelimit-remaining-requests     │ -1 to 0      │ Requests left in current window (negative = overdrawn) │
  ├────────────────────────────────────┼──────────────┼────────────────────────────────────────────────────────┤                                                                                                 
  │ x-ratelimit-remaining-tokens       │ 297–369      │ Tokens left in current window                          │
  ├────────────────────────────────────┼──────────────┼────────────────────────────────────────────────────────┤                                                                                                 
  │ x-ratelimit-reset-requests         │ 60–65 (sec)  │ Time until full request-bucket refill                  │
  ├────────────────────────────────────┼──────────────┼────────────────────────────────────────────────────────┤                                                                                                 
  │ x-ratelimit-reset-tokens           │ 37–42 (sec)  │ Time until full token-bucket refill                    │
  ├────────────────────────────────────┼──────────────┼────────────────────────────────────────────────────────┤                                                                                                 
  │ retry-after                        │ 30 (sec)     │ RFC standard, conservative                             │
  ├────────────────────────────────────┼──────────────┼────────────────────────────────────────────────────────┤                                                                                                 
  │ retry-after-ms                     │ 542–5881     │ Azure precise — when the next single slot opens        │
  └────────────────────────────────────┴──────────────┴────────────────────────────────────────────────────────┘                                                                                                 
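A minimal sketch of how a client might turn these headers into a wait time, preferring `retry-after-ms` over the coarser `retry-after`. The header names are taken from the table above; `HeaderMap`, `parseNonNegative`, and `retryDelayFromHeaders` are hypothetical names for illustration, not code from this PR.

```typescript
// Hypothetical helper: header names come from the table above, everything else is illustrative.
type HeaderMap = Record<string, string | undefined>;

// Parse a header value as a non-negative number, or return undefined if absent or malformed.
function parseNonNegative(value: string | undefined): number | undefined {
  if (value === undefined || value.trim() === "") return undefined;
  const n = Number(value);
  return Number.isFinite(n) && n >= 0 ? n : undefined;
}

// Return the suggested wait in milliseconds, or undefined when the response carries no usable hint.
function retryDelayFromHeaders(headers: HeaderMap): number | undefined {
  // Prefer the precise millisecond hint when it is present.
  const preciseMs = parseNonNegative(headers["retry-after-ms"]);
  if (preciseMs !== undefined) return preciseMs;
  // retry-after may also be an HTTP date; this sketch only handles the seconds form.
  const seconds = parseNonNegative(headers["retry-after"]);
  return seconds !== undefined ? seconds * 1000 : undefined;
}

console.log(retryDelayFromHeaders({ "retry-after": "30", "retry-after-ms": "542" })); // 542
console.log(retryDelayFromHeaders({ "retry-after": "30" })); // 30000
```

Returning `undefined` when neither header is usable lets the caller fall back to its own backoff schedule.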
                  
  Derivations                                                                                                                                                                                                    
                  
  Hard ceilings                                                                                                                                                                                                  
   
  ┌───────────────────────────────────┬────────────────┬──────────────────────────────┐                                                                                                                          
  │                                   │     Value      │         Calculation          │
  ├───────────────────────────────────┼────────────────┼──────────────────────────────┤
  │ Request rate ceiling              │ 0.6 req/s      │ 6 req ÷ 10s                  │
  ├───────────────────────────────────┼────────────────┼──────────────────────────────┤
  │ Token rate ceiling                │ 16.7 tok/s     │ 1000 tok ÷ 60s               │                                                                                                                          
  ├───────────────────────────────────┼────────────────┼──────────────────────────────┤                                                                                                                          
  │ Sustained call rate (token-bound) │ ~3.3 calls/min │ 1000 tok/min ÷ ~300 tok/call │                                                                                                                          
  ├───────────────────────────────────┼────────────────┼──────────────────────────────┤                                                                                                                          
  │ Burst call rate (request-bound)   │ 6 calls / 10s  │ request bucket capacity      │
  └───────────────────────────────────┴────────────────┴──────────────────────────────┘                                                                                                                          
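The ceilings above can be reproduced with a few lines of arithmetic. This is a sketch only: `AzureLimits` and `computeCeilings` are hypothetical names, and the ~300 tokens per call figure is the estimate used in the table, not a measured constant.

```typescript
// Limits as reported by the 429 response headers (values from the first table).
interface AzureLimits {
  requestsPerWindow: number;
  requestWindowSec: number;
  tokensPerWindow: number;
  tokenWindowSec: number;
}

// Derive the rate ceilings; assumedTokensPerCall is an estimate, not something Azure reports.
function computeCeilings(limits: AzureLimits, assumedTokensPerCall: number) {
  const requestsPerSec = limits.requestsPerWindow / limits.requestWindowSec;
  const tokensPerSec = limits.tokensPerWindow / limits.tokenWindowSec;
  return {
    requestsPerSec,
    tokensPerSec,
    // Sustained throughput if tokens are the binding constraint.
    tokenBoundCallsPerMin: (tokensPerSec * 60) / assumedTokensPerCall,
    // Sustained throughput if only the request bucket mattered.
    requestBoundCallsPerMin: requestsPerSec * 60,
  };
}

const ceilings = computeCeilings(
  { requestsPerWindow: 6, requestWindowSec: 10, tokensPerWindow: 1000, tokenWindowSec: 60 },
  300,
);
console.log(ceilings.requestsPerSec.toFixed(1)); // 0.6
console.log(ceilings.tokensPerSec.toFixed(1)); // 16.7
console.log(ceilings.tokenBoundCallsPerMin.toFixed(1)); // 3.3
console.log(ceilings.requestBoundCallsPerMin.toFixed(1)); // 36.0
```

Comparing the last two values shows why sustained throughput is token-bound: about 3.3 calls/min under the token limit versus 36 calls/min under the request limit.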
                  
  Why request-bound, not token-bound, in your logs                                                                                                                                                               
                  
  Looking at the 429s: remaining-requests: -1 while remaining-tokens: 297+. The request bucket emptied first. So burst behavior is gated by request count, but sustained throughput is gated by tokens.          
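The same check can be made programmatically from a 429 response. A minimal sketch, assuming the headers are available as a plain key/value map; `exhaustedBuckets` is a hypothetical helper, not part of this PR.

```typescript
type HeaderMap = Record<string, string | undefined>;
type Bucket = "requests" | "tokens";

// Read a remaining-count header; returns undefined when it is missing or not numeric.
function remaining(headers: HeaderMap, name: string): number | undefined {
  const raw = headers[name];
  if (raw === undefined || raw.trim() === "") return undefined;
  const n = Number(raw);
  return Number.isFinite(n) ? n : undefined;
}

// List the buckets that are empty or overdrawn (remaining <= 0).
function exhaustedBuckets(headers: HeaderMap): Bucket[] {
  const result: Bucket[] = [];
  const requests = remaining(headers, "x-ratelimit-remaining-requests");
  const tokens = remaining(headers, "x-ratelimit-remaining-tokens");
  if (requests !== undefined && requests <= 0) result.push("requests");
  if (tokens !== undefined && tokens <= 0) result.push("tokens");
  return result;
}

// Values from the logged 429s: the request bucket is overdrawn, tokens are not.
console.log(
  exhaustedBuckets({
    "x-ratelimit-remaining-requests": "-1",
    "x-ratelimit-remaining-tokens": "297",
  }),
); // [ 'requests' ]
```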
                  
  Splitting the bucket                                                                                                                                                                                           
                  
  ┌────────────┬─────────────┬───────────────────────────────────────┐                                                                                                                                           
  │    User    │ Allocation  │               Reasoning               │
  ├────────────┼─────────────┼───────────────────────────────────────┤                                                                                                                                           
  │ Chat agent │ 2 req / 10s │ Reserved headroom for interactive use │
  ├────────────┼─────────────┼───────────────────────────────────────┤
  │ Worker     │ 4 req / 10s │ Remaining capacity                    │                                                                                                                                           
  ├────────────┼─────────────┼───────────────────────────────────────┤                                                                                                                                           
  │ Total      │ 6 req / 10s │ Matches Azure ceiling                 │                                                                                                                                           
  └────────────┴─────────────┴───────────────────────────────────────┘                                                                                                                                           
                  
  Worker config math                                                                                                                                                                                             
                  
  ┌───────────────────────────────┬────────┬──────────────────────────────────────────────────────────┐
  │            Setting            │ Value  │                           Math                           │
  ├───────────────────────────────┼────────┼──────────────────────────────────────────────────────────┤
  │ WORKER_RATE_LIMIT_MAX         │ 4      │ Worker's share of the bucket                             │
  ├───────────────────────────────┼────────┼──────────────────────────────────────────────────────────┤
  │ WORKER_RATE_LIMIT_DURATION_MS │ 10_000 │ Matches Azure's request-window                           │                                                                                                          
  ├───────────────────────────────┼────────┼──────────────────────────────────────────────────────────┤                                                                                                          
  │ WORKER_CONCURRENCY            │ 2      │ Less than limiter so no burst; >1 so retries don't block │                                                                                                          
  ├───────────────────────────────┼────────┼──────────────────────────────────────────────────────────┤                                                                                                          
  │ Time to process 10 items      │ ~25s   │ 10 items ÷ (4 starts / 10s) = 25s, plus a small tail     │
  └───────────────────────────────┴────────┴──────────────────────────────────────────────────────────┘                                                                                                          
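As a rough illustration of how these settings fit together, the sketch below builds a throttle configuration from the three settings and reproduces the ~25s estimate. It assumes the settings are read from environment variables and uses the defaults from the table; the `{ concurrency, limiter: { max, duration } }` shape mirrors BullMQ's worker options, while `workerOptionsFromEnv` and `estimateDrainSeconds` are hypothetical names.

```typescript
// Shape mirrors BullMQ's worker `concurrency` and `limiter` options.
interface WorkerThrottleOptions {
  concurrency: number;
  limiter: { max: number; duration: number };
}

// Parse a positive integer setting, falling back to a default when unset or invalid.
function positiveInt(value: string | undefined, fallback: number): number {
  const n = Number(value);
  return value !== undefined && Number.isInteger(n) && n > 0 ? n : fallback;
}

// Build the throttle options; defaults are the values from the table above.
function workerOptionsFromEnv(env: Record<string, string | undefined>): WorkerThrottleOptions {
  return {
    concurrency: positiveInt(env["WORKER_CONCURRENCY"], 2),
    limiter: {
      max: positiveInt(env["WORKER_RATE_LIMIT_MAX"], 4),
      duration: positiveInt(env["WORKER_RATE_LIMIT_DURATION_MS"], 10000),
    },
  };
}

// Average-rate estimate of how long the limiter takes to admit `items` jobs.
function estimateDrainSeconds(items: number, opts: WorkerThrottleOptions): number {
  return (items / opts.limiter.max) * (opts.limiter.duration / 1000);
}

const opts = workerOptionsFromEnv({});
console.log(opts); // concurrency 2, limiter max 4 per 10000 ms
console.log(estimateDrainSeconds(10, opts)); // 25
```

The estimate is an average-rate figure; the time spent inside the last jobs themselves accounts for the "small tail".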
                  
  Retry math                                                                                                                                                                                                     
                  
  ┌──────────────────────────────┬──────────┬────────────────────────────────────────────────────────────┬─────────────────────────────────────────────────────────────────────────┐                             
  │            Layer             │ Attempts │                          Backoff                           │                                   Why                                   │
  ├──────────────────────────────┼──────────┼────────────────────────────────────────────────────────────┼─────────────────────────────────────────────────────────────────────────┤
  │ In-call (withRateLimitRetry) │ 6        │ retry-after-ms (preferred) → exponential 1.5s→30s + jitter │ Catches transient 429s invisibly; Azure's retry-after-ms is usually <6s │
  ├──────────────────────────────┼──────────┼────────────────────────────────────────────────────────────┼─────────────────────────────────────────────────────────────────────────┤
  │ BullMQ job                   │ 6        │ Exponential, base 5s                                       │ If in-call retry exhausts, the whole job retries from scratch           │                             
  ├──────────────────────────────┼──────────┼────────────────────────────────────────────────────────────┼─────────────────────────────────────────────────────────────────────────┤                             
  │ Chat stream                  │ 4        │ retry-after-ms → exponential 2s→30s                        │ Restarts the agent stream when 429 hits before any output               │                             
  └──────────────────────────────┴──────────┴────────────────────────────────────────────────────────────┴─────────────────────────────────────────────────────────────────────────┘                             
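The in-call layer could look roughly like the sketch below: use the server's `retry-after-ms` hint when one is available, otherwise fall back to capped exponential backoff with jitter. This is not the PR's `withRateLimitRetry`; `retryOnRateLimit`, `backoffDelayMs`, the error shape (`status`, `retryAfterMs`), the doubling factor, and the 20% jitter ratio are all assumptions made for illustration.

```typescript
interface BackoffConfig {
  baseMs: number; // first fallback delay
  maxMs: number; // cap on any single delay
  jitterRatio: number; // assumed: up to this fraction of the delay is added as jitter
}

// 1.5s -> 30s, matching the in-call row of the table; the jitter ratio is an assumption.
const IN_CALL_BACKOFF: BackoffConfig = { baseMs: 1500, maxMs: 30000, jitterRatio: 0.2 };

// Delay before the next try. `attempt` is 0 for the first retry.
function backoffDelayMs(
  attempt: number,
  retryAfterMs: number | undefined,
  cfg: BackoffConfig,
  rand: () => number = Math.random,
): number {
  if (retryAfterMs !== undefined) return retryAfterMs; // server hint wins
  const exponential = Math.min(cfg.maxMs, cfg.baseMs * Math.pow(2, attempt));
  const jitter = exponential * cfg.jitterRatio * rand();
  return Math.min(cfg.maxMs, exponential + jitter);
}

// Assumed error shape: a 429 status plus an optional parsed retry-after-ms value.
interface RateLimitedError {
  status: 429;
  retryAfterMs?: number;
}

function isRateLimited(err: unknown): err is RateLimitedError {
  return typeof err === "object" && err !== null && (err as { status?: unknown }).status === 429;
}

// Retry `call` on 429s up to maxAttempts total tries; other errors are rethrown immediately.
async function retryOnRateLimit<T>(
  call: () => Promise<T>,
  maxAttempts: number = 6,
  cfg: BackoffConfig = IN_CALL_BACKOFF,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      const outOfAttempts = attempt + 1 >= maxAttempts;
      if (!isRateLimited(err) || outOfAttempts) throw err;
      await sleep(backoffDelayMs(attempt, err.retryAfterMs, cfg));
    }
  }
}
```

Passing `sleep` and `rand` as parameters keeps the timing logic testable without real delays. Once the attempts are exhausted the last 429 is rethrown, which is the point where a queue-level retry (the BullMQ row) would take over.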
                  
  Worst-case latency a user could see                                                                                                                                                                            
                  
  For a single classify-item job that hits sustained throttling:                                                                                                                                                 
  - 6 in-call retries × ~5s avg = ~30s before BullMQ takes over
  - 6 BullMQ retries × growing exponential 5s→160s = ~5min more                                                                                                                                                  
  - Hard ceiling: ~5.5 min before a job permanently fails      
                                                                                                                                                                                                                 
  In practice almost everything resolves on the first or second in-call retry within a few seconds.
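The worst-case figure can be checked by summing the backoff schedule. The sketch below assumes what the bullets above state: six in-call retries averaging ~5s each, followed by six job-level delays doubling from 5s to 160s. It ignores the time each retried job spends inside its own in-call retries, and `exponentialDelaysMs` is a hypothetical helper.

```typescript
// Delays that double each time: base, 2*base, 4*base, ...
function exponentialDelaysMs(baseMs: number, retries: number): number[] {
  return Array.from({ length: retries }, (_, i) => baseMs * Math.pow(2, i));
}

const inCallMs = 6 * 5000; // 6 retries at an assumed ~5s average
const jobDelaysMs = exponentialDelaysMs(5000, 6); // 5s, 10s, 20s, 40s, 80s, 160s
const jobBackoffMs = jobDelaysMs.reduce((sum, d) => sum + d, 0);
const totalMs = inCallMs + jobBackoffMs;

console.log(jobBackoffMs / 1000); // 315 seconds of job-level backoff
console.log(totalMs / 60000); // 5.75 minutes in total
```

Under these assumptions the total comes to about 5.75 minutes, in the same range as the ~5.5 min ceiling quoted above.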

Samples Checklist

✅ All good!

Updates the system to generate 5 tasks and 5 events instead of 10. This change simplifies batch processing, reduces resource consumption, and improves performance by handling smaller input sizes. Adjustments include schema validations, prompt instructions, fallback data slicing, and related messaging updates.

Reduces the default number of items in a seed run from 20 to 10, optimizing resource usage during initialization. Introduces a new utility function to update the total item count for a seed run, enabling dynamic adjustments and improving flexibility in managing seed runs.

Introduces a rate limiter that caps the worker at 4 job starts per 10 seconds, keeping it within the Azure quota for chat API usage. Lowers the default concurrency from 8 to 2 to match the new rate-limiting strategy. This prevents bursts of API requests that would trigger rate-limiting errors (429) and keeps both background jobs and interactive chat running smoothly.

lionello and others added 5 commits May 1, 2026 01:40

Add QUEUE_PREFIX to queue configuration for consistent Redis key hand…

c1c4a80

…ling

Add rate limit handling and retry logic for API calls

16038cb

KevyVo requested a review from lionello May 7, 2026 11:00

KevyVo self-assigned this May 7, 2026

KevyVo had a problem deploying to deploy-changed-samples May 7, 2026 11:00 — with GitHub Actions Failure

KevyVo requested a review from raphaeltm May 7, 2026 16:46

KevyVo commented May 7, 2026

Comment thread samples/mastra-extended/app/src/lib/ai.ts

lionello approved these changes May 7, 2026

lionello merged commit 645defe into main May 7, 2026
5 of 6 checks passed

lionello merged 5 commits into main from kevin/rate-limit-mastra
