feat(rook): implement routing engine with strategy dispatch and fallback chains (#609)
Conversation
…ack chains

- Add RoutingEngine with `resolve()`, `select_from_pool()`, `try_fallback()`
- Support Priority, Failover, RoundRobin, and Weighted (SWRR) strategies
- Handle recursive async fallback chains via `Box::pin`
- Track RR cursor and SWRR counters in `Arc<Mutex<HashMap<PoolId, ...>>>`
- Expose SqliteDb from RookRegistry in test cfg for adversarial FK scenarios
- Fix select_failover to order by priority instead of assuming insert order
- All 14 routing tests pass, 0 warnings

Close #586
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in your CodeRabbit settings.
📝 Walkthrough
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@clients/rook/src/routing/mod.rs`:
- Line 84: weighted_state currently maps PoolId to Vec<i64> (SWRR indices) which
ties state to candidate order; change it to bind state per account id so scores
follow accounts even when candidate lists reorder. Replace
Arc<Mutex<HashMap<PoolId, Vec<i64>>>> with a structure keyed by account
identifier (e.g., Arc<Mutex<HashMap<PoolId, HashMap<AccountId, i64>>>> or
HashMap<PoolId, BTreeMap<AccountId, i64>>) and update all places that read/write
weighted_state (including the blocks around weighted_state declaration and
usages at the regions you noted ~lines 304-309 and 315-326) to look up/update by
AccountId instead of by index, preserving creation/initialization, incrementing,
and wrap logic per account key.
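A minimal sketch of the suggested shape, with `AccountId` simplified to a `String` and the per-pool map reduced to one pool's state. The names here are illustrative, not the PR's actual API:

```rust
use std::collections::HashMap;

// Illustrative stand-in for the crate's AccountId type.
type AccountId = String;

/// Smooth weighted round-robin (SWRR) with per-account state, so scores
/// follow accounts even when the candidate list is reordered.
fn swrr_pick(
    state: &mut HashMap<AccountId, i64>,
    candidates: &[(AccountId, i64)], // (account id, weight)
) -> Option<AccountId> {
    let total: i64 = candidates.iter().map(|(_, w)| w).sum();
    // Credit every candidate by its weight, keyed by id, not by index.
    for (id, weight) in candidates {
        *state.entry(id.clone()).or_insert(0) += weight;
    }
    // The highest accumulated score wins this round.
    let winner = candidates
        .iter()
        .max_by_key(|(id, _)| state[id])
        .map(|(id, _)| id.clone())?;
    // Debit the winner by the total weight (the classic SWRR step).
    *state.get_mut(&winner)? -= total;
    Some(winner)
}
```

Because the map is keyed by account id, shuffling `candidates` between calls neither resets nor misattributes any score.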
- Around line 861-867: The test currently accepts either "cycle" or "exhausted"
in the error string which can mask regressions; update the assertion so that
after calling engine.resolve("cycle-model").await.unwrap_err() (and matching
RookError::Routing), the error message MUST contain "cycle" (not just
"exhausted"). Locate the test around engine.resolve("cycle-model") and
RookError::Routing and change the assertion to require msg.contains("cycle") (or
otherwise assert the Routing error is the cycle-detection variant) so the
fallback cycle is explicitly verified.
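A small sketch of the stricter check. The function name and error text below are hypothetical stand-ins, not code from the PR; the point is that the assertion must demand "cycle" so a silent degradation into depth exhaustion would fail the test:

```rust
// Hypothetical error text a cycle-detecting fallback resolver might
// produce; name and message are illustrative only.
fn routing_error_for_cycle() -> String {
    "fallback cycle detected: pool-a -> pool-b -> pool-a".to_string()
}
```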
- Around line 240-245: The match over SelectionStrategy currently calls
selection helpers (select_priority, select_failover, select_round_robin,
select_weighted) in ways that can panic on poisoned Mutex/lock states; instead
update those selection functions (and all similar call sites referenced around
the other ranges) to handle PoisonError results from their internal lock
accesses and return a Err(RookError::Routing(...)) propagated back through
select_from_pool (which already returns Result), and change this match site to
propagate that Result rather than assuming success; specifically, modify
select_priority/select_failover/select_round_robin/select_weighted to map lock
poison into RookError::Routing with context and return Result<Account,
RookError>, then have select_from_pool use the returned Result from each helper
and propagate errors upward.
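One way the poison-to-error mapping could look, with a hypothetical `RoutingError` standing in for `RookError::Routing`:

```rust
use std::sync::Mutex;

// Hypothetical stand-in for RookError::Routing.
#[derive(Debug)]
struct RoutingError(String);

/// Read a strategy cursor, mapping a poisoned lock into an error
/// instead of panicking in a production selection path.
fn read_cursor(lock: &Mutex<usize>) -> Result<usize, RoutingError> {
    let guard = lock
        .lock()
        .map_err(|e| RoutingError(format!("routing state lock poisoned: {e}")))?;
    Ok(*guard)
}
```

The same `map_err` pattern applies to each `select_*` helper, letting `select_from_pool` propagate the `Result` with `?` rather than unwrap.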
- Around line 708-717: The current assertions around heavy_count and light_count
are too permissive and allow starvation; update the assertions around the
weighted-distribution check (the block using heavy_count and light_count) to
require that heavy_count is within a tighter expected range (e.g., roughly
70–80% of 40, such as 24..=32 or similar) and that light_count is strictly > 0
and within a corresponding lower bound (e.g., 8..=16), so the test fails if the
light account is starved; modify the assert! calls referencing heavy_count and
light_count to enforce these tighter lower/upper bounds and an explicit
assert!(light_count > 0, ...) or combine into a single combined assertion
message to catch regressions in weighted routing.
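The tightened bounds can be captured as a predicate. The ranges below are the review's own examples for a 40-request run with roughly 3:1 weights, not verified against the PR's actual weights:

```rust
/// Illustrative tightened bounds: the heavy account should land in
/// roughly 70-80% of 40 picks, and the light account must never be
/// starved. Ranges are the examples from the review comment.
fn distribution_ok(heavy_count: usize, light_count: usize) -> bool {
    (24..=32).contains(&heavy_count)
        && light_count > 0
        && (8..=16).contains(&light_count)
}
```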
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 67e5e470-5d95-4c6d-be13-9477512a0cf8
📒 Files selected for processing (2)
clients/rook/src/registry/mod.rs
clients/rook/src/routing/mod.rs
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Cloudflare Pages
🧰 Additional context used
📓 Path-based instructions (2)
**/*.rs
⚙️ CodeRabbit configuration file
**/*.rs: Focus on Rust idioms, memory safety, and ownership/borrowing correctness.
Flag unnecessary clones, unchecked panics in production paths, and weak error context.
Prioritize unsafe blocks, FFI boundaries, concurrency races, and secret handling.
Files:
clients/rook/src/registry/mod.rs
clients/rook/src/routing/mod.rs
**/*
⚙️ CodeRabbit configuration file
**/*: Security first, performance second.
Validate input boundaries, auth/authz implications, and secret management.
Look for behavioral regressions, missing tests, and contract breaks across modules.
Files:
clients/rook/src/registry/mod.rs
clients/rook/src/routing/mod.rs
🧠 Learnings (2)
📓 Common learnings
Learnt from: CR
Repo: dallay/corvus PR: 0
File: clients/agent-runtime/AGENTS.md:0-0
Timestamp: 2026-02-17T12:31:17.076Z
Learning: Applies to clients/agent-runtime/src/channels/**/*.rs : Implement `Channel` trait in `src/channels/` with consistent `send`, `listen`, and `health_check` semantics and cover auth/allowlist/health behavior with tests
📚 Learning: 2026-02-17T12:31:17.076Z
Learnt from: CR
Repo: dallay/corvus PR: 0
File: clients/agent-runtime/AGENTS.md:0-0
Timestamp: 2026-02-17T12:31:17.076Z
Learning: Applies to clients/agent-runtime/src/providers/**/*.rs : Implement `Provider` trait in `src/providers/` and register in `src/providers/mod.rs` factory when adding a new provider
Applied to files:
clients/rook/src/routing/mod.rs
🔇 Additional comments (1)
clients/rook/src/registry/mod.rs (1)
43-44: Test-only raw DB access is properly gated. The retained `SqliteDb` handle and accessor are constrained to `#[cfg(test)]`, so production consumers still cannot bypass the registry services.
Also applies to: 70-79
- weighted_state: change from Vec<i64> (by index) to HashMap<AccountId, i64> so state persists even when candidate order changes
- select_* fns: return Result<Account, RookError> and handle PoisonError instead of panicking on lock poisoning
- cycle_detection test: require "cycle" explicitly, not accept "exhausted"
- weighted_distribution test: tighter bounds (heavy: 24-32, light: 1-16) to catch regressions

Close #586
Deploying corvus with Cloudflare Pages

| Latest commit: | 5dad006 |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://29339757.corvus-42x.pages.dev |
| Branch Preview URL: | https://feat-rook-586-routing-engine.corvus-42x.pages.dev |
✅ Contributor Report (User: @yacosta738)
Contributor Report evaluates based on public GitHub activity. Analysis period: 2025-04-20 to 2026-04-20
- Restore routing module with AccountId-keyed weighted_state
- Restore health service with Clone derive
- Remove synchronous .await from health calls
- Reset to working state after merge conflict resolution
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@clients/rook/src/routing/mod.rs`:
- Around line 289-294: The code contains a dead re-binding: remove the redundant
no-op statement `let idx = idx;` that follows computing `idx` in the counter
update block (where `counters.entry(pool_id).or_insert(0)`, `idx = *counter_ref
% candidates.len()`, and `*counter_ref = counter_ref.wrapping_add(1)` are used);
simply delete the extra `let idx = idx;` line so the computed `idx` is used
directly.
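Restated as a self-contained sketch (with `PoolId` simplified to a `String`), the counter block uses the computed `idx` directly, with no re-binding:

```rust
use std::collections::HashMap;

// Illustrative stand-in for the crate's PoolId type.
type PoolId = String;

/// Round-robin cursor per pool: compute `idx` once and use it directly
/// (no redundant `let idx = idx;`); wrapping_add keeps the counter safe
/// from overflow in release builds.
fn round_robin_index(
    counters: &mut HashMap<PoolId, usize>,
    pool_id: &str,
    candidates_len: usize,
) -> usize {
    let counter_ref = counters.entry(pool_id.to_string()).or_insert(0);
    let idx = *counter_ref % candidates_len;
    *counter_ref = counter_ref.wrapping_add(1);
    idx
}
```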
- Around line 338-347: The loop uses pool_state.get_mut(&account.id).expect(...)
which can panic; instead avoid expect by reusing the mutable reference returned
when the entry was inserted or by using
pool_state.get_mut(&account.id).ok_or(RookError::Routing(...))? to propagate an
error; update the for loop that references account, pool_state, get_mut,
best_score and best_id to either store the &mut score from the insertion step
and modify it directly, or replace the expect with ok_or(...)? so the function
returns a Result rather than panicking.
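The first option, reusing the reference from the insertion step, can be sketched like this (names are illustrative):

```rust
use std::collections::HashMap;

/// Keep and reuse the mutable reference that `entry(...).or_insert(...)`
/// returns, so there is no separate `get_mut(...).expect(...)` to panic.
fn bump_score(pool_state: &mut HashMap<String, i64>, account_id: &str, weight: i64) -> i64 {
    let score = pool_state.entry(account_id.to_string()).or_insert(0);
    *score += weight;
    *score
}
```

The `Entry` API guarantees the key exists after `or_insert`, which removes the panic path entirely rather than merely handling it.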
- Around line 456-470: The three SQLite statements must run on the same
connection because PRAGMA foreign_keys is connection-scoped; change the code to
explicitly acquire a single connection from the pool (e.g., call acquire() on
the pool returned by registry.db().pool()) and then execute the three
sqlx::query(...) calls against that acquired connection (using the same
connection variable when calling .execute(...)), so the PRAGMA OFF, DELETE FROM
provider_pools WHERE id = pool_id.to_string(), and PRAGMA ON are pinned to one
connection and not dispatched on different pooled connections.
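A sketch of the suggested fix, assuming sqlx with its sqlite feature. The function name, table name, and `WHERE` clause mirror the review text and are not verified against the PR's code:

```rust
/// Pin all three statements to ONE pooled connection, because
/// `PRAGMA foreign_keys` is scoped to the connection that runs it.
async fn delete_pool_without_fk(
    pool: &sqlx::SqlitePool,
    pool_id: &str,
) -> Result<(), sqlx::Error> {
    let mut conn = pool.acquire().await?;
    sqlx::query("PRAGMA foreign_keys = OFF")
        .execute(&mut *conn)
        .await?;
    sqlx::query("DELETE FROM provider_pools WHERE id = ?")
        .bind(pool_id)
        .execute(&mut *conn)
        .await?;
    sqlx::query("PRAGMA foreign_keys = ON")
        .execute(&mut *conn)
        .await?;
    Ok(())
}
```

Dispatching the same three `sqlx::query(...)` calls against `&pool` instead could land each on a different pooled connection, silently leaving FKs enforced for the `DELETE`.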
🪄 Autofix (Beta)
✅ Autofix completed
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 91ddd29f-6aac-4198-ba9c-1529ac632f94
📒 Files selected for processing (1)
clients/rook/src/routing/mod.rs
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Cloudflare Pages
🧰 Additional context used
📓 Path-based instructions (2)
**/*.rs
⚙️ CodeRabbit configuration file
**/*.rs: Focus on Rust idioms, memory safety, and ownership/borrowing correctness.
Flag unnecessary clones, unchecked panics in production paths, and weak error context.
Prioritize unsafe blocks, FFI boundaries, concurrency races, and secret handling.
Files:
clients/rook/src/routing/mod.rs
**/*
⚙️ CodeRabbit configuration file
**/*: Security first, performance second.
Validate input boundaries, auth/authz implications, and secret management.
Look for behavioral regressions, missing tests, and contract breaks across modules.
Files:
clients/rook/src/routing/mod.rs
🧠 Learnings (2)
📓 Common learnings
Learnt from: CR
Repo: dallay/corvus PR: 0
File: clients/agent-runtime/AGENTS.md:0-0
Timestamp: 2026-02-17T12:31:17.076Z
Learning: Applies to clients/agent-runtime/src/channels/**/*.rs : Implement `Channel` trait in `src/channels/` with consistent `send`, `listen`, and `health_check` semantics and cover auth/allowlist/health behavior with tests
📚 Learning: 2026-02-17T12:31:17.076Z
Learnt from: CR
Repo: dallay/corvus PR: 0
File: clients/agent-runtime/AGENTS.md:0-0
Timestamp: 2026-02-17T12:31:17.076Z
Learning: Applies to clients/agent-runtime/src/main.rs : Preserve CLI contract unless change is intentional and documented; prefer explicit errors over silent fallback for unsupported critical paths
Applied to files:
clients/rook/src/routing/mod.rs
Note: Autofix is a beta feature. Expect some limitations and changes as we gather feedback and continue to improve it.

Fixes Applied Successfully: fixed 1 file(s) based on 3 unresolved review comments. The changes have been pushed to this branch.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@clients/rook/src/routing/mod.rs`:
- Around line 125-137: The current cycle guard in resolve_pool only checks
depth; update resolve_pool (and any callsites like try_fallback) to accept and
thread a mutable visited set (HashSet<PoolId> and/or HashSet<RouteId>) through
recursive calls, check at the start of resolve_pool whether the current pool_id
(or route_id) is already in the set and immediately return a descriptive
RookError (e.g. "fallback cycle: A → B → A") when a re-entry is detected, insert
the id into the set before recursing and remove/clone as appropriate to preserve
callers' state; keep the existing MAX_FALLBACK_DEPTH check as a secondary safety
cap.
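A simplified, iterative version of the suggested guard. The real `resolve_pool` is async and recursive, and `PoolId` is not a `&str`; this sketch only shows the visited-set plus depth-cap shape:

```rust
use std::collections::{HashMap, HashSet};

const MAX_FALLBACK_DEPTH: usize = 8; // secondary safety cap

/// Walk a fallback chain: a visited set catches re-entry immediately
/// with a descriptive error, while the depth cap remains a backstop.
fn resolve_chain(fallbacks: &HashMap<&str, &str>, start: &str) -> Result<String, String> {
    let mut visited: HashSet<String> = HashSet::new();
    let mut current = start.to_string();
    for _ in 0..=MAX_FALLBACK_DEPTH {
        if !visited.insert(current.clone()) {
            return Err(format!("fallback cycle detected at pool '{current}'"));
        }
        match fallbacks.get(current.as_str()) {
            Some(next) => current = next.to_string(),
            None => return Ok(current), // terminal pool: resolve here
        }
    }
    Err("fallback depth exceeded".to_string())
}
```

With a visited set, an `A → B → A` chain fails on the third hop with a cycle error instead of burning through the full depth budget first.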
- Around line 146-152: The loop over pool.members currently swallows missing
accounts; update the block around self.registry.accounts().get(member_id).await
to emit a tracing::warn! including pool.id (or pool identifier) and member_id
when get(...) returns None so missing/dangling members are observable, and after
the loop check if candidates.is_empty() while pool.members is non-empty and
return a RookError::Routing (distinct from "no healthy candidates") instead of
proceeding; modify references to candidates, pool.members, and
self.registry.accounts().get to implement these changes.
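A sketch of the suggested behavior, where `lookup` stands in for the async `self.registry.accounts().get(...)` call and `eprintln!` stands in for `tracing::warn!`:

```rust
/// Log each dangling member, and treat "every listed member is missing"
/// as a distinct error from "no healthy candidates".
fn collect_candidates<'a>(
    members: &[&str],
    lookup: impl Fn(&str) -> Option<&'a str>,
) -> Result<Vec<&'a str>, String> {
    let mut candidates = Vec::new();
    for member_id in members {
        match lookup(member_id) {
            Some(account) => candidates.push(account),
            // The real engine would use tracing::warn! with the pool id.
            None => eprintln!("warning: pool member '{member_id}' not found"),
        }
    }
    if candidates.is_empty() && !members.is_empty() {
        return Err("pool lists members but none could be resolved".to_string());
    }
    Ok(candidates)
}
```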
- Around line 193-202: The try_fallback path redundantly re-fetches the pool via
self.registry.pools().get(pool_id).await to read fallback_pool_id; change
try_fallback's signature and callers (e.g., resolve_pool) to accept the
already-fetched Pool (or simply an Option<PoolId> fallback_pool_id) so it can
read pool.fallback_pool_id without another registry lookup, update resolve_pool
to pass the pool or its fallback_pool_id into try_fallback, and remove the extra
call to self.registry.pools().get in try_fallback (also update any other callers
of try_fallback to match the new parameter).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: f35395dc-7568-4dcf-9d9c-278c662e4c52
📒 Files selected for processing (1)
clients/rook/src/routing/mod.rs
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Cloudflare Pages
🧰 Additional context used
📓 Path-based instructions (2)
**/*.rs
⚙️ CodeRabbit configuration file
**/*.rs: Focus on Rust idioms, memory safety, and ownership/borrowing correctness.
Flag unnecessary clones, unchecked panics in production paths, and weak error context.
Prioritize unsafe blocks, FFI boundaries, concurrency races, and secret handling.
Files:
clients/rook/src/routing/mod.rs
**/*
⚙️ CodeRabbit configuration file
**/*: Security first, performance second.
Validate input boundaries, auth/authz implications, and secret management.
Look for behavioral regressions, missing tests, and contract breaks across modules.
Files:
clients/rook/src/routing/mod.rs
🧠 Learnings (4)
📓 Common learnings
Learnt from: CR
Repo: dallay/corvus PR: 0
File: clients/agent-runtime/AGENTS.md:0-0
Timestamp: 2026-02-17T12:31:17.076Z
Learning: Applies to clients/agent-runtime/src/channels/**/*.rs : Implement `Channel` trait in `src/channels/` with consistent `send`, `listen`, and `health_check` semantics and cover auth/allowlist/health behavior with tests
📚 Learning: 2026-02-17T12:31:17.076Z
Learnt from: CR
Repo: dallay/corvus PR: 0
File: clients/agent-runtime/AGENTS.md:0-0
Timestamp: 2026-02-17T12:31:17.076Z
Learning: Applies to clients/agent-runtime/src/providers/**/*.rs : Implement `Provider` trait in `src/providers/` and register in `src/providers/mod.rs` factory when adding a new provider
Applied to files:
clients/rook/src/routing/mod.rs
📚 Learning: 2026-02-17T12:31:17.076Z
Learnt from: CR
Repo: dallay/corvus PR: 0
File: clients/agent-runtime/AGENTS.md:0-0
Timestamp: 2026-02-17T12:31:17.076Z
Learning: Applies to clients/agent-runtime/src/main.rs : Preserve CLI contract unless change is intentional and documented; prefer explicit errors over silent fallback for unsupported critical paths
Applied to files:
clients/rook/src/routing/mod.rs
📚 Learning: 2026-02-17T12:31:17.076Z
Learnt from: CR
Repo: dallay/corvus PR: 0
File: clients/agent-runtime/AGENTS.md:0-0
Timestamp: 2026-02-17T12:31:17.076Z
Learning: Applies to clients/agent-runtime/src/**/*.rs : Avoid unnecessary allocations, clones, and blocking operations to maintain performance and efficiency
Applied to files:
clients/rook/src/routing/mod.rs
- Remove redundant let idx = idx in select_round_robin
- Replace expect with ok_or in select_weighted to propagate errors
- Use single connection for PRAGMA FK in test (acquire())
- Add visited_pools HashSet to resolve_pool for cycle detection
- Add tracing::warn when missing account in pool_members
- Return error when pool has members but none found
- Pass pre-fetched pool to try_fallback to avoid re-fetch
- Import ProviderPool for try_fallback signature
Force-pushed from a87fccd to 5dad006 (Compare)
Summary
Implements #586 — Extract a reusable routing and account-pool engine from current runtime capabilities as part of the Corvus Rook v1 milestone (#572).
What changed
clients/rook/src/routing/mod.rs — new `RoutingEngine` with:
- `resolve(logical_model)` → `RoutingDecision` (selected account + metadata)
- Strategies: `Priority`, `Failover`, `RoundRobin`, `WeightedRoundRobin` (SWRR)
- Recursive async fallback chains via `Box::pin`
- Routing failures surfaced as `RookError::Routing`
- Per-pool strategy state in `Arc<Mutex<HashMap<PoolId, ...>>>`

clients/rook/src/registry/mod.rs — adds a `#[cfg(test)] pub fn db()` accessor to expose `SqliteDb` for adversarial test scenarios that need to bypass FK constraints

Tests
14/14 routing unit tests pass, 0 warnings.
Covers: basic resolution, priority, failover, round-robin, weighted distribution, disabled/unhealthy skipping, capability constraints, pool and route fallback chains, cycle detection, pool-not-found error path.
Design notes
- `RoutingEngine` owns a cloned `RookRegistry` — no circular dependency
- `registry.health().mark_failure()` is called after failure
- `max_retries` from `RoutingPolicy` is NOT enforced by the engine — gateway's responsibility (Add health-aware cooldown, retry, and capability-constrained route resolution #588)
- `select_failover` orders by `priority` (lowest value wins) instead of trusting SQLite insert order

Close #586
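The `select_failover` ordering noted above can be sketched as follows. The tuple shape is a simplification: the real code operates on account structs, not `(id, priority)` pairs:

```rust
/// Failover selection sketch: sort by explicit priority (lowest value
/// wins) instead of trusting SQLite insert order.
fn select_failover(mut candidates: Vec<(String, u32)>) -> Option<String> {
    candidates.sort_by_key(|(_, priority)| *priority);
    candidates.into_iter().next().map(|(id, _)| id)
}
```

`sort_by_key` is stable, so candidates with equal priority keep their relative order, which makes ties deterministic for a given input ordering.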