feat(rook): implement routing engine with strategy dispatch and fallback chains#609

Merged
yacosta738 merged 5 commits into develop from feat/rook-586-routing-engine
Apr 20, 2026
@yacosta738

Summary

Implements #586 — Extract a reusable routing and account-pool engine from current runtime capabilities as part of the Corvus Rook v1 milestone (#572).

What changed

  • clients/rook/src/routing/mod.rs — new RoutingEngine with:

    • resolve(logical_model) → RoutingDecision (selected account + metadata)
    • Strategy dispatch: Priority, Failover, RoundRobin, WeightedRoundRobin (SWRR)
    • Recursive async fallback chains (pool → pool, route → route) via Box::pin
    • Health-aware filtering: skips disabled and unhealthy accounts
    • Capability constraint filtering
    • Cycle detection (max 8 hops) with RookError::Routing
    • Per-pool RR cursor and SWRR counters in Arc<Mutex<HashMap<PoolId, ...>>>
  • clients/rook/src/registry/mod.rs — adds #[cfg(test)] pub fn db() accessor to expose SqliteDb for adversarial test scenarios that need to bypass FK constraints
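
To make the SWRR strategy above concrete, here is a minimal sketch of smooth weighted round-robin. It is not the PR's code: `Candidate`, `select_swrr`, and the use of plain `String` ids are assumptions made for illustration, and the per-pool map and mutex are left out so that only the scoring step is shown.

```rust
use std::collections::HashMap;

/// Hypothetical candidate type: an account id plus its static weight.
#[derive(Debug, Clone)]
pub struct Candidate {
    pub id: String,
    pub weight: i64,
}

/// One smooth weighted round-robin step.
/// `scores` carries the running score of each account between calls.
pub fn select_swrr(
    candidates: &[Candidate],
    scores: &mut HashMap<String, i64>,
) -> Option<String> {
    let total: i64 = candidates.iter().map(|c| c.weight).sum();
    if candidates.is_empty() || total <= 0 {
        return None;
    }

    let mut best: Option<(&Candidate, i64)> = None;
    for c in candidates {
        // Every candidate gains its own weight on each step.
        let score = scores.entry(c.id.clone()).or_insert(0);
        *score += c.weight;
        let current = *score;
        // Strict comparison: on a tie the earlier candidate wins.
        if best.map_or(true, |(_, b)| current > b) {
            best = Some((c, current));
        }
    }

    let (winner, _) = best?;
    // The winner pays back the total weight, which spreads picks out evenly.
    if let Some(s) = scores.get_mut(&winner.id) {
        *s -= total;
    }
    Some(winner.id.clone())
}
```

With weights 3 and 1, every four consecutive calls pick the heavier account three times and the lighter one once, so neither account is starved.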

Tests

14/14 routing unit tests pass, 0 warnings.

Covers: basic resolution, priority, failover, round-robin, weighted distribution, disabled/unhealthy skipping, capability constraints, pool and route fallback chains, cycle detection, pool-not-found error path.

Design notes

  • RoutingEngine owns a cloned RookRegistry — no circular dependency
  • Health feedback is out-of-band: callers call registry.health().mark_failure() after failure
  • max_retries from RoutingPolicy is NOT enforced by the engine — gateway's responsibility (Add health-aware cooldown, retry, and capability-constrained route resolution #588)
  • select_failover orders by priority (lowest value wins) instead of trusting SQLite insert order
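
The last note can be sketched as follows. This is an illustration under assumptions, not the merged implementation: the `Account` fields and the synchronous signature are hypothetical, and health is modelled as a plain boolean.

```rust
/// Hypothetical account record; the field names are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct Account {
    pub id: String,
    pub priority: u32,
    pub enabled: bool,
    pub healthy: bool,
}

/// Pick the usable account with the lowest priority value,
/// regardless of the order in which the rows were stored.
pub fn select_failover(candidates: &[Account]) -> Option<&Account> {
    candidates
        .iter()
        // Health-aware filtering: disabled and unhealthy accounts are skipped.
        .filter(|a| a.enabled && a.healthy)
        // `min_by_key` returns the first minimum, so ties keep input order.
        .min_by_key(|a| a.priority)
}
```

Sorting on an explicit key keeps the result stable even if the storage layer returns rows in a different order.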

Close #586

…ack chains

- Add RoutingEngine with resolve(), select_from_pool(), try_fallback()
- Support Priority, Failover, RoundRobin, and Weighted (SWRR) strategies
- Handle recursive async fallback chains via Box::pin
- Track RR cursor and SWRR counters in Arc<Mutex<HashMap<PoolId, ...>>>
- Expose SqliteDb from RookRegistry in test cfg for adversarial FK scenarios
- Fix select_failover to order by priority instead of assuming insert order
- All 14 routing tests pass, 0 warnings
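
As a rough illustration of the recursive fallback chain handled via Box::pin, the sketch below resolves a pool and follows its fallback when the pool has nothing usable. Everything here is an assumption for the example: `Pool`, `resolve`, the `String` error type, and the tiny `block_on` helper, which exists only so the sketch runs without an async runtime.

```rust
use std::collections::HashMap;
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical pool: the accounts that are currently usable, plus an optional fallback.
pub struct Pool {
    pub usable_accounts: Vec<String>,
    pub fallback: Option<String>,
}

const MAX_HOPS: usize = 8;

/// A recursive async call needs indirection, because the future would
/// otherwise contain itself. Boxing and pinning the future provides it.
pub fn resolve<'a>(
    pools: &'a HashMap<String, Pool>,
    pool_id: &'a str,
    depth: usize,
) -> Pin<Box<dyn Future<Output = Result<String, String>> + 'a>> {
    Box::pin(async move {
        if depth >= MAX_HOPS {
            return Err(format!("fallback chain exceeded {MAX_HOPS} hops"));
        }
        let pool = pools
            .get(pool_id)
            .ok_or_else(|| format!("pool not found: {pool_id}"))?;
        if let Some(account) = pool.usable_accounts.first() {
            return Ok(account.clone());
        }
        match &pool.fallback {
            Some(next) => resolve(pools, next, depth + 1).await,
            None => Err(format!("pool {pool_id} exhausted, no fallback")),
        }
    })
}

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Minimal executor for the sketch: these futures never wait on I/O,
/// so polling in a loop is enough.
pub fn block_on<T>(mut fut: Pin<Box<dyn Future<Output = T> + '_>>) -> T {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(value) = fut.as_mut().poll(&mut cx) {
            return value;
        }
    }
}
```

A pool that falls back to itself never finds an account, so the depth cap is what ends the recursion in this sketch.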

Close #586
@coderabbitai

coderabbitai Bot commented Apr 20, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Title check | ⚠️ Warning | Title exceeds 72-character limit (79 chars) and uses proper Conventional Commit format with scope; describes main routing engine feature accurately. | Shorten title to ≤72 characters while preserving core message, e.g. 'feat(rook): implement routing engine with fallback chains' (57 chars). |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description check | ✅ Passed | PR description is comprehensive with Summary, What changed, Tests, and Design notes sections; Related Issues, Tested Information, and Documentation Impact sections align with template. |
| Linked Issues check | ✅ Passed | PR fully implements #586 requirements: routing engine with strategy dispatch, fallback chains, health/capability filtering, cycle detection, test DB access, and comprehensive test coverage. |
| Out of Scope Changes check | ✅ Passed | All changes directly support #586 scope: new RoutingEngine in routing/mod.rs, test DB accessor in registry/mod.rs, and 14 routing tests cover stated objectives. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@clients/rook/src/routing/mod.rs`:
- Line 84: weighted_state currently maps PoolId to Vec<i64> (SWRR indices) which
ties state to candidate order; change it to bind state per account id so scores
follow accounts even when candidate lists reorder. Replace
Arc<Mutex<HashMap<PoolId, Vec<i64>>>> with a structure keyed by account
identifier (e.g., Arc<Mutex<HashMap<PoolId, HashMap<AccountId, i64>>>> or
HashMap<PoolId, BTreeMap<AccountId, i64>>) and update all places that read/write
weighted_state (including the blocks around weighted_state declaration and
usages at the regions you noted ~lines 304-309 and 315-326) to look up/update by
AccountId instead of by index, preserving creation/initialization, incrementing,
and wrap logic per account key.
- Around line 861-867: The test currently accepts either "cycle" or "exhausted"
in the error string which can mask regressions; update the assertion so that
after calling engine.resolve("cycle-model").await.unwrap_err() (and matching
RookError::Routing), the error message MUST contain "cycle" (not just
"exhausted"). Locate the test around engine.resolve("cycle-model") and
RookError::Routing and change the assertion to require msg.contains("cycle") (or
otherwise assert the Routing error is the cycle-detection variant) so the
fallback cycle is explicitly verified.
- Around line 240-245: The match over SelectionStrategy currently calls
selection helpers (select_priority, select_failover, select_round_robin,
select_weighted) in ways that can panic on poisoned Mutex/lock states; instead
update those selection functions (and all similar call sites referenced around
the other ranges) to handle PoisonError results from their internal lock
accesses and return a Err(RookError::Routing(...)) propagated back through
select_from_pool (which already returns Result), and change this match site to
propagate that Result rather than assuming success; specifically, modify
select_priority/select_failover/select_round_robin/select_weighted to map lock
poison into RookError::Routing with context and return Result<Account,
RookError>, then have select_from_pool use the returned Result from each helper
and propagate errors upward.
- Around line 708-717: The current assertions around heavy_count and light_count
are too permissive and allow starvation; update the assertions around the
weighted-distribution check (the block using heavy_count and light_count) to
require that heavy_count is within a tighter expected range (e.g., roughly
70–80% of 40, such as 24..=32 or similar) and that light_count is strictly > 0
and within a corresponding lower bound (e.g., 8..=16), so the test fails if the
light account is starved; modify the assert! calls referencing heavy_count and
light_count to enforce these tighter lower/upper bounds and an explicit
assert!(light_count > 0, ...) or combine into a single combined assertion
message to catch regressions in weighted routing.
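
The lock-poisoning comment above can be illustrated with a small round-robin helper that returns an error instead of panicking. This is a sketch under assumptions: `RoutingError`, `Cursors`, and `next_round_robin` are hypothetical names, and pool and account ids are plain strings.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical error type standing in for the crate's routing error.
#[derive(Debug, PartialEq)]
pub enum RoutingError {
    StatePoisoned(String),
    NoCandidates,
}

/// Per-pool round-robin cursors, shared between callers.
pub type Cursors = Arc<Mutex<HashMap<String, usize>>>;

pub fn next_round_robin(
    cursors: &Cursors,
    pool_id: &str,
    candidates: &[String],
) -> Result<String, RoutingError> {
    if candidates.is_empty() {
        return Err(RoutingError::NoCandidates);
    }
    // A poisoned mutex becomes a routing error with context, not a panic.
    let mut guard = cursors.lock().map_err(|e| {
        RoutingError::StatePoisoned(format!("round-robin cursor lock poisoned: {e}"))
    })?;
    let cursor = guard.entry(pool_id.to_string()).or_insert(0);
    let idx = *cursor % candidates.len();
    // Wrapping add keeps the cursor valid even after usize::MAX calls.
    *cursor = cursor.wrapping_add(1);
    Ok(candidates[idx].clone())
}
```

Whether a poisoned lock should fail the request or be recovered with `PoisonError::into_inner` is a design choice; the sketch follows the review's suggestion to surface it as an error.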
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 67e5e470-5d95-4c6d-be13-9477512a0cf8

📥 Commits

Reviewing files that changed from the base of the PR and between 1adbefa and 9a96954.

📒 Files selected for processing (2)
  • clients/rook/src/registry/mod.rs
  • clients/rook/src/routing/mod.rs
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Cloudflare Pages
🧰 Additional context used
📓 Path-based instructions (2)
**/*.rs

⚙️ CodeRabbit configuration file

**/*.rs: Focus on Rust idioms, memory safety, and ownership/borrowing correctness.
Flag unnecessary clones, unchecked panics in production paths, and weak error context.
Prioritize unsafe blocks, FFI boundaries, concurrency races, and secret handling.

Files:

  • clients/rook/src/registry/mod.rs
  • clients/rook/src/routing/mod.rs
**/*

⚙️ CodeRabbit configuration file

**/*: Security first, performance second.
Validate input boundaries, auth/authz implications, and secret management.
Look for behavioral regressions, missing tests, and contract breaks across modules.

Files:

  • clients/rook/src/registry/mod.rs
  • clients/rook/src/routing/mod.rs
🧠 Learnings (2)
📓 Common learnings
Learnt from: CR
Repo: dallay/corvus PR: 0
File: clients/agent-runtime/AGENTS.md:0-0
Timestamp: 2026-02-17T12:31:17.076Z
Learning: Applies to clients/agent-runtime/src/channels/**/*.rs : Implement `Channel` trait in `src/channels/` with consistent `send`, `listen`, and `health_check` semantics and cover auth/allowlist/health behavior with tests
📚 Learning: 2026-02-17T12:31:17.076Z
Learnt from: CR
Repo: dallay/corvus PR: 0
File: clients/agent-runtime/AGENTS.md:0-0
Timestamp: 2026-02-17T12:31:17.076Z
Learning: Applies to clients/agent-runtime/src/providers/**/*.rs : Implement `Provider` trait in `src/providers/` and register in `src/providers/mod.rs` factory when adding a new provider

Applied to files:

  • clients/rook/src/routing/mod.rs
🔇 Additional comments (1)
clients/rook/src/registry/mod.rs (1)

43-44: Test-only raw DB access is properly gated.

The retained SqliteDb handle and accessor are constrained to #[cfg(test)], so production consumers still cannot bypass the registry services.

Also applies to: 70-79

- weighted_state: change from Vec<i64> (by index) to HashMap<AccountId, i64>
  so state persists even when candidate order changes
- select_* fns: return Result<Account, RookError> and handle PoisonError
  instead of panicking on lock poisoning
- cycle_detection test: require "cycle" explicitly, not accept "exhausted"
- weighted_distribution test: tighter bounds (heavy: 24-32, light: 1-16)
  to catch regressions

Close #586
@cloudflare-workers-and-pages

cloudflare-workers-and-pages Bot commented Apr 20, 2026

Deploying corvus with Cloudflare Pages

Latest commit: 5dad006
Status: ✅  Deploy successful!
Preview URL: https://29339757.corvus-42x.pages.dev
Branch Preview URL: https://feat-rook-586-routing-engine.corvus-42x.pages.dev

@github-actions

github-actions Bot commented Apr 20, 2026

✅ Contributor Report

User: @yacosta738
Status: Passed (12/13 metrics passed)

| Metric | Description | Value | Threshold | Status |
| --- | --- | --- | --- | --- |
| PR Merge Rate | PRs merged vs closed | 91% | >= 30% | |
| Repo Quality | Repos with ≥100 stars | 0 | >= 0 | |
| Positive Reactions | Positive reactions received | 11 | >= 1 | |
| Negative Reactions | Negative reactions received | 0 | <= 5 | |
| Account Age | GitHub account age | 3097 days | >= 30 days | |
| Activity Consistency | Regular activity over time | 108% | >= 0% | |
| Issue Engagement | Issues with community engagement | 0 | >= 0 | |
| Code Reviews | Code reviews given to others | 584 | >= 0 | |
| Merger Diversity | Unique maintainers who merged PRs | 2 | >= 0 | |
| Repo History Merge Rate | Merge rate in this repo | 93% | >= 0% | |
| Repo History Min PRs | Previous PRs in this repo | 264 | >= 0 | |
| Profile Completeness | Profile richness (bio, followers) | 90 | >= 0 | |
| Suspicious Patterns | Spam-like activity detection | 1 | N/A | |

Contributor Report evaluates based on public GitHub activity. Analysis period: 2025-04-20 to 2026-04-20

- Restore routing module with AccountId-keyed weighted_state
- Restore health service with Clone derive
- Remove synchronous .await from health calls
- Reset to working state after merge conflict resolution

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@clients/rook/src/routing/mod.rs`:
- Around line 289-294: The code contains a dead re-binding: remove the redundant
no-op statement `let idx = idx;` that follows computing `idx` in the counter
update block (where `counters.entry(pool_id).or_insert(0)`, `idx = *counter_ref
% candidates.len()`, and `*counter_ref = counter_ref.wrapping_add(1)` are used);
simply delete the extra `let idx = idx;` line so the computed `idx` is used
directly.
- Around line 338-347: The loop uses pool_state.get_mut(&account.id).expect(...)
which can panic; instead avoid expect by reusing the mutable reference returned
when the entry was inserted or by using
pool_state.get_mut(&account.id).ok_or(RookError::Routing(...))? to propagate an
error; update the for loop that references account, pool_state, get_mut,
best_score and best_id to either store the &mut score from the insertion step
and modify it directly, or replace the expect with ok_or(...)? so the function
returns a Result rather than panicking.
- Around line 456-470: The three SQLite statements must run on the same
connection because PRAGMA foreign_keys is connection-scoped; change the code to
explicitly acquire a single connection from the pool (e.g., call acquire() on
the pool returned by registry.db().pool()) and then execute the three
sqlx::query(...) calls against that acquired connection (using the same
connection variable when calling .execute(...)), so the PRAGMA OFF, DELETE FROM
provider_pools WHERE id = pool_id.to_string(), and PRAGMA ON are pinned to one
connection and not dispatched on different pooled connections.
🪄 Autofix (Beta)

✅ Autofix completed


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 91ddd29f-6aac-4198-ba9c-1529ac632f94

📥 Commits

Reviewing files that changed from the base of the PR and between 9a96954 and 192cdb3.

📒 Files selected for processing (1)
  • clients/rook/src/routing/mod.rs
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Cloudflare Pages
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: CR
Repo: dallay/corvus PR: 0
File: clients/agent-runtime/AGENTS.md:0-0
Timestamp: 2026-02-17T12:31:17.076Z
Learning: Applies to clients/agent-runtime/src/channels/**/*.rs : Implement `Channel` trait in `src/channels/` with consistent `send`, `listen`, and `health_check` semantics and cover auth/allowlist/health behavior with tests
📚 Learning: 2026-02-17T12:31:17.076Z
Learnt from: CR
Repo: dallay/corvus PR: 0
File: clients/agent-runtime/AGENTS.md:0-0
Timestamp: 2026-02-17T12:31:17.076Z
Learning: Applies to clients/agent-runtime/src/main.rs : Preserve CLI contract unless change is intentional and documented; prefer explicit errors over silent fallback for unsupported critical paths

Applied to files:

  • clients/rook/src/routing/mod.rs

@coderabbitai

coderabbitai Bot commented Apr 20, 2026

Note

Autofix is a beta feature. Expect some limitations and changes as we gather feedback and continue to improve it.

Fixes Applied Successfully

Fixed 1 file(s) based on 3 unresolved review comments.

Files modified:

  • clients/rook/src/routing/mod.rs

Commit: de31c5b7da5b96c8ffb50493557a1aed9c57be67

The changes have been pushed to the feat/rook-586-routing-engine branch.

Time taken: 9m 48s


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@clients/rook/src/routing/mod.rs`:
- Around line 125-137: The current cycle guard in resolve_pool only checks
depth; update resolve_pool (and any callsites like try_fallback) to accept and
thread a mutable visited set (HashSet<PoolId> and/or HashSet<RouteId>) through
recursive calls, check at the start of resolve_pool whether the current pool_id
(or route_id) is already in the set and immediately return a descriptive
RookError (e.g. "fallback cycle: A → B → A") when a re-entry is detected, insert
the id into the set before recursing and remove/clone as appropriate to preserve
callers' state; keep the existing MAX_FALLBACK_DEPTH check as a secondary safety
cap.
- Around line 146-152: The loop over pool.members currently swallows missing
accounts; update the block around self.registry.accounts().get(member_id).await
to emit a tracing::warn! including pool.id (or pool identifier) and member_id
when get(...) returns None so missing/dangling members are observable, and after
the loop check if candidates.is_empty() while pool.members is non-empty and
return a RookError::Routing (distinct from "no healthy candidates") instead of
proceeding; modify references to candidates, pool.members, and
self.registry.accounts().get to implement these changes.
- Around line 193-202: The try_fallback path redundantly re-fetches the pool via
self.registry.pools().get(pool_id).await to read fallback_pool_id; change
try_fallback's signature and callers (e.g., resolve_pool) to accept the
already-fetched Pool (or simply an Option<PoolId> fallback_pool_id) so it can
read pool.fallback_pool_id without another registry lookup, update resolve_pool
to pass the pool or its fallback_pool_id into try_fallback, and remove the extra
call to self.registry.pools().get in try_fallback (also update any other callers
of try_fallback to match the new parameter).
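
The first comment above, a visited set backed by a depth cap, can be sketched as follows. The sketch swaps the recursive `resolve_pool` for an iterative walk over a plain map of fallback links, so `walk_fallbacks` and its `String` error are illustrative assumptions rather than the engine's API.

```rust
use std::collections::{HashMap, HashSet};

const MAX_FALLBACK_DEPTH: usize = 8;

/// Follow `fallbacks` (pool id -> fallback pool id) from `start`.
/// Returns the chain of pools visited, or an error that names the cycle.
pub fn walk_fallbacks(
    fallbacks: &HashMap<String, String>,
    start: &str,
) -> Result<Vec<String>, String> {
    let mut visited: HashSet<String> = HashSet::new();
    let mut path: Vec<String> = Vec::new();
    let mut current = start.to_string();

    loop {
        // `insert` returns false when the pool was already seen: a true cycle.
        if !visited.insert(current.clone()) {
            path.push(current);
            return Err(format!("fallback cycle: {}", path.join(" → ")));
        }
        path.push(current.clone());

        // Secondary safety cap for long but acyclic chains.
        if path.len() > MAX_FALLBACK_DEPTH {
            return Err(format!(
                "fallback chain exceeded {MAX_FALLBACK_DEPTH} hops"
            ));
        }

        match fallbacks.get(&current) {
            Some(next) => current = next.clone(),
            None => return Ok(path),
        }
    }
}
```

Tracking visited ids reports a cycle on the first re-entry and can name the pools involved, whereas a depth counter alone only fires after the cap and cannot distinguish a cycle from a long chain.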
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: f35395dc-7568-4dcf-9d9c-278c662e4c52

📥 Commits

Reviewing files that changed from the base of the PR and between 192cdb3 and de31c5b.

📒 Files selected for processing (1)
  • clients/rook/src/routing/mod.rs
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Cloudflare Pages
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: CR
Repo: dallay/corvus PR: 0
File: clients/agent-runtime/AGENTS.md:0-0
Timestamp: 2026-02-17T12:31:17.076Z
Learning: Applies to clients/agent-runtime/src/channels/**/*.rs : Implement `Channel` trait in `src/channels/` with consistent `send`, `listen`, and `health_check` semantics and cover auth/allowlist/health behavior with tests
📚 Learning: 2026-02-17T12:31:17.076Z
Learnt from: CR
Repo: dallay/corvus PR: 0
File: clients/agent-runtime/AGENTS.md:0-0
Timestamp: 2026-02-17T12:31:17.076Z
Learning: Applies to clients/agent-runtime/src/providers/**/*.rs : Implement `Provider` trait in `src/providers/` and register in `src/providers/mod.rs` factory when adding a new provider

Applied to files:

  • clients/rook/src/routing/mod.rs
📚 Learning: 2026-02-17T12:31:17.076Z
Learnt from: CR
Repo: dallay/corvus PR: 0
File: clients/agent-runtime/AGENTS.md:0-0
Timestamp: 2026-02-17T12:31:17.076Z
Learning: Applies to clients/agent-runtime/src/main.rs : Preserve CLI contract unless change is intentional and documented; prefer explicit errors over silent fallback for unsupported critical paths

Applied to files:

  • clients/rook/src/routing/mod.rs
📚 Learning: 2026-02-17T12:31:17.076Z
Learnt from: CR
Repo: dallay/corvus PR: 0
File: clients/agent-runtime/AGENTS.md:0-0
Timestamp: 2026-02-17T12:31:17.076Z
Learning: Applies to clients/agent-runtime/src/**/*.rs : Avoid unnecessary allocations, clones, and blocking operations to maintain performance and efficiency

Applied to files:

  • clients/rook/src/routing/mod.rs

- Remove redundant let idx = idx in select_round_robin
- Replace expect with ok_or in select_weighted to propagate errors
- Use single connection for PRAGMA FK in test (acquire())
- Add visited_pools HashSet to resolve_pool for cycle detection
- Add tracing::warn when missing account in pool_members
- Return error when pool has members but none found
- Pass pre-fetched pool to try_fallback to avoid re-fetch
- Import ProviderPool for try_fallback signature
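
The two items about missing pool members can be pictured with the sketch below. It assumes hypothetical `Account`, `RoutingError`, and `collect_candidates` names, and it uses `eprintln!` as a stand-in for `tracing::warn!` so the example needs only the standard library.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
pub struct Account {
    pub id: String,
}

/// Hypothetical error, kept distinct from a "no healthy candidates" case.
#[derive(Debug, PartialEq)]
pub enum RoutingError {
    DanglingMembers { pool_id: String, missing: Vec<String> },
}

/// Resolve a pool's member ids to accounts, reporting ids that no longer exist.
pub fn collect_candidates(
    pool_id: &str,
    members: &[String],
    accounts: &HashMap<String, Account>,
) -> Result<Vec<Account>, RoutingError> {
    let mut candidates = Vec::new();
    let mut missing = Vec::new();

    for member_id in members {
        match accounts.get(member_id) {
            Some(account) => candidates.push(account.clone()),
            None => {
                // Stand-in for `tracing::warn!`: make the dangling id observable.
                eprintln!("warn: pool {pool_id} references missing account {member_id}");
                missing.push(member_id.clone());
            }
        }
    }

    // Members were configured, yet none resolved: a data problem, not a health problem.
    if candidates.is_empty() && !members.is_empty() {
        return Err(RoutingError::DanglingMembers {
            pool_id: pool_id.to_string(),
            missing,
        });
    }
    Ok(candidates)
}
```

A pool with some dangling members still resolves in this sketch; only the case where every member is missing becomes an error.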
@yacosta738 force-pushed the feat/rook-586-routing-engine branch from a87fccd to 5dad006 on April 20, 2026 17:10
@yacosta738 merged commit d634ae1 into develop on Apr 20, 2026
7 checks passed
@yacosta738 deleted the feat/rook-586-routing-engine branch on April 20, 2026 17:17