feat(BA-3436): Implement Blue-Green deployment strategy#10050

Draft
jopemachine wants to merge 17 commits into BA-3436-promote-api from BA-3436_3

@jopemachine jopemachine commented Mar 13, 2026

Resolves BA-3436

Summary

  • Implement BlueGreenStrategy.evaluate_cycle() FSM with 8-step flow: classify routes → create green (INACTIVE) → wait provisioning → rollback if all failed → wait healthy → manual promote gate → promote delay check → atomic promotion (green→ACTIVE, blue→TERMINATING)
  • Add promote_route_ids to RouteChanges for blue-green traffic switch, alongside existing rollout_specs (create) and drain_route_ids (terminate)
  • Wire promote handling through the full stack: evaluator recording, applier BatchUpdater (traffic_status→ACTIVE), repository, and db_source transaction
  • Add comprehensive unit tests (14 cases) covering all FSM branches: no green routes, provisioning wait, all-failed rollback, partial healthy, manual/auto/delayed promote, edge cases
  • Use created_at as a conservative proxy for promote delay timing until status_updated_at is added to RouteInfo
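
The 8-step flow in the first bullet can be read as a chain of early returns. The following is a minimal sketch under stated assumptions, not the code in this PR: `Step`, `Route`, `Changes`, the string-valued `state` field, and the function signature are all hypothetical stand-ins, and the choices to drain green routes on rollback and to compare healthy routes against a `desired` count are guesses made only to keep the example runnable. The promote delay is measured from `created_at`, as the last bullet describes.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from enum import Enum


class Step(Enum):
    # Hypothetical labels for the outcome of one evaluation cycle.
    CREATE_GREEN = "create_green"
    WAIT_PROVISIONING = "wait_provisioning"
    ROLLBACK = "rollback"
    WAIT_HEALTHY = "wait_healthy"
    AWAIT_MANUAL_PROMOTE = "await_manual_promote"
    WAIT_PROMOTE_DELAY = "wait_promote_delay"
    PROMOTE = "promote"


@dataclass
class Route:
    # Simplified stand-in for RouteInfo; field names are assumptions.
    route_id: str
    is_green: bool
    state: str  # "provisioning" | "healthy" | "failed"
    created_at: datetime


@dataclass
class Changes:
    # Mirrors the three kinds of change named in the summary.
    rollout_specs: int = 0
    promote_route_ids: list[str] = field(default_factory=list)
    drain_route_ids: list[str] = field(default_factory=list)


def evaluate_cycle(routes, desired, auto_promote, promote_delay, now):
    blue = [r for r in routes if not r.is_green]  # 1. classify routes
    green = [r for r in routes if r.is_green]
    if not green:  # 2. create green routes (INACTIVE)
        return Step.CREATE_GREEN, Changes(rollout_specs=desired)
    if any(r.state == "provisioning" for r in green):  # 3. wait provisioning
        return Step.WAIT_PROVISIONING, Changes()
    if all(r.state == "failed" for r in green):  # 4. rollback if all failed
        return Step.ROLLBACK, Changes(drain_route_ids=[r.route_id for r in green])
    healthy = [r for r in green if r.state == "healthy"]
    if len(healthy) < desired:  # 5. wait until enough green routes are healthy
        return Step.WAIT_HEALTHY, Changes()
    if not auto_promote:  # 6. manual promote gate
        return Step.AWAIT_MANUAL_PROMOTE, Changes()
    newest = max(r.created_at for r in healthy)
    if now - newest < promote_delay:  # 7. promote delay check (created_at proxy)
        return Step.WAIT_PROMOTE_DELAY, Changes()
    return Step.PROMOTE, Changes(  # 8. promote green, drain blue
        promote_route_ids=[r.route_id for r in healthy],
        drain_route_ids=[r.route_id for r in blue],
    )


now = datetime(2026, 1, 1, tzinfo=timezone.utc)
blue_route = Route("b1", False, "healthy", now - timedelta(hours=1))
green_route = Route("g1", True, "healthy", now - timedelta(seconds=10))
step, changes = evaluate_cycle(
    [blue_route, green_route], 1, True, timedelta(seconds=5), now
)
```

Each cycle returns at most one decision, so the evaluator stays stateless between cycles and can be re-run safely against whatever routes currently exist.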

Key Design Decisions

  • INACTIVE green routes: New-revision routes are created with traffic_status=INACTIVE and traffic_ratio=0.0 so they don't receive traffic until explicitly promoted
  • Atomic promotion: On COMPLETED, green routes are promoted (→ACTIVE) and blue routes are drained (→TERMINATING) in a single DB transaction
  • Promote ordering: In apply_strategy_mutations, promote executes before drain to avoid a window where no routes serve traffic
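
As an illustration of the atomic promotion and promote-before-drain ordering described above, here is a small sketch using the standard-library `sqlite3` module. The table layout, column names, and the `apply_promotion` helper are assumptions made for the example; the PR itself goes through the manager's repository and db_source layers, which are not reproduced here.

```python
import sqlite3


def apply_promotion(conn, promote_ids, drain_ids):
    """Promote green routes, then drain blue routes, in one transaction."""
    with conn:  # commits on success, rolls back if any statement raises
        # Promote first, mirroring the ordering described in the PR.
        conn.executemany(
            "UPDATE routes SET traffic_status = 'ACTIVE' WHERE id = ?",
            [(rid,) for rid in promote_ids],
        )
        conn.executemany(
            "UPDATE routes SET status = 'TERMINATING' WHERE id = ?",
            [(rid,) for rid in drain_ids],
        )


# Hypothetical schema: one row per route.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE routes (id TEXT PRIMARY KEY, status TEXT, traffic_status TEXT)"
)
conn.executemany(
    "INSERT INTO routes VALUES (?, ?, ?)",
    [("blue-1", "RUNNING", "ACTIVE"), ("green-1", "RUNNING", "INACTIVE")],
)
conn.commit()

apply_promotion(conn, ["green-1"], ["blue-1"])
rows = dict(conn.execute("SELECT id, status || '/' || traffic_status FROM routes"))
```

Because both updates share one transaction, a failure in the drain step also undoes the promotion, so the routes never end up half-switched.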

Test Plan

  • test_blue_green.py — 14 unit tests covering all 8 FSM steps
  • test_applier.py — existing applier tests pass with promote parameter addition
  • All deployment coordinator tests pass (8/8)
  • mypy type check passes
  • ruff lint/format passes

🤖 Generated with Claude Code


📚 Documentation preview 📚: https://sorna--10050.org.readthedocs.build/en/10050/


📚 Documentation preview 📚: https://sorna-ko--10050.org.readthedocs.build/ko/10050/

@github-actions github-actions bot added size:XL 500~ LoC comp:manager Related to Manager component labels Mar 13, 2026
@jopemachine jopemachine changed the title feat(BA-3436): Implement Blue-Green deployment strategy feat(BA-3436): Implement Blue-Green deployment strategy (WIP) Mar 13, 2026
@jopemachine jopemachine added this to the 26.3 milestone Mar 13, 2026
@jopemachine jopemachine changed the title feat(BA-3436): Implement Blue-Green deployment strategy (WIP) feat(BA-3436): Implement Blue-Green deployment strategy Mar 13, 2026
@jopemachine jopemachine modified the milestones: 26.3, 26.4 Mar 17, 2026
@jopemachine jopemachine force-pushed the BA-3436_3 branch 2 times, most recently from 6722a22 to e4ce96b Compare March 23, 2026 08:06
@github-actions github-actions bot added comp:common Related to Common component area:docs Documentations labels Mar 23, 2026
@jopemachine jopemachine changed the base branch from main to BA-3436-promote-api March 23, 2026 10:35
@jopemachine jopemachine force-pushed the BA-3436-promote-api branch from e3eca7a to 97874ea Compare April 2, 2026 07:36
@jopemachine jopemachine force-pushed the BA-3436-promote-api branch from 8a02693 to 5ba1db6 Compare April 15, 2026 05:31
Fixes two latent bugs plus the matching test-side drift:

1. blue_green.py route classification used `RouteStatus.HEALTHY`/`UNHEALTHY`
   and `status.is_provisioning()`, none of which exist on the current
   `RouteStatus` enum (these live on `RouteHealthStatus`). Every evaluation
   cycle crashed with AttributeError, so blue-green deployments got stuck
   in DEPLOYING forever. Use `RouteHealthStatus` for health checks and
   `RouteStatus.PROVISIONING` for the lifecycle check.

2. Manual `promoteDeployment` went through `apply_strategy_mutations` with
   `completed_ids`, which cleared sub_step but left lifecycle_stage at
   DEPLOYING. The FSM coordinator only runs the DEPLOYING → READY
   transition after a handler returns success; the manual path never
   reaches a handler, so deployments got stuck post-promote. Introduce
   a dedicated `complete_manual_promote` db_source method that performs
   the route mutations, revision swap, sub_step clear, and lifecycle
   transition atomically in one transaction.

3. test_blue_green.py was written against an older, merged `RouteStatus`
   enum (with HEALTHY/UNHEALTHY/DEGRADED members). Update the helper and
   call sites to use the split `RouteStatus` + `RouteHealthStatus` model.
   `RollingUpdateSpec` now takes `IntOrPercent` fields, so pass those
   instead of raw ints.

Verified end-to-end via `./bai` against a live manager: created a
BLUE_GREEN deployment, added a v2 revision, observed PROVISIONING →
AWAITING_PROMOTION with old traffic still active, called the
promoteDeployment mutation, and confirmed READY with the revision
swapped and blue routes draining.
jopemachine and others added 3 commits April 15, 2026 18:53
- Guard None deploying_revision_id in _build_route_creators
- Use RouteStatus.RUNNING + RouteHealthStatus.HEALTHY for promote classification

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extract (RouteStatus.RUNNING + RouteHealthStatus.HEALTHY) check into a
reusable property on RouteInfo to express the "ready for traffic" concept.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
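
A minimal sketch of what such a "ready for traffic" property might look like, assuming the split enum model described in these commits. The enum values, the `TrafficStatus` name, and the reduced `RouteInfo` fields are hypothetical; only the member names `RUNNING`, `PROVISIONING`, `HEALTHY`, `UNHEALTHY`, `DEGRADED`, and `NOT_CHECKED` come from the commit messages.

```python
from dataclasses import dataclass
from enum import Enum


class RouteStatus(Enum):  # lifecycle only (subset, values assumed)
    PROVISIONING = "provisioning"
    RUNNING = "running"


class RouteHealthStatus(Enum):  # health-check result (values assumed)
    NOT_CHECKED = "not_checked"
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"
    DEGRADED = "degraded"


class TrafficStatus(Enum):  # hypothetical name for the traffic_status type
    ACTIVE = "active"
    INACTIVE = "inactive"


@dataclass(frozen=True)
class RouteInfo:  # reduced to the fields this example needs
    status: RouteStatus
    health_status: RouteHealthStatus
    traffic_status: TrafficStatus

    @property
    def is_ready(self) -> bool:
        # traffic_status is deliberately not consulted: a green route is
        # INACTIVE until promotion but can still be ready to take traffic.
        return (
            self.status is RouteStatus.RUNNING
            and self.health_status is RouteHealthStatus.HEALTHY
        )


green = RouteInfo(RouteStatus.RUNNING, RouteHealthStatus.HEALTHY, TrafficStatus.INACTIVE)
unchecked = RouteInfo(RouteStatus.RUNNING, RouteHealthStatus.NOT_CHECKED, TrafficStatus.ACTIVE)
```

Here `green.is_ready` is true even though the route is INACTIVE, while `unchecked.is_ready` is false despite the route being ACTIVE.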
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment thread src/ai/backend/manager/data/deployment/types.py Outdated
Comment thread src/ai/backend/manager/sokovan/deployment/strategy/blue_green.py Outdated
jopemachine and others added 7 commits April 16, 2026 10:43
- Clarify is_ready docstring: traffic_status intentionally excluded because
  blue-green green routes are INACTIVE until promotion
- Rename _build_route_creators → _build_green_route_creators for clarity

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Routes in RUNNING state with NOT_CHECKED or DEGRADED health_status
were not classified into any bucket, causing total_green_running=0
and triggering duplicate route creation every coordinator cycle.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
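
To make the gap concrete, the sketch below classifies routes so that every RUNNING route lands in some bucket. The bucket names and the `classify` helper are hypothetical, and the commit does not say which bucket the fix actually assigns NOT_CHECKED or DEGRADED routes to; the point is only that a catch-all branch keeps the running total from dropping to zero.

```python
from collections import Counter
from enum import Enum


class RouteStatus(Enum):  # subset, values assumed
    PROVISIONING = "provisioning"
    RUNNING = "running"


class RouteHealthStatus(Enum):  # values assumed
    NOT_CHECKED = "not_checked"
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"
    DEGRADED = "degraded"


def classify(status: RouteStatus, health: RouteHealthStatus) -> str:
    if status is RouteStatus.PROVISIONING:
        return "provisioning"
    if health is RouteHealthStatus.HEALTHY:
        return "healthy"
    if health is RouteHealthStatus.UNHEALTHY:
        return "unhealthy"
    # NOT_CHECKED and DEGRADED: running, but not yet promotable.
    # Without this branch such routes would be counted nowhere.
    return "running_unverified"


green_routes = [
    (RouteStatus.RUNNING, RouteHealthStatus.NOT_CHECKED),
    (RouteStatus.RUNNING, RouteHealthStatus.DEGRADED),
]
buckets = Counter(classify(s, h) for s, h in green_routes)
total_green_running = sum(
    n for name, n in buckets.items() if name != "provisioning"
)
```

With the catch-all in place `total_green_running` is 2 for these routes, so a coordinator that creates green routes only when the total is zero would not create duplicates.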
…stuck

The evaluator's _check_completed path executed the promotion in the applier,
but the handler didn't recognize DEPLOYING_COMPLETED as a success state,
leaving the deployment stuck in DEPLOYING with an empty sub_step.

All promotions (auto and manual) now go through AwaitingPromotionHandler,
which properly transitions to READY.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…strict

Return-outside-async-with was insufficient to satisfy CI's strict mypy
(func-returns-value at GroupRow() / LegacyEndpointCreatorSpec() call sites).
Switching to the standard AsyncGenerator[T, None] + yield pattern makes
the return path unambiguous to mypy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
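
For readers unfamiliar with the pattern, this is a minimal sketch of the `AsyncGenerator[T, None]` + `yield` shape the commit refers to. Everything here is a hypothetical stand-in: `GroupRecord` replaces the real row classes, `open_session` replaces the real database session, and the generator is driven by hand rather than by a test framework.

```python
import asyncio
from collections.abc import AsyncGenerator
from contextlib import asynccontextmanager
from dataclasses import dataclass

events: list[str] = []  # records setup/teardown order for the demo


@dataclass
class GroupRecord:  # hypothetical stand-in for an ORM row
    name: str


@asynccontextmanager
async def open_session() -> AsyncGenerator[list[GroupRecord], None]:
    session: list[GroupRecord] = []  # a list plays the role of a DB session
    events.append("open")
    try:
        yield session
    finally:
        events.append("close")


async def group_fixture() -> AsyncGenerator[GroupRecord, None]:
    # The value is yielded while the session is still open, so the
    # declared return type states exactly one thing: the yielded T.
    async with open_session() as session:
        group = GroupRecord(name="default")
        session.append(group)
        yield group
        session.remove(group)  # teardown runs after the consumer is done


async def main() -> list[str]:
    names = []
    async for group in group_fixture():
        events.append("use")
        names.append(group.name)
    return names


names = asyncio.run(main())
```

The consumer sees the yielded object between "open" and "close", and the teardown code after `yield` runs once iteration finishes.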
…egy FSM

Previously create_deployment called activate_revision on the initial
revision, which set deploying_revision and drove the blue-green (or
rolling) strategy FSM even though there was no prior "blue" revision
to swap from. The result: a fresh blue-green deployment created INACTIVE
green routes and required a promotion step to serve traffic, with
current_revision staying None during the DEPLOYING phase.

The strategy FSM is an *update* path — it only makes sense when there
is an existing revision serving traffic. For the initial deployment we
now:

- add the revision to the revision library
- set current_revision directly on the endpoint
- trigger check_pending so the normal scheduler provisions ACTIVE
  routes for the current_revision

If no initial_revision is provided the deployment stays idle with
current_revision=None and creates no routes — the caller must later
add a revision and call activate_revision explicitly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
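
The split between the initial-deployment path and the update path might be pictured as follows. This is a simplified in-memory sketch, not the manager's code: the `Endpoint` dataclass, its fields, and the `pending_checks` counter (standing in for triggering check_pending) are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Endpoint:  # hypothetical, reduced model of a deployment endpoint
    current_revision: Optional[str] = None
    deploying_revision: Optional[str] = None
    revisions: list[str] = field(default_factory=list)
    pending_checks: int = 0  # stands in for "check_pending was triggered"


def create_deployment(initial_revision: Optional[str] = None) -> Endpoint:
    endpoint = Endpoint()
    if initial_revision is None:
        # No revision yet: stay idle, create no routes.
        return endpoint
    endpoint.revisions.append(initial_revision)   # add to the revision library
    endpoint.current_revision = initial_revision  # set directly, no strategy FSM
    endpoint.pending_checks += 1                  # let the scheduler provision routes
    return endpoint


def activate_revision(endpoint: Endpoint, revision: str) -> None:
    # Update path: this is what drives the blue-green (or rolling) FSM.
    if revision not in endpoint.revisions:
        endpoint.revisions.append(revision)
    endpoint.deploying_revision = revision


fresh = create_deployment("rev-1")
idle = create_deployment()
activate_revision(fresh, "rev-2")
```

After these calls `fresh` serves `rev-1` as its current revision while `rev-2` is the deploying revision, and `idle` has neither until a caller adds and activates one.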
The base branch BA-3436-promote-api has TestPromoteDeployment tests
that assert on the swapped-row count returned by promote_deployment.
Commit cec6c8d changed the return type to None, breaking both mypy
(func-returns-value at `swapped = await ...`) and the assertions
(`assert swapped == 1`) once merged with base.

Propagate the swap_query rowcount through complete_manual_promote and
promote_deployment; the two callers (service.py, deploying.py) ignore
the return value so behavior is unchanged for auto-promote.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
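
A small sketch of propagating the swap rowcount through both layers, using the standard-library `sqlite3` module in place of the project's async database layer. The table, its columns, and the SQL statement are assumptions; only the function names `complete_manual_promote` and `promote_deployment` and the idea of returning the swap query's rowcount come from the commit message.

```python
import sqlite3


def complete_manual_promote(conn: sqlite3.Connection, endpoint_id: str) -> int:
    """Swap the deploying revision into place; return the number of rows swapped."""
    with conn:  # one transaction for the whole promote
        cursor = conn.execute(
            "UPDATE endpoints "
            "SET current_revision = deploying_revision, deploying_revision = NULL "
            "WHERE id = ? AND deploying_revision IS NOT NULL",
            (endpoint_id,),
        )
        return cursor.rowcount


def promote_deployment(conn: sqlite3.Connection, endpoint_id: str) -> int:
    # Pass the count through so callers and tests can assert on it;
    # callers that do not need it can simply ignore the return value.
    return complete_manual_promote(conn, endpoint_id)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE endpoints "
    "(id TEXT PRIMARY KEY, current_revision TEXT, deploying_revision TEXT)"
)
conn.execute("INSERT INTO endpoints VALUES ('ep-1', 'rev-1', 'rev-2')")
conn.commit()

swapped = promote_deployment(conn, "ep-1")
```

A second call for the same endpoint matches no rows, because nothing is left to promote, which gives tests a cheap way to tell a real swap from a no-op.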