feat(BA-3436): Implement Blue-Green deployment strategy#10050
Draft
jopemachine wants to merge 17 commits into BA-3436-promote-api
Fixes two latent bugs plus the matching test-side drift:

1. blue_green.py route classification used `RouteStatus.HEALTHY`/`UNHEALTHY` and `status.is_provisioning()`, neither of which exist on the current `RouteStatus` enum (these live on `RouteHealthStatus`). Every evaluation cycle crashed with AttributeError, so blue-green deployments got stuck in DEPLOYING forever. Use `RouteHealthStatus` for health checks and `RouteStatus.PROVISIONING` for the lifecycle check.
2. Manual `promoteDeployment` went through `apply_strategy_mutations` with `completed_ids`, which cleared sub_step but left lifecycle_stage at DEPLOYING. The FSM coordinator only runs the DEPLOYING → READY transition after a handler returns success; the manual path never reaches a handler, so deployments got stuck post-promote. Introduce a dedicated `complete_manual_promote` db_source method that performs the route mutations, revision swap, sub_step clear, and lifecycle transition atomically in one transaction.
3. test_blue_green.py was written against an older, merged `RouteStatus` enum (with HEALTHY/UNHEALTHY/DEGRADED members). Update the helper and call sites to use the split `RouteStatus` + `RouteHealthStatus` model. `RollingUpdateSpec` now takes `IntOrPercent` fields, so pass those instead of raw ints.

Verified end-to-end via `./bai` against a live manager: created a BLUE_GREEN deployment, added a v2 revision, observed PROVISIONING → AWAITING_PROMOTION with old traffic still active, called the promoteDeployment mutation, and confirmed READY with the revision swapped and blue routes draining.
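The first fix depends on lifecycle and health living on separate enums. A minimal sketch of that split, assuming only the member names mentioned in this PR; the string values, the presence of `TERMINATING` on `RouteStatus`, and the helper functions are hypothetical and not the project's actual definitions:

```python
from enum import Enum


class RouteStatus(Enum):
    """Lifecycle state only (values are illustrative assumptions)."""
    PROVISIONING = "provisioning"
    RUNNING = "running"
    TERMINATING = "terminating"


class RouteHealthStatus(Enum):
    """Health-check result, tracked separately from lifecycle."""
    NOT_CHECKED = "not_checked"
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"
    DEGRADED = "degraded"


def is_provisioning(status: RouteStatus) -> bool:
    # Lifecycle question: compare against the lifecycle enum.
    return status is RouteStatus.PROVISIONING


def is_healthy(health: RouteHealthStatus) -> bool:
    # Health question: compare against the health enum.
    return health is RouteHealthStatus.HEALTHY


def lookup_fails(name: str) -> bool:
    # Accessing a member that only exists on the other enum raises
    # AttributeError, which is the kind of crash described above.
    try:
        getattr(RouteStatus, name)
    except AttributeError:
        return True
    return False
```

In this sketch `lookup_fails("HEALTHY")` is true because `HEALTHY` is defined only on `RouteHealthStatus`.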
- Guard None deploying_revision_id in _build_route_creators
- Use RouteStatus.RUNNING + RouteHealthStatus.HEALTHY for promote classification

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extract (RouteStatus.RUNNING + RouteHealthStatus.HEALTHY) check into a reusable property on RouteInfo to express the "ready for traffic" concept.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
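A minimal sketch of such a property, assuming a simplified `RouteInfo` with only three fields. The enum values and the `RouteTrafficStatus` name are hypothetical; the real `RouteInfo` presumably carries more state:

```python
from dataclasses import dataclass
from enum import Enum


class RouteStatus(Enum):
    PROVISIONING = "provisioning"
    RUNNING = "running"


class RouteHealthStatus(Enum):
    NOT_CHECKED = "not_checked"
    HEALTHY = "healthy"


class RouteTrafficStatus(Enum):  # hypothetical name for the traffic_status type
    ACTIVE = "active"
    INACTIVE = "inactive"


@dataclass(frozen=True)
class RouteInfo:
    status: RouteStatus
    health_status: RouteHealthStatus
    traffic_status: RouteTrafficStatus

    @property
    def is_ready(self) -> bool:
        # "Ready for traffic" = running AND healthy.
        # traffic_status is deliberately not consulted here.
        return (
            self.status is RouteStatus.RUNNING
            and self.health_status is RouteHealthStatus.HEALTHY
        )
```

With this shape, a green route that is RUNNING and HEALTHY but still INACTIVE reports `is_ready == True`, so it can qualify for promotion before it carries any traffic.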
jopemachine
commented
Apr 16, 2026
- Clarify is_ready docstring: traffic_status intentionally excluded because blue-green green routes are INACTIVE until promotion
- Rename _build_route_creators → _build_green_route_creators for clarity

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Routes in RUNNING state with NOT_CHECKED or DEGRADED health_status were not classified into any bucket, causing total_green_running=0 and triggering duplicate route creation every coordinator cycle.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
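One way to make the classification exhaustive is a catch-all bucket for running routes whose health is neither HEALTHY nor UNHEALTHY. The sketch below illustrates that idea only. The bucket names, `GreenBuckets`, and `classify_green` are hypothetical, and the PR may have resolved the gap differently:

```python
from dataclasses import dataclass, field
from enum import Enum


class RouteStatus(Enum):
    PROVISIONING = "provisioning"
    RUNNING = "running"
    TERMINATING = "terminating"


class RouteHealthStatus(Enum):
    NOT_CHECKED = "not_checked"
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"
    DEGRADED = "degraded"


@dataclass
class GreenBuckets:
    provisioning: list[str] = field(default_factory=list)
    healthy: list[str] = field(default_factory=list)
    unhealthy: list[str] = field(default_factory=list)
    unverified: list[str] = field(default_factory=list)  # RUNNING, health undecided

    @property
    def total_running(self) -> int:
        # Every RUNNING route counts, whatever its health result.
        return len(self.healthy) + len(self.unhealthy) + len(self.unverified)


def classify_green(
    routes: list[tuple[str, RouteStatus, RouteHealthStatus]],
) -> GreenBuckets:
    buckets = GreenBuckets()
    for route_id, status, health in routes:
        if status is RouteStatus.PROVISIONING:
            buckets.provisioning.append(route_id)
        elif status is RouteStatus.RUNNING:
            if health is RouteHealthStatus.HEALTHY:
                buckets.healthy.append(route_id)
            elif health is RouteHealthStatus.UNHEALTHY:
                buckets.unhealthy.append(route_id)
            else:
                # NOT_CHECKED and DEGRADED land here instead of falling through.
                buckets.unverified.append(route_id)
        # Other lifecycle states (e.g. TERMINATING) are ignored in this sketch.
    return buckets
```

Because the `else` branch catches every remaining health value, a running route can no longer be missed by the count that decides whether new green routes must be created.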
…stuck

The evaluator's _check_completed path executed promotion in the applier but the handler didn't recognize DEPLOYING_COMPLETED as a success state, leaving the deployment stuck in deploying with empty sub_step. All promotions (auto and manual) now go through AwaitingPromotionHandler, which properly transitions to READY.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…strict

Return-outside-async-with was insufficient to satisfy CI's strict mypy (func-returns-value at GroupRow() / LegacyEndpointCreatorSpec() call sites). Switching to the standard AsyncGenerator[T, None] + yield pattern makes the return path unambiguous to mypy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
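The `AsyncGenerator[T, None]` + `yield` shape can be sketched as follows. `FakeRow`, `FakeSession`, and `group_row` are hypothetical stand-ins (the real code presumably uses a database session and `GroupRow`), and the sketch drives the generator by hand rather than through a test framework:

```python
import asyncio
from collections.abc import AsyncGenerator
from dataclasses import dataclass


@dataclass
class FakeRow:
    name: str


class FakeSession:
    """Stand-in for an async DB session usable with `async with`."""

    async def __aenter__(self) -> "FakeSession":
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        return None

    def add(self, row: FakeRow) -> None:
        pass  # a real session would stage the row for insertion


async def group_row() -> AsyncGenerator[FakeRow, None]:
    # Build the row inside the session block, then yield it once.
    # The annotation states exactly what is produced (FakeRow) and that
    # nothing is sent back in (None).
    async with FakeSession() as session:
        row = FakeRow(name="default")
        session.add(row)
    yield row


async def collect_rows() -> list[FakeRow]:
    # Exhaust the generator so it finishes cleanly.
    return [row async for row in group_row()]
```

Running `asyncio.run(collect_rows())` yields a single `FakeRow`; the function has one `yield` and no `return` value for the type checker to reason about.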
…egy FSM

Previously create_deployment called activate_revision on the initial revision, which set deploying_revision and drove the blue-green (or rolling) strategy FSM even though there was no prior "blue" revision to swap from. The result: a fresh blue-green deployment created INACTIVE green routes and required a promotion step to serve traffic, with current_revision staying None during the DEPLOYING phase.

The strategy FSM is an *update* path: it only makes sense when there is an existing revision serving traffic. For the initial deployment we now:

- add the revision to the revision library
- set current_revision directly on the endpoint
- trigger check_pending so the normal scheduler provisions ACTIVE routes for the current_revision

If no initial_revision is provided the deployment stays idle with current_revision=None and creates no routes; the caller must later add a revision and call activate_revision explicitly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
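The split between the initial path and the update path can be pictured with an in-memory model. Everything here is a hypothetical simplification: the `Endpoint` fields, the `check_pending_requested` flag standing in for the `check_pending` trigger, and the function bodies are assumptions, not the project's code:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Endpoint:
    revisions: list[str] = field(default_factory=list)
    current_revision: Optional[str] = None
    deploying_revision: Optional[str] = None
    check_pending_requested: bool = False  # stands in for triggering check_pending


def create_deployment(initial_revision: Optional[str]) -> Endpoint:
    endpoint = Endpoint()
    if initial_revision is None:
        # Idle deployment: no current revision, no routes requested.
        return endpoint
    # Initial path: bypass the strategy FSM entirely.
    endpoint.revisions.append(initial_revision)
    endpoint.current_revision = initial_revision
    endpoint.check_pending_requested = True
    return endpoint


def activate_revision(endpoint: Endpoint, revision: str) -> None:
    # Update path: only this sets deploying_revision, which is what the
    # strategy FSM (blue-green or rolling) acts on.
    if revision not in endpoint.revisions:
        endpoint.revisions.append(revision)
    endpoint.deploying_revision = revision
```

In this model a fresh deployment never has `deploying_revision` set, so nothing would enter the blue-green flow until `activate_revision` is called for a later revision.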
The base branch BA-3436-promote-api has TestPromoteDeployment tests that assert on the swapped-row count returned by promote_deployment. Commit cec6c8d changed the return type to None, breaking both mypy (func-returns-value at `swapped = await ...`) and the assertions (`assert swapped == 1`) once merged with base. Propagate the swap_query rowcount through complete_manual_promote and promote_deployment; the two callers (service.py, deploying.py) ignore the return value so behavior is unchanged for auto-promote.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
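To illustrate a single-transaction promote that also reports how many rows the revision swap touched, here is a sketch using the standard-library `sqlite3` module. The table layout, column names, and the synchronous signature are assumptions made for the example; the project's db_source is asynchronous and uses its own schema:

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    # Hypothetical schema with one endpoint mid-deployment (v1 -> v2).
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE endpoints (
            id TEXT PRIMARY KEY, current_revision TEXT, deploying_revision TEXT,
            sub_step TEXT, lifecycle_stage TEXT);
        CREATE TABLE routes (
            id TEXT PRIMARY KEY, endpoint_id TEXT, status TEXT, traffic_status TEXT);
        INSERT INTO endpoints VALUES ('ep', 'v1', 'v2', 'AWAITING_PROMOTION', 'DEPLOYING');
        INSERT INTO routes VALUES ('blue-1', 'ep', 'RUNNING', 'ACTIVE');
        INSERT INTO routes VALUES ('green-1', 'ep', 'RUNNING', 'INACTIVE');
        """
    )
    return conn


def complete_manual_promote(
    conn: sqlite3.Connection,
    endpoint_id: str,
    green_ids: list[str],
    blue_ids: list[str],
) -> int:
    # `with conn` commits on success and rolls back if anything raises,
    # so all four mutations land together or not at all.
    with conn:
        conn.executemany(
            "UPDATE routes SET traffic_status = 'ACTIVE' WHERE id = ?",
            [(i,) for i in green_ids],
        )
        conn.executemany(
            "UPDATE routes SET status = 'TERMINATING' WHERE id = ?",
            [(i,) for i in blue_ids],
        )
        swap = conn.execute(
            """
            UPDATE endpoints
               SET current_revision = deploying_revision,
                   deploying_revision = NULL,
                   sub_step = NULL,
                   lifecycle_stage = 'READY'
             WHERE id = ? AND deploying_revision IS NOT NULL
            """,
            (endpoint_id,),
        )
        # Propagate the swap's row count to the caller.
        return swap.rowcount
```

A first call on the sample database returns 1; a repeated call returns 0 because no deploying revision remains to swap, which is the kind of signal a `swapped == 1` assertion relies on.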
Resolves BA-3436

Summary
- `BlueGreenStrategy.evaluate_cycle()` FSM with 8-step flow: classify routes → create green (INACTIVE) → wait provisioning → rollback if all failed → wait healthy → manual promote gate → promote delay check → atomic promotion (green→ACTIVE, blue→TERMINATING)
- Add `promote_route_ids` to `RouteChanges` for blue-green traffic switch, alongside existing `rollout_specs` (create) and `drain_route_ids` (terminate)
- `BatchUpdater` (traffic_status→ACTIVE), repository, and db_source transaction
- Use `created_at` as conservative proxy for promote delay timing until `status_updated_at` is added to `RouteInfo`

Key Design Decisions
- Green routes are created with `traffic_status=INACTIVE` and `traffic_ratio=0.0` so they don't receive traffic until explicitly promoted
- In `apply_strategy_mutations`, promote executes before drain to avoid a window where no routes serve traffic

Test Plan
- `test_blue_green.py`: 14 unit tests covering all 8 FSM steps
- `test_applier.py`: existing applier tests pass with `promote` parameter addition

🤖 Generated with Claude Code
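The 8-step flow and the promote-before-drain ordering listed in the description can be sketched as a pure decision function. This is a simplified illustration under stated assumptions, not the PR's implementation: `GreenState`, `Route`, `create_count` (standing in for `rollout_specs`), the step labels, and the `auto_promote`/`promote_delay` parameters are all hypothetical, and `desired` is assumed to be at least 1:

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from enum import Enum


class GreenState(Enum):
    # Hypothetical collapsed view of a green route (lifecycle + health).
    PROVISIONING = "provisioning"
    HEALTHY = "healthy"
    FAILED = "failed"


@dataclass
class Route:
    route_id: str
    state: GreenState
    created_at: datetime


@dataclass
class RouteChanges:
    create_count: int = 0  # stands in for rollout_specs
    promote_route_ids: list[str] = field(default_factory=list)
    drain_route_ids: list[str] = field(default_factory=list)


def evaluate_cycle(
    blue_ids: list[str],
    green: list[Route],
    desired: int,
    auto_promote: bool,
    promote_delay: timedelta,
    now: datetime,
) -> tuple[str, RouteChanges]:
    # Steps 1-2: classify; with no green routes yet, request INACTIVE ones.
    if not green:
        return "PROVISIONING", RouteChanges(create_count=desired)
    # Step 3: wait while any green route is still provisioning.
    if any(r.state is GreenState.PROVISIONING for r in green):
        return "PROVISIONING", RouteChanges()
    # Step 4: roll back when every green route failed.
    if all(r.state is GreenState.FAILED for r in green):
        return "ROLLED_BACK", RouteChanges(drain_route_ids=[r.route_id for r in green])
    # Step 5: wait until enough green routes are healthy.
    healthy = [r for r in green if r.state is GreenState.HEALTHY]
    if len(healthy) < desired:
        return "WAITING_HEALTHY", RouteChanges()
    # Step 6: manual promote gate.
    if not auto_promote:
        return "AWAITING_PROMOTION", RouteChanges()
    # Step 7: promote delay, measured from created_at as a proxy.
    newest = max(r.created_at for r in healthy)
    if now - newest < promote_delay:
        return "AWAITING_PROMOTION", RouteChanges()
    # Step 8: promote green and drain blue in one set of changes.
    return "COMPLETED", RouteChanges(
        promote_route_ids=[r.route_id for r in healthy],
        drain_route_ids=list(blue_ids),
    )


def apply_changes(changes: RouteChanges) -> list[str]:
    # Promote first, then drain, so some route is always serving traffic.
    ops = [f"promote:{i}" for i in changes.promote_route_ids]
    ops += [f"drain:{i}" for i in changes.drain_route_ids]
    return ops
```

Keeping the decision pure (inputs in, step label and `RouteChanges` out) makes each of the eight steps straightforward to unit test in isolation, while the ordering guarantee lives in the applier.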
📚 Documentation preview 📚: https://sorna--10050.org.readthedocs.build/en/10050/
📚 Documentation preview 📚: https://sorna-ko--10050.org.readthedocs.build/ko/10050/