
feat(mcp): impact_analysis v2 — downstream value_diff + data_impact field#1258

Open
iamcxa wants to merge 8 commits into main from feat/mcp-impact-analysis-v2

Conversation

@iamcxa
Contributor

@iamcxa iamcxa commented Mar 27, 2026

Summary

Extends impact_analysis to run value_diff on downstream table models and adds a data_impact field so agents can distinguish DAG-reachable models from actually data-affected models.

Problem

  • Downstream models listed as confirmed_impacted with value_diff: null — agents can't tell if data actually changed
  • View-modified scenarios (e.g., stg_orders → orders) produce total_affected_row_count=0 because downstream tables skip value_diff
  • _guidance text says "DO NOT OVERRIDE" — forces agents to blindly trust noisy DAG classifications

Changes

| Commit | Description |
| --- | --- |
| 184d00e3 | Rename response fields: confirmed_impacted_models, confirmed_not_impacted_models, total_affected_row_count |
| 4f698386 | Include views in row_count_diff (was incorrectly skipping them) |
| 45010042 | Extend value_diff to downstream table models + add data_impact field (confirmed/none/potential) + rewrite _guidance |
| a5e65b0e | Clarify guidance text for skip_downstream_value_diff case |
| b2d7e9bc | Add 7 behavioral tests for data_impact, downstream value_diff, skip flags, and guidance |

New data_impact field

| Value | Meaning | Trigger |
| --- | --- | --- |
| confirmed | Value diff verified data changes exist | affected_row_count > 0 |
| none | Value diff verified zero data changes | affected_row_count == 0 |
| potential | No value diff available (views, no PK, skipped) | value_diff is null |

When data_impact="potential", affected_row_count is forced to null to avoid misleading row_count fallback values.
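The table above maps to a small piece of classification logic. As a minimal sketch (the function name is illustrative; the real logic lives in mcp_server.py and may differ in detail):

```python
def classify_data_impact(value_diff):
    """Map a model's value_diff result to a data_impact token.

    Sketch of the classification described above, not the actual
    implementation.
    """
    if value_diff is None:
        # Views, missing PK, or skipped: no verification ran.
        # affected_row_count is forced to null so a row_count
        # fallback can't masquerade as verified data impact.
        return {"data_impact": "potential", "affected_row_count": None}
    if value_diff["affected_row_count"] > 0:
        return {"data_impact": "confirmed",
                "affected_row_count": value_diff["affected_row_count"]}
    return {"data_impact": "none", "affected_row_count": 0}
```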

New skip_downstream_value_diff parameter

  • Default false: run value_diff on ALL impacted table models (modified + downstream)
  • true: skip downstream models (current behavior, opt-out for large DAGs)

Backward compatibility

All changes are additive — no existing fields removed, skip_downstream_value_diff defaults to false, list names unchanged.

Test plan

  • 101 unit tests pass (7 new behavioral tests in TestImpactAnalysisBehavior)
  • Verify data_impact field present on all models in confirmed_impacted_models
  • Verify data_impact="potential" forces affected_row_count=null
  • Verify skip_downstream_value_diff=true skips downstream value_diff
  • Verify skip_value_diff=true takes precedence over skip_downstream_value_diff
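The skip-flag precedence in the last item can be sketched as a single predicate (a hypothetical helper; the real check sits inside the value_diff loop):

```python
def should_run_value_diff(is_downstream, skip_value_diff=False,
                          skip_downstream_value_diff=False):
    """Decide whether value_diff runs for one model (sketch).

    skip_value_diff disables value_diff everywhere and takes
    precedence over the downstream-only flag.
    """
    if skip_value_diff:
        return False
    if is_downstream and skip_downstream_value_diff:
        return False
    return True
```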

🤖 Generated with Claude Code

iamcxa and others added 6 commits March 27, 2026 17:14
…ompliance

Rename response fields to signal authority and match consumer output schema:
- impacted_models → confirmed_impacted_models
- not_impacted_models → confirmed_not_impacted_models
- Add per-model affected_row_count (value_diff total or abs(row_count.delta) fallback)
- Add response-level total_affected_row_count (max across models)
- value_diff.rows_changed → value_diff.affected_row_count (per-column too)

"confirmed_" prefix prevents agents from overriding DAG classifications
with their own analysis. Pre-computed affected_row_count eliminates
semantic translation errors (agent copies directly instead of interpreting).

Eval result (ch3-phantom-filter, Sonnet, bare mode):
  Before: 8/12 (agent overrides 3 downstream models + wrong row count)
  After:  12/12 (agent copies confirmed lists + total_affected_row_count)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Kent <iamcxa@gmail.com>
Views were skipped from row_count_diff, causing affected_row_count
to be null/0 for view-only changes. Views support SELECT COUNT(*)
and their row count delta is essential for detecting filtered rows
(e.g., WHERE clause dropping rows from a staging view).

Only value_diff (PK Join) should skip views (expensive). Row count
comparison is cheap and should always run.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Kent <iamcxa@gmail.com>
…_impact field

- Add skip_downstream_value_diff parameter (opt-out for large DAGs)
- Remove downstream skip in value_diff loop (table models now compared)
- Add data_impact field: confirmed/none/potential per model
- Force affected_row_count=null when data_impact=potential
- Rewrite _guidance from prescriptive to descriptive

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Kent <iamcxa@gmail.com>
… case

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Kent <iamcxa@gmail.com>
…f, and guidance

Adds TestImpactAnalysisBehavior covering: all models have data_impact field,
confirmed/none/potential classification logic, null affected_row_count for
potential models, guidance text rules, skip_downstream_value_diff, and
skip_value_diff precedence.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Kent <iamcxa@gmail.com>
@even-wei
Contributor

Design Proposal: impact_analysis as Incremental Triage

@iamcxa — Great foundation with the data_impact field and skip_downstream_value_diff. Before we merge, I want to propose a design direction that builds on your work and makes impact_analysis more effective for agent-driven PR validation.

Core Principles

  1. Start cheap, follow the DAG — Check modified models first (using state:modified.body state:modified.macros state:modified.contract for precise selection — not state:modified which includes cosmetic changes). Check terminal consumer nodes last.
  2. Metadata before data — Row count + schema are free signals. Data-level queries (query_diff, value_diff) are expensive. Only run them when metadata signals warrant investigation.
  3. impact_analysis = triage, not results — Its job is to point out what to check, not do the checking. Keep it light. Let agents call targeted tools based on triage signals.
  4. Row count ≠ signal of correctness — Row count delta direction depends entirely on the PR's intent. A WHERE clause change should change row count. A calculation fix shouldn't. Triage reports the signal without judging it. Priority is driven by code change type (body/macros/contract), not data change direction.

Three-Layer Validation Pyramid

         ┌──────────────┐
         │  value_diff  │  Expensive: full row-level JOIN
         │ profile_diff │  Only 1-2 models reach here
         └──────┬───────┘
            ┌───┴────┐
            │ query_ │  Cheap: targeted aggregate query
            │  diff  │  Filters out models with no real change
            └───┬────┘
        ┌───────┴─────────┐
        │ impact_analysis │  Free: metadata only (row_count + schema)
        │    (triage)     │  All impacted models
        └─────────────────┘

| Layer | Tool | Cost | Purpose |
| --- | --- | --- | --- |
| 1. Triage | impact_analysis | ~0 (metadata) | Classify all models, suggest what to check |
| 2. Verify | query_diff | 1 aggregate query/model | Confirm whether data actually changed |
| 3. Investigate | value_diff / profile_diff | Full scan | Understand how data changed, row by row |

Most models get filtered out at each layer. The expensive step only runs on the 1-2 models that actually need it.

Key change: impact_analysis suggests the verification query

Instead of running value_diff itself, impact_analysis returns a per-model next_action with a suggested query for query_diff. The agent follows the suggestion directly — no SQL construction needed:

{
  "name": "orders",
  "change_status": "modified",
  "materialized": "table",
  "row_count": {"base": 100, "current": 100, "delta": 0},
  "schema_changes": [],
  "next_action": {
    "tool": "query_diff",
    "query": "SELECT COUNT(*), SUM(amount), AVG(amount) FROM {{ orders }}",
    "reason": "body changed, row_count stable — verify values unchanged",
    "priority": "high"
  }
}

impact_analysis can build the suggested query from lineage metadata — it knows each model's columns, picks numeric columns for SUM()/AVG(), categorical for COUNT(DISTINCT ...). If query_diff shows divergence, the agent escalates to value_diff on that specific model.
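The column-driven query construction could look roughly like this (a sketch under the assumption that column names and types are available from lineage metadata; the function and type names are illustrative):

```python
NUMERIC_TYPES = {"integer", "bigint", "double", "numeric", "float"}

def build_suggested_query(model_name, columns):
    """Build the aggregate query a next_action could suggest.

    `columns` maps column name -> type string. Numeric columns get
    SUM/AVG, other columns COUNT(DISTINCT ...). Illustrative only.
    """
    aggs = ["COUNT(*)"]
    for name, col_type in columns.items():
        if col_type.lower() in NUMERIC_TYPES:
            aggs.append(f"SUM({name})")
            aggs.append(f"AVG({name})")
        else:
            aggs.append(f"COUNT(DISTINCT {name})")
    # {{ model }} is the templated relation reference query_diff resolves
    return f"SELECT {', '.join(aggs)} FROM {{{{ {model_name} }}}}"
```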

Per-model next_action priority logic

Priority is driven by code change type, not data delta direction:

modified model + schema changes           → profile_diff(changed_cols), high
modified model + row_count stable         → query_diff (verify no silent value changes), high
modified model + row_count changed        → query_diff (verify shape), high
downstream table                          → query_diff, medium
downstream view                           → query_diff or skip, low
any model + no PK (if escalation needed)  → profile_diff, medium

Agent interaction flow

Agent → impact_analysis()
       ← "orders: body changed, row_count stable
           next_action: query_diff with suggested query (high)
          customers: downstream, row_count stable
           next_action: query_diff with suggested query (medium)
          daily_view: downstream view
           next_action: skip (low)"

Agent → query_diff(orders, suggested_query)
       ← base: (100, 5000, 50.0)  current: (100, 5230, 52.3)
          "SUM(amount) +4.6%, AVG(amount) +4.6%"

Agent → "orders aggregates diverge — drilling into row-level diff"

Agent → value_diff(orders)
       ← "5 rows changed in amount column"

Agent → "PR impact: orders calculation fix affects 5 rows,
          SUM(amount) increased 4.6%. customers unchanged."

Each agent action is justified by the previous result, not a bulk dump.

What to keep from this PR

  • data_impact field (confirmed/none/potential) — useful after agent runs verification
  • affected_row_count in value_diff response — good field addition
  • ✅ Views included in row_count — useful metadata signal
  • ✅ Behavioral test structure — adapt to new design

What to change

  • Remove value_diff from the default impact_analysis path
  • Remove suggested_deep_dives (replaced by next_action)
  • Fix naming: total_affected_row_count → rename or remove (currently computing max, not total)
  • Fix E2E tests (test_mcp_e2e.py) for field renames
  • Update _guidance to teach the triage → verify → investigate workflow

Suggested PR split

  1. PR A (this PR, slim down): Field renames + data_impact + E2E fixes + remove value_diff from default path
  2. PR B (follow-up): Per-model next_action with suggested query_diff queries, updated guidance, interaction sequence tests

Thoughts?

@iamcxa
Contributor Author

iamcxa commented Mar 31, 2026

Quick Validation: Agent Turn Budget Comparison

Before deciding the architecture direction, here's a concrete side-by-side of how each approach affects the Haiku subagent's workflow for a typical scenario (2 modified + 1 downstream model).

Scenario

  • orders: modified, has PK, row_count stable
  • customers: modified, schema change (new column), row_count stable
  • revenue_report: downstream, row_count stable

Approach A — Current design (value_diff inline)

Call 1: impact_analysis()
        → orders: data_impact="confirmed", 5 rows changed, amount +4.6%
        → customers: data_impact="none", 0 rows changed, email column added
        → revenue_report: data_impact="potential"

Call 2: profile_diff(revenue_report)     ← follow suggested_deep_dives
Call 3: list_checks + run_check
Call 4: create_check (×N)

Agent has 80% of the analysis conclusion after Call 1. The mechanical work (lineage classification, row_count, schema_diff, value_diff) is all done by the tool. Agent only needs to interpret and report.

Approach B — Proposal (next_action + query_diff pyramid)

Call 1: impact_analysis()
        → orders: data_impact="potential", next_action: query_diff(suggested SQL)
        → customers: data_impact="potential", next_action: profile_diff
        → revenue_report: data_impact="potential", next_action: query_diff(suggested SQL)

Call 2: query_diff(orders, suggested SQL)        ← agent follows next_action
Call 3: query_diff(revenue_report, suggested SQL) ← agent follows next_action
Call 4: profile_diff(customers)                   ← agent follows next_action
Call 5: value_diff(orders)                        ← agent drills down on divergence
Call 6: list_checks + run_check
Call 7: create_check (×N)

Agent needs 3-4 extra calls to reach the same conclusion Approach A gives in Call 1.

The key question

The original goal of impact_analysis is to extract mechanical steps from the Haiku subagent — reducing both the prompt complexity and the agent's opportunity to make mistakes (wrong model selection, wrong SQL, wrong interpretation of table_not_found, etc.).

Looking at the current agent.ts prompt (~250 lines), large sections are mechanical instructions that exist because the agent must orchestrate individual tools:

| Prompt section | Lines | Needed if impact_analysis handles it? |
| --- | --- | --- |
| Model Selection for Data Queries | ~25 | ❌ impact_analysis already filters |
| table_not_found handling | ~20 | ❌ impact_analysis handles errors internally |
| Data Verification Before Reporting | ~15 | ❌ impact_analysis returns verified data |
| Workflow steps 1-4 | ~15 | ❌ replaced by single impact_analysis call |

With inline value_diff (Approach A): ~75 lines of mechanical prompts become unnecessary. Haiku just interprets results.

With pyramid (Approach B): Agent still needs query_diff orchestration prompts, SQL template handling, and divergence-based escalation logic — the mechanical prompt burden shifts but doesn't shrink.

Existing escape hatches

skip_value_diff=True and skip_downstream_value_diff=True already handle the "large DAG" performance concern without removing value_diff from the default path.

Suggestion

Keep value_diff inline (maximizes mechanical extraction). Replace suggested_deep_dives with next_action (better UX for remaining agent decisions). This gives the best of both: impact_analysis handles all mechanical work, agent prompt stays minimal.

Running a real A/B test next to confirm with actual data.

@iamcxa
Contributor Author

iamcxa commented Mar 31, 2026

A/B Test Results + Design Direction Analysis

Real A/B Test (DuckDB, 3-model DAG)

Ran impact_analysis against a realistic scenario: customers (modified + schema change), orders (modified, values changed), revenue_report (downstream, unchanged).

| Metric | A (value_diff inline) | B (skip_value_diff) |
| --- | --- | --- |
| Models with definitive answer after 1 call | 2/3 | 0/3 |
| Models needing further investigation | 1/3 | 3/3 |
| Agent knows "confirmed no change" | 1 (revenue_report) | 0 |

Key finding: With inline value_diff, the agent immediately knows orders has 2 rows changed (amount +33%) and revenue_report has zero data changes. Without it, all 3 models are data_impact: "potential" — the agent must make 3+ additional tool calls to reach the same conclusion.

The 1 model that remained potential (customers) was due to a schema change (columns differ between base/current, so the FULL OUTER JOIN can't compare). This is correct behavior — the tool gracefully degrades and suggests profile_diff for further investigation.

Does Approach A achieve the original goal?

The original goal: extract mechanical steps from the Haiku subagent to reduce prompt complexity, improve correctness, and increase technical moat.

What impact_analysis successfully extracts (high-risk mechanical work):

  • dbt selector resolution (state:modified.body+ state:modified.macros+ state:modified.contract+)
  • Lineage DAG classification (modified vs downstream vs unimpacted)
  • Row count diff for all impacted models
  • Schema diff (column add/remove/modify detection)
  • Value diff via FULL OUTER JOIN with PK detection from dbt test metadata
  • Error classification (table_not_found, permission_denied) with per-step resilience
  • Deterministic deep dive suggestions (R1-R4 rules)

This is 383 lines of Python with 78 control flow branches — all of which the Haiku agent would otherwise need to orchestrate via 6-8 sequential tool calls with conditional logic at each step. The data_impact field collapses a three-way reasoning distinction (null vs zero vs not-run) into 3 unambiguous tokens.
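The value-diff join mentioned above follows a standard shape. A minimal sketch of the generated comparison SQL, assuming base/current relation prefixes and a single PK column (all names here are illustrative, not the tool's actual SQL):

```python
def build_value_diff_sql(table, pk, columns):
    """Sketch of a value_diff query: FULL OUTER JOIN base and current
    relations on the primary key and count rows where any compared
    column differs (IS DISTINCT FROM is null-safe inequality)."""
    mismatch = " OR ".join(
        f"b.{c} IS DISTINCT FROM c.{c}" for c in columns
    )
    return (
        f"SELECT COUNT(*) AS affected_row_count\n"
        f"FROM base.{table} b\n"
        f"FULL OUTER JOIN current.{table} c ON b.{pk} = c.{pk}\n"
        f"WHERE {mismatch}"
    )
```

When the schemas differ between base and current, the column list can't be aligned, which is why such a comparison gracefully degrades to data_impact="potential" as described above.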

What remains in agent.ts (~100 lines of mechanical prompt):

| Section | Lines | Risk level | Why it's still in prompt |
| --- | --- | --- | --- |
| Mermaid DAG generation | ~80 | Medium | Pure template-following; summary.py already has generate_mermaid_lineage_graph() |
| create_check orchestration | ~20 | Medium | Parameter construction (type, params, name) per model |

Assessment: ~70% achieved, with clear path to 90%+

| Goal | Current | After next steps |
| --- | --- | --- |
| Reduce mechanical prompts | 70% — high-risk work extracted, low-risk remains | 90%+ |
| Technical moat | 85% — dbt internal APIs, cross-env SQL, PK detection in Python | 95% |
| Simpler plugin | 60% — agent.ts still ~250 lines | 90% (down to ~50-60 lines) |

Recommended next steps (by ROI)

| Priority | Action | Impact | Complexity |
| --- | --- | --- | --- |
| P0 | Expose render_lineage_mermaid as MCP tool (code exists in summary.py) | Eliminates ~95 lines from agent.ts | Low |
| P1 | Return suggested_checks[] with full type+params in impact_analysis | Eliminates create_check parameter construction from agent | Medium |
| P2 | Measure Haiku error rate on Mermaid/create_check via Sentry metrics | Validates whether remaining mechanical work actually causes errors | Low |

After P0+P1, agent.ts shrinks from ~250 lines to ~50-60 lines. The agent's job becomes: call impact_analysis → interpret results → follow suggested_checks → call render_lineage_mermaid → write analysis. All reasoning, minimal mechanics.

Design direction for this PR

  1. Keep value_diff inline — the A/B data confirms it's the right call for the "extract mechanical work" goal
  2. Replace suggested_deep_dives with next_action (per-model, more actionable) — this is the good idea from the original proposal
  3. skip_value_diff + skip_downstream_value_diff remain as escape hatches for large DAGs
  4. P0/P1 as follow-up PRs — don't block this PR on Mermaid extraction

@iamcxa
Contributor Author

iamcxa commented Mar 31, 2026

Updated Roadmap: Stage 1→2→3

After deeper analysis comparing the plugin agent (recce-reviewer in recce-claude-plugin) with the cloud agent (agent.ts in recce-cloud-infra), here's the refined roadmap.

Key finding: Plugin agent already surpasses cloud agent in analytical capability

| Capability | Plugin (recce-reviewer) | Cloud (agent.ts) |
| --- | --- | --- |
| Lineage classification | via impact_analysis | manual lineage_diff + parsing |
| Row count / Schema / Value diff | via impact_analysis (1 call) | 3 separate tool calls |
| data_impact (confirmed/none/potential) | via impact_analysis | agent infers manually |
| Risk assessment | HIGH/MEDIUM/LOW rules | no framework |
| DAG visualization | not available | ~80 lines Mermaid prompt |
| Check persistence | not available | create_check loop |

The plugin agent uses impact_analysis as entry point and gets more analytical signal in 1 call than the cloud agent gets in 6+ calls. The cloud agent only wins on presentation (Mermaid DAG) and persistence (create_check).

Three-stage plan

Stage 1 (DONE): impact_analysis gives the plugin agent analytical parity+

  • Plugin already uses impact_analysis as entry point
  • data_impact + value_diff + risk assessment = analytical capability exceeds cloud agent

Stage 2 (this PR): Upgrade impact_analysis so cloud agent can also benefit

  • suggested_deep_dives → next_action (per-model, with priority)
  • next_action = null when data_impact is confirmed/none (agent only acts on potential)
  • Fix total_affected_row_count naming
  • Update _guidance to teach the triage workflow
  • Keep value_diff inline (A/B validated)
  • Cloud agent.ts prompt reduction: ~250 → ~150 lines

Stage 3 — PR B (follow-up), by priority:

| Priority | Action | Impact | Complexity |
| --- | --- | --- | --- |
| P0 | render_lineage_mermaid MCP tool | Eliminates ~95 lines from agent.ts | Low (~20 lines — summary.py already has full implementation) |
| P1 | suggested_checks[] with full type+params in impact_analysis | Eliminates create_check parameter construction from agent | Medium |
| P2 | Measure Haiku error rate on remaining mechanical tasks via Sentry | Validates remaining gaps | Low |

P0 detail: render_lineage_mermaid

summary.py already contains the complete Mermaid rendering pipeline: Node/Edge/LineageGraph classes, BFS from modified nodes, per-resource-type shapes, [What's Changed] annotations with row_count delta %, schema diffs, check integration, and MAX_MERMAID_TEXT_SIZE truncation. The MCP tool is a thin wrapper (~20 lines) over generate_mermaid_lineage_graph().

This is actually richer than what agent.ts teaches Haiku to produce — the Python version includes check annotations, resource-type shapes, and size-based truncation that the 80-line agent.ts prompt doesn't cover.
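The wrapper shape could be as thin as the following sketch. The stub stands in for summary.py's real generate_mermaid_lineage_graph(); the limit value, input shape, and tool function name are assumptions for illustration:

```python
MAX_MERMAID_TEXT_SIZE = 50_000  # illustrative; the real constant lives in summary.py

def generate_mermaid_lineage_graph(lineage):
    """Stub standing in for summary.py's full implementation."""
    lines = ["graph LR"] + [f"  {src} --> {dst}" for src, dst in lineage]
    return "\n".join(lines)

def render_lineage_mermaid(lineage):
    """Thin MCP-tool-shaped wrapper (sketch): delegate to the existing
    renderer, then apply size-based truncation as summary.py does."""
    text = generate_mermaid_lineage_graph(lineage)
    if len(text) > MAX_MERMAID_TEXT_SIZE:
        text = text[:MAX_MERMAID_TEXT_SIZE] + "\n%% truncated"
    return text
```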

After P0, agent.ts shrinks to ~60 lines — same complexity level as recce-reviewer. The two agents converge to essentially the same workflow:

1. impact_analysis()           → analytical results
2. follow next_action          → profile_diff on potential models
3. render_lineage_mermaid()    → deterministic Mermaid (cloud only)
4. create_check()              → persist evidence (cloud only)
5. write summary

iamcxa and others added 2 commits March 31, 2026 18:03
- Move suggestion logic from top-level array to per-model `next_action` field
- next_action=null for confirmed/none models (no follow-up needed)
- next_action includes tool, columns, reason, priority for potential models
- Priority driven by code change type: modified+schema→high, downstream→medium
- Rename total_affected_row_count → max_affected_row_count (was computing max)
- Remove top-level suggested_deep_dives from response
- Update _guidance to teach next_action workflow
- Update E2E tests for field renames and next_action structure
- Update behavioral tests with next_action assertions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Kent <iamcxa@gmail.com>
…logic

- Confirmed low change ratio → next_action=None
- data_impact='none' → next_action=None
- next_action field completeness (tool, columns, reason, priority)
- Response uses max_affected_row_count, not total_affected or suggested_deep_dives
- Schema change next_action has priority=high
- Downstream view next_action has priority=low

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Kent <iamcxa@gmail.com>
@iamcxa
Contributor Author

iamcxa commented Mar 31, 2026

PR A Implementation Complete

Two commits pushed to implement the design changes discussed above:

Commit 1: refactor(mcp): replace suggested_deep_dives with per-model next_action

Changes to mcp_server.py:

  • Replaced top-level suggested_deep_dives array with per-model next_action field
  • data_impact=confirmed/none → next_action=null (no follow-up needed)
  • data_impact=potential → next_action={tool, columns, reason, priority}
  • Priority logic driven by code change type:
    • modified + schema_changes → high
    • modified + no value_diff (view/no PK) → high
    • downstream table → medium
    • downstream view → low
  • Confirmed models with high change ratio (R1) still get next_action (medium)
  • Renamed total_affected_row_count → max_affected_row_count
  • Updated _guidance to teach next_action workflow
  • Updated tool description to reference next_action instead of suggested_deep_dives
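The rules above can be sketched as one mapping function (illustrative only: the model dict shape is assumed, and the R1 high-change-ratio exception is omitted for brevity):

```python
def next_action_for(model):
    """Sketch of the per-model next_action rules listed above.

    `model` is a dict with data_impact, change_status, materialized,
    and schema_changes keys (shape is an assumption, not the real API).
    """
    if model["data_impact"] in ("confirmed", "none"):
        return None  # data impact verified either way; no follow-up
    # data_impact == "potential": priority driven by code change type
    if model["change_status"] == "modified":
        priority = "high"
        tool = "profile_diff" if model["schema_changes"] else "query_diff"
    elif model["materialized"] == "view":
        priority, tool = "low", "query_diff"   # downstream view
    else:
        priority, tool = "medium", "query_diff"  # downstream table
    return {"tool": tool, "priority": priority}
```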

Changes to test_mcp_e2e.py:

  • Updated all field references (impacted_models → confirmed_impacted_models, etc.)
  • Converted suggested_deep_dives tests to next_action assertions
  • Fixed view row_count test (views now included in row_count per earlier commit)
  • Fixed per-column field name (rows_changed → affected_row_count)

Commit 2: test(mcp): add coverage for next_action, field renames, and priority logic

6 new tests covering gaps identified in coverage analysis:

| Test | What it verifies |
| --- | --- |
| test_confirmed_low_change_ratio_has_null_next_action | confirmed + low change → null |
| test_none_data_impact_has_null_next_action | none (zero changes) → null |
| test_next_action_has_all_required_fields | tool, columns, reason, priority all present |
| test_response_uses_new_field_names | max_affected_row_count exists, old fields absent |
| test_next_action_schema_change (priority assert) | schema change → priority=high |
| test_next_action_downstream_view_low_priority | downstream view → priority=low |

All 67 tests pass (55 E2E + 11 behavioral + 1 registration).
