feat(recce-dev): eval v2 framework + MCP impact_analysis improvements#18

Merged
kentwelcome merged 12 commits into main from feat/plugin-rename-and-recce-dev on Mar 31, 2026

Conversation

@iamcxa
Contributor

@iamcxa iamcxa commented Mar 24, 2026

Summary

Comprehensive eval framework for measuring Recce plugin effectiveness, plus MCP tool improvements driven by eval findings.

Eval Framework (recce-eval skill)

  • Dual-layer scoring: deterministic criteria (model classification, row counts, dashboard impact) + LLM-as-judge for nuanced assessment
  • Two scenarios: R1 (subtotal-tax drift) and R2 (COGS miscalculation) — both require data comparison, not just code reading
  • Three eval modes: Baseline (no tools), Mode A (tool-only), Mode C (real-world with full plugin)
  • Isolation: --bare + --clean-profile for reproducible headless runs via claude -p
  • Infrastructure: run-case.sh, score-deterministic.sh, dual-schema setup (prod/dev for value_diff)
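
The deterministic layer boils down to jq checks over each run's JSON output. A minimal sketch of one such check — the `result.json` shape here is hypothetical; the real checks live in score-deterministic.sh, which uses `has()` to avoid the jq `// false` gotcha on booleans:

```shell
# Hypothetical result shape; the has() vs "// false" distinction is the point.
cat > result.json <<'EOF'
{"all_tests_pass": false, "dashboard_impact": true}
EOF

# `.all_tests_pass // false` cannot tell "absent" apart from a legitimate
# false value, so presence is checked with has() before reading the value.
if jq -e 'has("all_tests_pass")' result.json >/dev/null; then
  echo "all_tests_pass=$(jq -r '.all_tests_pass' result.json)"
else
  echo "all_tests_pass=absent (v1 scenario, check is optional)"
fi
```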

MCP Tool Improvements (companion PR: DataRecce/recce#1258)

  • Root cause found via eval: impact_analysis returns downstream models as "confirmed_impacted" with value_diff: null — agents can't distinguish DAG-reachable from data-affected → false positives
  • Fix: Extend value_diff to downstream table models + add data_impact field (confirmed/none/potential)
  • Guidance rewrite: "DO NOT OVERRIDE" → descriptive text explaining data_impact semantics
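
For illustration, a hypothetical post-fix response entry — only `data_impact`, `value_diff`, and the confirmed/none/potential values come from this PR; the surrounding field names and structure are made up:

```json
{
  "model": "customers",
  "value_diff": { "mismatched_rows": 654502 },
  "data_impact": "confirmed"
}
```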

Eval Results (N=3, jaffle-shop-simulator + DuckDB)

| Mode | R1 Pre-Fix | R1 Post-Fix | R2 Pre-Fix | R2 Post-Fix |
|------|-----------|------------|-----------|------------|
| Baseline | 100% | | 100% | |
| Mode A (tool-only) | 96.3% | 96.3% | 95.2% | 100% |
| Mode C (real-world) | 88.9% | 81.5% | 100% | 100% |

Key finding: Agent tool selection is non-deterministic — impact_analysis was not called in any post-fix run (0/17). MCP fix is defense-in-depth; accuracy depends on agent judgment, not tool quality alone.

Key Eval Insights

  • Tools without evidence are worse than no tools — confirmed_impacted + value_diff: null creates false authority
  • "Trust the tool" prompt injection causes false positives (Mode C 100% → 85.7%)
  • Efficiency gain confirmed: Mode A/C cost 50% less than Baseline ($0.20 vs $0.34/run)

How Scenarios Are Designed (using jaffle-shop-simulator)

Reverse Patch Approach

jaffle-shop-simulator main is always correct code. Scenarios don't maintain bug branches — instead:

  1. Patches are stored in reverse format (buggy → clean diff)
  2. Setup applies git apply --reverse to introduce the bug
  3. Teardown runs git checkout -- to restore clean state
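
The full lifecycle can be sketched end to end (toy repo and file contents, not the real scenario patches):

```shell
set -e
repo=$(mktemp -d); cd "$repo"
git init -q .
git config user.email eval@example.com
git config user.name eval

echo "select subtotal from orders"            > model.sql   # clean (main)
git add model.sql && git commit -qm "clean"

echo "select subtotal - tax_paid from orders" > model.sql   # temporarily buggy
git diff -R > scenario.patch                                # stored buggy -> clean
git checkout -- model.sql                                   # main stays clean

git apply --reverse scenario.patch                          # setup: introduces the bug
grep -q "tax_paid" model.sql                                # bug is now in place

git checkout -- model.sql                                   # teardown: clean again
```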

Dual-Schema Setup (DuckDB)

DuckDB uses a single file. With one schema, Recce compares dev vs dev → always 0 differences. So run-case.sh builds two schemas:

```shell
dbt run --target prod --full-refresh                       # prod schema = clean (base)
dbt docs generate --target-path target-base --target prod
git apply --reverse <patch>                                # code becomes buggy
dbt run --target dev --full-refresh                        # dev schema = buggy (current)
dbt docs generate                                          # target/ = current artifacts
```

Recce MCP server reads target/ (current) and target-base/ (base) manifests, then compares actual data across dev.* vs prod.* tables.

Scenario Design Criteria

A good scenario must satisfy all of these:

| Criterion | Why | R1 Example | R2 Example |
|-----------|-----|-----------|-----------|
| Code review would approve | Tests agent's ability to go beyond code reading | `subtotal - tax_paid` is valid syntax | `CASE WHEN is_food_item` is a standard pattern |
| PR description is misleading | Simulates real-world PRs where intent ≠ effect | "raw data includes tax in subtotal" | "filter to food supply costs for performance" |
| All dbt tests pass | Can't rely on test failures to detect issues | No subtotal constraint test exists | No order_cost completeness test exists |
| Detection requires data comparison | Validates that agent uses data tools, not just code analysis | Must query raw data to verify subtotal is already pre-tax | Must compare aggregated vs correct COGS to find the gap |
| Clear impacted vs not_impacted models | Tests model classification accuracy | customers IS impacted (uses subtotal); order_items is NOT | customers is NOT impacted (doesn't use order_cost) |

How to Create a New Scenario

```shell
cd jaffle-shop-simulator

# 1. Manually introduce a plausible-but-wrong change
vim models/staging/stg_orders.sql

# 2. Generate reverse patch (buggy → clean direction, so that
#    `git apply --reverse` later introduces the bug)
git diff -R > /path/to/scenario-name.patch
git checkout -- models/staging/stg_orders.sql

# 3. Verify ground truth numbers via SQL
dbt run --target prod --full-refresh
git apply --reverse scenario-name.patch
dbt run --target dev --full-refresh

duckdb data/jaffel-shop.duckdb <<SQL
  SELECT COUNT(*) as affected
  FROM dev.orders d JOIN prod.orders p ON d.order_id = p.order_id
  WHERE d.subtotal != p.subtotal;
SQL
# → use this number as affected_row_count in ground_truth

# 4. Write scenario YAML with ground_truth, judge_criteria, teardown
```
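
The resulting scenario file might look like this — every field name below is illustrative, assembled from terms used elsewhere in this PR (ground_truth, affected_row_count, judge_criteria, teardown), not the actual schema:

```yaml
id: scenario-name
patch: patches/scenario-name.patch        # stored buggy -> clean
ground_truth:
  affected_row_count: 654502              # verified via the SQL query in step 3
  dashboard_impact: true
  impacted_models: [orders, customers]
  not_impacted_models: [order_items]
judge_criteria:
  - distinguishes DAG-reachable from data-affected models
teardown: git checkout -- models/
```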

Current Scenarios

| ID | Change | Affected Rows | Key Challenge |
|----|--------|---------------|---------------|
| R1 | stg_orders.subtotal double-deducts tax | 654,502 (99.4%) | Propagation through orders → customers; view-modified model chain |
| R2 | orders.order_cost excludes drink supply costs | 643,875 (98%) | Understanding is_food_item filter semantics; customers NOT impacted |

Changed Files (45 files, +2744/-78)

Eval Infrastructure

  • plugins/recce-dev/skills/recce-eval/ — SKILL.md, scripts, scenarios, patches, prompts
  • plugins/recce-dev/agents/eval-judge.md — LLM-as-judge agent
  • tests/recce-eval/ — Script validation tests

Plugin Improvements

  • plugins/recce/ — stdio MCP transport, Option E hook cleanup, impact_analysis workflow migration

Test plan

🤖 Generated with Claude Code

iamcxa and others added 2 commits March 24, 2026 16:48
- Switch .mcp.json from SSE to stdio transport via run-mcp-stdio.sh
  wrapper (venv auto-detection, no external server to manage)
- Remove IMPACT_RULE from SessionStart hook — guidance now embedded
  in impact_analysis tool description + response (Option E)
- Simplify session-start.sh: remove SSE server lifecycle management,
  replace with MCP_READY readiness check
- Add agent constraint: prohibit Python/curl bypass of MCP tools

Eval validation (ch3-phantom-filter, n=3):
  bare+Option E v2 delta: +2.7 (vs clean-profile +3.0)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Speed up iterative eval by allowing:
- --skip-setup: reuse pre-applied patch + dbt state (no re-setup)
- --skip-teardown: keep state for subsequent parallel runs
- --model: choose model (e.g., sonnet) for faster/cheaper runs

Enables: setup once → parallel claude sessions → teardown once.
Reduces n=3 batch time from ~30-60min to ~10-15min.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@iamcxa iamcxa self-assigned this Mar 24, 2026
@iamcxa iamcxa marked this pull request as ready for review March 24, 2026 08:53
@iamcxa
Contributor Author

iamcxa commented Mar 24, 2026

Eval Results — Full Suite (7 scenarios, Sonnet, bare mode)

Batch ID: 20260324-1656 | Model: Sonnet | Isolation: --bare | Adapter: DuckDB

Results

| Scenario | Ch | Baseline | With-Plugin | Delta | Cost |
|----------|----|----------|-------------|-------|------|
| ch1-healthy-audit | 1 | 0/4 | 2/4 | +2 | $0.74 |
| ch1-null-amounts | 1 | 9/9 | 9/9 | 0 | $0.84 |
| ch2-amount-misscale | 2 | 11/12 | 11/12 | 0 | $0.67 |
| ch2-silent-filter | 2 | 11/12 | 11/12 | 0 | $0.79 |
| ch3-count-distinct | 3 | 12/12 | ❌ patch fail | | $0.41 |
| ch3-join-shift | 3 | 9/12 | 9/12 | 0 | $0.76 |
| ch3-phantom-filter | 3 | 8/12 | 8/12 | 0 | $0.86 |

Total: 13/14 runs successful, $5.07 total cost

Key Findings

  1. Option E tool-embedded guidance works in bare mode — with-plugin runs have access to impact_analysis tool with embedded IMPORTANT call-ordering block + _guidance response metadata.

  2. ch3-count-distinct with-plugin failed — patch apply error (previous baseline's orders_daily_summary.sql teardown left modified state). Not an Option E issue — infrastructure bug in sequential batch runner.

  3. ch1-healthy-audit shows plugin value — baseline 0/4 vs with-plugin 2/4. The plugin's impact_analysis correctly reports no data changes, helping the agent produce a cleaner "no issues" report.

  4. ch3-phantom-filter — baseline 8/12 = with-plugin 8/12 (n=1, high variance scenario). Previous n=3 runs showed +2.7 delta. Single-run comparison is noisy.

Configuration

  • .mcp.json: stdio transport (via run-mcp-stdio.sh wrapper)
  • SessionStart hook: IMPACT_RULE removed, replaced by tool-embedded guidance
  • Agent constraint: Python/curl SSE bypass prohibited
  • Selector default: state:modified.body+ state:modified.macros+ state:modified.contract+

Companion PR

  • recce#1233 (merged) — tool description + selector narrowing in mcp_server.py
  • recce#1241 — node_id_by_name UnboundLocalError fix (from Copilot review)

…a patterns

Agent was hallucinating data quality issues (e.g., "310 completed orders
with $0") because the prompt asked "any data quality issues?" without
defining what constitutes an actual pipeline bug vs inherent data patterns.

Changes:
- Prompt now explicitly asks for PIPELINE BUGS, not general data patterns
- Lists known expected patterns (placed orders, partial payments, etc.)
- Scoring keywords expanded: added "corrupted", "data loss", "regression"
- Judge criteria updated: evaluate bug vs pattern distinction

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@iamcxa
Contributor Author

iamcxa commented Mar 24, 2026

Eval Results v3 — Full Suite with Proper Isolation

Batch ID: 20260324-1800 | Model: Sonnet | Isolation: --bare + full DuckDB reset | Adapter: DuckDB

Previous eval (20260324-1656) was invalid — patch state leaked across scenarios. v3 adds git checkout . && dbt run --full-refresh between every run.
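
The reset amounts to wrapping every run in a state restore. A toy sketch of the batch loop — `run_case` here is a stand-in for run-case.sh, and the dbt rebuild is elided:

```shell
run_case() { echo "running $1"; }              # stand-in for run-case.sh

for scenario in ch2-silent-filter ch3-phantom-filter; do
  git checkout . 2>/dev/null || true           # discard any leaked patch state
  # dbt run --full-refresh                     # rebuild clean schemas (omitted)
  run_case "$scenario"
done
```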

Results

| Scenario | Ch | Baseline | With-Plugin | Delta |
|----------|----|----------|-------------|-------|
| ch1-healthy-audit | 1 | 2/4 | 2/4 | 0 |
| ch1-null-amounts | 1 | 9/9 | 9/9 | 0 |
| ch2-amount-misscale | 2 | 11/12 | 12/12 | +1 |
| ch2-silent-filter | 2 | 11/12 | 11/12 | 0 |
| ch3-count-distinct | 3 | 12/12 | 12/12 | 0 |
| ch3-join-shift | 3 | 9/12 | 9/12 | 0 |
| ch3-phantom-filter | 3 | 8/12 | 11/12 | +3 |
| Total | | 62/73 | 66/73 | +4 |

Total cost: $7.14 | 14 runs | 0 failures

Key Findings

  1. ch3-phantom-filter +3: Plugin's impact_analysis with embedded guidance (Option E) correctly classifies blast radius. Baseline misses downstream models.

  2. ch2-amount-misscale +1: Plugin achieves perfect 12/12 vs baseline 11/12.

  3. ch1-healthy-audit 2/4 (both): Improved from 0/4 after prompt fix (distinguish pipeline bugs from data patterns). Remaining 2 failures likely due to agents still flagging seed data patterns as issues.

  4. ch3-count-distinct: profiles.yml target mismatch during with-plugin setup (dev-local not found), but agent still scored 12/12 by working around it.

Fixes in This Batch

  • Scenario isolation: git checkout . && dbt run --full-refresh between every run (fixes patch leakage from v1/v2)
  • ch1-healthy-audit prompt: Explicitly distinguishes pipeline bugs from inherent data patterns
  • false_positive_keywords: Expanded to include "corrupted", "data loss", "regression"

iamcxa and others added 3 commits March 25, 2026 22:11
Append Recce Review workflow instructions to with-plugin prompts when
MCP tools are available. Simulates real /recce-review usage path:
agent uses impact_analysis as verification layer, not just generic
dbt CLI analysis.

Only injected when --mcp-config is set (scenarios without MCP tools
like ch1-null-amounts are unaffected).

Wording: "After completing your own analysis, use these tools to verify"
— enriches rather than replaces the agent's primary workflow.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
stdio MCP server holds DuckDB write lock during session, blocking
dbt run/test. Scenarios with skip_context=true need the agent to
run dbt itself — skip MCP config and workflow prompt injection for
these scenarios.

Fixes ch1-null-amounts regression (9/9 baseline → 7/9 with-plugin
was caused by DuckDB lock conflict, not plugin quality).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Aligns ch1 with the Recce plugin's actual value proposition — data
impact review, not code fixing. Agent now uses MCP tools to analyze
impact scope instead of running dbt run/test itself.

Changes:
- Remove skip_context: true (setup now injects dbt results)
- Prompt: "review the data impact" instead of "fix the root cause"
- Ground truth: all_tests_pass=false (COALESCE removal causes failures)
- Remove fix_applied from expected output

Note: ch1's bug causes test failures, which is a weaker test of Recce
value compared to ch2/ch3 where all tests pass (silent bugs). Consider
redesigning ch1 for a "tests pass but data is wrong" scenario.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@iamcxa
Contributor Author

iamcxa commented Mar 26, 2026

Eval Results v6 — MCP Field Naming + Workflow Prompt + View Fix

Batch ID: 20260325-v6 | Model: Sonnet | Isolation: --bare + full DuckDB reset | Adapter: DuckDB

Results

| Scenario | Ch | Baseline | With-Plugin | Delta |
|----------|----|----------|-------------|-------|
| ch1-null-amounts | 1 | 9/9 | 8/9 | -1 |
| ch2-amount-misscale | 2 | 11/12 | 11/12 | 0 |
| ch2-silent-filter | 2 | 11/12 | 11/12 | 0 |
| ch3-count-distinct | 3 | 12/12 | 12/12 | 0 |
| ch3-join-shift | 3 | 9/12 | 9/12 | 0 |
| ch3-phantom-filter | 3 | 7/12 | 12/12 | +5 |
| Total | | 59/66 | 63/66 | +4 |

Total cost: $5.93 | 14 runs (incl ch1-healthy-audit) | 0 failures

Headline: ch3-phantom-filter +5 delta

Baseline (7/12) failed on:

  • Missing 4 downstream models (orders, orders_daily_summary, customer_segments, customer_order_pattern)
  • Wrong affected_row_count (2838 vs 2326)

Plugin (12/12) succeeded because:

  • confirmed_impacted_models copied directly from DAG classification — no override
  • total_affected_row_count = 2326 (from view row_count delta) — copied directly

This is the plugin's core value: baseline can't track DAG propagation accurately; plugin's lineage-based classification is authoritative.

MCP Optimizations in This Run

  1. confirmed_impacted_models — "confirmed" prefix signals authority, agent copies instead of overriding
  2. total_affected_row_count — pre-computed aggregate, agent uses directly
  3. View row_count fix — views now included in row_count_diff (stg_payments -2326 visible)
  4. Workflow prompt — "After completing your analysis, use MCP tools to verify" (with-plugin only)
  5. Narrow selector — state:modified.body+ state:modified.macros+ state:modified.contract+

Evolution

| Version | ch3-phantom-filter Plugin Score | Total Delta |
|---------|--------------------------------|-------------|
| v1 (no guidance) | 8/12 | +0.4 |
| v2 (Option E) | 10-11/12 | +2.7 |
| v3 (confirmed_ fields) | 11/12 | +3 |
| v6 (all optimizations) | 12/12 | +5 |

Known Issues

  • ch1-null-amounts: with-plugin 8/9 (-1) — affected_row_count: 0 because NULL→value change not detected by PK Join value_diff. MCP server issue, not field naming.
  • ch1-healthy-audit: scorer doesn't output PASS_COUNT for no_problem case type — needs scorer fix.
  • ch3-join-shift: 9/12 both — agent misses downstream models even with plugin. May need stronger prompt or n>1 for statistical significance.

@iamcxa iamcxa changed the title feat(recce): stdio MCP transport + Option E eval improvements feat(recce-dev): eval v2 framework + MCP impact_analysis improvements Mar 27, 2026
iamcxa and others added 6 commits March 30, 2026 23:55
Phase 1 of eval v2 redesign (jaffle-shop-simulator environment):
- eval-config.yaml: pins to commit 021e2d3, DuckDB adapter, dashboard columns
- prompts/review.md: shared prompt template with stakeholder context + dashboard awareness
- render-prompt.py: template variable substitution from scenario YAML + CLI overrides
- R1 (tax-calculation-drift): NaN tax_rate from div-by-zero on 4,155 subtotal=0 orders
- R2 (cogs-miscalculation): order_cost misses drink supply costs, 643,875 orders affected

Both scenarios follow the "code review would approve" design principle —
bugs are only detectable through data-level comparison.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…chema comparison

Score-deterministic: add dashboard_impact check (v2), make all_tests_pass
optional (v1 compat), use has() to avoid jq // false gotcha on booleans.

Run-case: build base state in prod schema before applying patch so Recce
value_diff compares dev (buggy) vs prod (clean) instead of dev vs dev.

R1 scenario: changed from adding new tax_rate column to modifying existing
subtotal column (subtotal - tax_paid). Existing column means value_diff
detects real data changes. 654,502 affected rows, dashboard_impact=true.

R2 scenario: removed gross_profit column addition, kept only is_food_item
filter change to order_cost. Single-hunk patch modifying existing column.

Both patches stored as buggy→clean (v1 convention) for git apply --reverse.
Reduced not_impacted_models to ambiguous cases only (2-3 models vs 10+).

Validated: with-plugin now gets exact affected_row_count via MCP value_diff
(was 0 before dual-schema fix). Baseline scores 100% due to unrestricted
DuckDB SQL access — structural limitation, not a scenario design issue.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Mode A (tool-only): with-plugin gets MCP tools only — no plugin-dir,
no hooks/skills/agents, no workflow prompt. Tests raw tool value.

Mode C (real-world): with-plugin gets full plugin experience including
plugin-dir, hooks, and reviewer workflow prompt (default, backwards
compatible).

Baseline variant ignores mode — always runs vanilla Claude Code.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds v2 deterministic checks (dashboard_impact, ±20% tolerance) and
v2 LLM judge dimensions (Evidence Quality 0-3, Self-Verification 0-3).
Preserves v1 section for backwards compatibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Mode C (real-world) should rely on plugin hooks/skills to naturally
guide the agent, not inject a [Recce Review Context] prompt that
overrides the agent's judgment with "Trust MCP over code-reading".

The injected prompt was causing false positives — agent over-included
downstream models based on DAG classification instead of actual data
impact. Without it, R2 Mode C improved from 85.7% to 100%.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…scenario

- Commission Spacedock workflow for eval scenario management
  (draft → ground-truth → eval-run → validated → integrated)
- Add 3 seed scenarios from jaffle-shop-simulator issues #8, #2, #6
- Complete R8 scenario: stg_orders filters on subtotal instead of
  order_total — spec deviation bug (difficulty: hard)
- Add assignee field to schema for multi-person coordination

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@iamcxa iamcxa requested a review from kentwelcome March 30, 2026 23:46
Member

@kentwelcome kentwelcome left a comment


LGTM.
By the way, please add a link to a Notion page describing how to add a new v2 eval scenario.

@kentwelcome kentwelcome merged commit 31721c0 into main Mar 31, 2026
@kentwelcome kentwelcome deleted the feat/plugin-rename-and-recce-dev branch March 31, 2026 04:00