feat(recce-dev): eval v2 framework + MCP impact_analysis improvements#18

Merged
kentwelcome merged 12 commits into main from feat/plugin-rename-and-recce-dev on Mar 31, 2026

Conversation

@iamcxa
Contributor

@iamcxa iamcxa commented Mar 24, 2026

Summary

Comprehensive eval framework for measuring Recce plugin effectiveness, plus MCP tool improvements driven by eval findings.

Eval Framework (recce-eval skill)

  • Dual-layer scoring: deterministic criteria (model classification, row counts, dashboard impact) + LLM-as-judge for nuanced assessment
  • Two scenarios: R1 (subtotal-tax drift) and R2 (COGS miscalculation) — both require data comparison, not just code reading
  • Three eval modes: Baseline (no tools), Mode A (tool-only), Mode C (real-world with full plugin)
  • Isolation: --bare + --clean-profile for reproducible headless runs via claude -p
  • Infrastructure: run-case.sh, score-deterministic.sh, dual-schema setup (prod/dev for value_diff)
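
The deterministic layer boils down to jq checks over each run's JSON output. A minimal sketch of one such check — the `result.json` shape here is hypothetical; the real checks live in score-deterministic.sh, which uses `has()` to avoid the jq `// false` gotcha on booleans:

```shell
# Hypothetical result shape; the has() vs "// false" distinction is the point.
cat > result.json <<'EOF'
{"all_tests_pass": false, "dashboard_impact": true}
EOF

# `.all_tests_pass // false` cannot tell "absent" apart from a legitimate
# false value, so presence is checked with has() before reading the value.
if jq -e 'has("all_tests_pass")' result.json >/dev/null; then
  echo "all_tests_pass=$(jq -r '.all_tests_pass' result.json)"
else
  echo "all_tests_pass=absent (v1 scenario, check is optional)"
fi
```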

MCP Tool Improvements (companion PR: DataRecce/recce#1258)

  • Root cause found via eval: impact_analysis returns downstream models as "confirmed_impacted" with value_diff: null — agents can't distinguish DAG-reachable from data-affected → false positives
  • Fix: Extend value_diff to downstream table models + add data_impact field (confirmed/none/potential)
  • Guidance rewrite: "DO NOT OVERRIDE" → descriptive text explaining data_impact semantics
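
For illustration, a hypothetical post-fix response entry — only `data_impact`, `value_diff`, and the confirmed/none/potential values come from this PR; the surrounding field names and structure are made up:

```json
{
  "model": "customers",
  "value_diff": { "mismatched_rows": 654502 },
  "data_impact": "confirmed"
}
```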

Eval Results (N=3, jaffle-shop-simulator + DuckDB)

| Mode | R1 Pre-Fix | R1 Post-Fix | R2 Pre-Fix | R2 Post-Fix |
|------|-----------|------------|-----------|------------|
| Baseline | 100% | | 100% | |
| Mode A (tool-only) | 96.3% | 96.3% | 95.2% | 100% |
| Mode C (real-world) | 88.9% | 81.5% | 100% | 100% |

Key finding: Agent tool selection is non-deterministic — impact_analysis was not called in any post-fix run (0/17). MCP fix is defense-in-depth; accuracy depends on agent judgment, not tool quality alone.

Key Eval Insights

  • Tools without evidence are worse than no tools — confirmed_impacted + value_diff: null creates false authority
  • "Trust the tool" prompt injection causes false positives (Mode C 100% → 85.7%)
  • Efficiency gain confirmed: Mode A/C cost 50% less than Baseline ($0.20 vs $0.34/run)

How Scenarios Are Designed (using jaffle-shop-simulator)

Reverse Patch Approach

jaffle-shop-simulator main is always correct code. Scenarios don't maintain bug branches — instead:

  1. Patches are stored in reverse format (buggy → clean diff)
  2. Setup applies git apply --reverse to introduce the bug
  3. Teardown runs git checkout -- to restore clean state
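
The full lifecycle can be sketched end to end (toy repo and file contents, not the real scenario patches):

```shell
set -e
repo=$(mktemp -d); cd "$repo"
git init -q .
git config user.email eval@example.com
git config user.name eval

echo "select subtotal from orders"            > model.sql   # clean (main)
git add model.sql && git commit -qm "clean"

echo "select subtotal - tax_paid from orders" > model.sql   # temporarily buggy
git diff -R > scenario.patch                                # stored buggy -> clean
git checkout -- model.sql                                   # main stays clean

git apply --reverse scenario.patch                          # setup: introduces the bug
grep -q "tax_paid" model.sql                                # bug is now in place

git checkout -- model.sql                                   # teardown: clean again
```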

Dual-Schema Setup (DuckDB)

DuckDB uses a single file. With one schema, Recce compares dev vs dev → always 0 differences. So run-case.sh builds two schemas:

```shell
dbt run --target prod --full-refresh                       # prod schema = clean (base)
dbt docs generate --target-path target-base --target prod
git apply --reverse <patch>                                # code becomes buggy
dbt run --target dev --full-refresh                        # dev schema = buggy (current)
dbt docs generate                                          # target/ = current artifacts
```

Recce MCP server reads target/ (current) and target-base/ (base) manifests, then compares actual data across dev.* vs prod.* tables.

Scenario Design Criteria

A good scenario must satisfy all of these:

| Criterion | Why | R1 Example | R2 Example |
|-----------|-----|-----------|-----------|
| Code review would approve | Tests agent's ability to go beyond code reading | `subtotal - tax_paid` is valid syntax | `CASE WHEN is_food_item` is a standard pattern |
| PR description is misleading | Simulates real-world PRs where intent ≠ effect | "raw data includes tax in subtotal" | "filter to food supply costs for performance" |
| All dbt tests pass | Can't rely on test failures to detect issues | No subtotal constraint test exists | No order_cost completeness test exists |
| Detection requires data comparison | Validates that agent uses data tools, not just code analysis | Must query raw data to verify subtotal is already pre-tax | Must compare aggregated vs correct COGS to find the gap |
| Clear impacted vs not_impacted models | Tests model classification accuracy | customers IS impacted (uses subtotal); order_items is NOT | customers is NOT impacted (doesn't use order_cost) |

How to Create a New Scenario

```shell
cd jaffle-shop-simulator

# 1. Manually introduce a plausible-but-wrong change
vim models/staging/stg_orders.sql

# 2. Generate reverse patch (buggy → clean direction, so that
#    `git apply --reverse` later introduces the bug)
git diff -R > /path/to/scenario-name.patch
git checkout -- models/staging/stg_orders.sql

# 3. Verify ground truth numbers via SQL
dbt run --target prod --full-refresh
git apply --reverse scenario-name.patch
dbt run --target dev --full-refresh

duckdb data/jaffel-shop.duckdb <<SQL
  SELECT COUNT(*) as affected
  FROM dev.orders d JOIN prod.orders p ON d.order_id = p.order_id
  WHERE d.subtotal != p.subtotal;
SQL
# → use this number as affected_row_count in ground_truth

# 4. Write scenario YAML with ground_truth, judge_criteria, teardown
```
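
The resulting scenario file might look like this — every field name below is illustrative, assembled from terms used elsewhere in this PR (ground_truth, affected_row_count, judge_criteria, teardown), not the actual schema:

```yaml
id: scenario-name
patch: patches/scenario-name.patch        # stored buggy -> clean
ground_truth:
  affected_row_count: 654502              # verified via the SQL query in step 3
  dashboard_impact: true
  impacted_models: [orders, customers]
  not_impacted_models: [order_items]
judge_criteria:
  - distinguishes DAG-reachable from data-affected models
teardown: git checkout -- models/
```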

Current Scenarios

| ID | Change | Affected Rows | Key Challenge |
|----|--------|---------------|---------------|
| R1 | stg_orders.subtotal double-deducts tax | 654,502 (99.4%) | Propagation through orders → customers; view-modified model chain |
| R2 | orders.order_cost excludes drink supply costs | 643,875 (98%) | Understanding is_food_item filter semantics; customers NOT impacted |

Changed Files (45 files, +2744/-78)

Eval Infrastructure

  • plugins/recce-dev/skills/recce-eval/ — SKILL.md, scripts, scenarios, patches, prompts
  • plugins/recce-dev/agents/eval-judge.md — LLM-as-judge agent
  • tests/recce-eval/ — Script validation tests

Plugin Improvements

  • plugins/recce/ — stdio MCP transport, Option E hook cleanup, impact_analysis workflow migration

Test plan

🤖 Generated with Claude Code

iamcxa and others added 2 commits March 24, 2026 16:48
- Switch .mcp.json from SSE to stdio transport via run-mcp-stdio.sh
  wrapper (venv auto-detection, no external server to manage)
- Remove IMPACT_RULE from SessionStart hook — guidance now embedded
  in impact_analysis tool description + response (Option E)
- Simplify session-start.sh: remove SSE server lifecycle management,
  replace with MCP_READY readiness check
- Add agent constraint: prohibit Python/curl bypass of MCP tools

Eval validation (ch3-phantom-filter, n=3):
  bare+Option E v2 delta: +2.7 (vs clean-profile +3.0)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Speed up iterative eval by allowing:
- --skip-setup: reuse pre-applied patch + dbt state (no re-setup)
- --skip-teardown: keep state for subsequent parallel runs
- --model: choose model (e.g., sonnet) for faster/cheaper runs

Enables: setup once → parallel claude sessions → teardown once.
Reduces n=3 batch time from ~30-60min to ~10-15min.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@iamcxa iamcxa self-assigned this Mar 24, 2026
@iamcxa iamcxa marked this pull request as ready for review March 24, 2026 08:53
@iamcxa
Contributor Author

iamcxa commented Mar 24, 2026

Eval Results — Full Suite (7 scenarios, Sonnet, bare mode)

Batch ID: 20260324-1656 | Model: Sonnet | Isolation: --bare | Adapter: DuckDB

Results

| Scenario | Ch | Baseline | With-Plugin | Delta | Cost |
|----------|----|----------|-------------|-------|------|
| ch1-healthy-audit | 1 | 0/4 | 2/4 | +2 | $0.74 |
| ch1-null-amounts | 1 | 9/9 | 9/9 | 0 | $0.84 |
| ch2-amount-misscale | 2 | 11/12 | 11/12 | 0 | $0.67 |
| ch2-silent-filter | 2 | 11/12 | 11/12 | 0 | $0.79 |
| ch3-count-distinct | 3 | 12/12 | ❌ patch fail | | $0.41 |
| ch3-join-shift | 3 | 9/12 | 9/12 | 0 | $0.76 |
| ch3-phantom-filter | 3 | 8/12 | 8/12 | 0 | $0.86 |

Total: 13/14 runs successful, $5.07 total cost

Key Findings

  1. Option E tool-embedded guidance works in bare mode — with-plugin runs have access to impact_analysis tool with embedded IMPORTANT call-ordering block + _guidance response metadata.

  2. ch3-count-distinct with-plugin failed — patch apply error (previous baseline's orders_daily_summary.sql teardown left modified state). Not an Option E issue — infrastructure bug in sequential batch runner.

  3. ch1-healthy-audit shows plugin value — baseline 0/4 vs with-plugin 2/4. The plugin's impact_analysis correctly reports no data changes, helping the agent produce a cleaner "no issues" report.

  4. ch3-phantom-filter — baseline 8/12 = with-plugin 8/12 (n=1, high variance scenario). Previous n=3 runs showed +2.7 delta. Single-run comparison is noisy.

Configuration

  • .mcp.json: stdio transport (via run-mcp-stdio.sh wrapper)
  • SessionStart hook: IMPACT_RULE removed, replaced by tool-embedded guidance
  • Agent constraint: Python/curl SSE bypass prohibited
  • Selector default: state:modified.body+ state:modified.macros+ state:modified.contract+

Companion PR

  • recce#1233 (merged) — tool description + selector narrowing in mcp_server.py
  • recce#1241 — node_id_by_name UnboundLocalError fix (from Copilot review)

…a patterns

Agent was hallucinating data quality issues (e.g., "310 completed orders
with $0") because the prompt asked "any data quality issues?" without
defining what constitutes an actual pipeline bug vs inherent data patterns.

Changes:
- Prompt now explicitly asks for PIPELINE BUGS, not general data patterns
- Lists known expected patterns (placed orders, partial payments, etc.)
- Scoring keywords expanded: added "corrupted", "data loss", "regression"
- Judge criteria updated: evaluate bug vs pattern distinction

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@iamcxa
Contributor Author

iamcxa commented Mar 24, 2026

Eval Results v3 — Full Suite with Proper Isolation

Batch ID: 20260324-1800 | Model: Sonnet | Isolation: --bare + full DuckDB reset | Adapter: DuckDB

Previous eval (20260324-1656) was invalid — patch state leaked across scenarios. v3 adds git checkout . && dbt run --full-refresh between every run.
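
The reset amounts to wrapping every run in a state restore. A toy sketch of the batch loop — `run_case` here is a stand-in for run-case.sh, and the dbt rebuild is elided:

```shell
run_case() { echo "running $1"; }              # stand-in for run-case.sh

for scenario in ch2-silent-filter ch3-phantom-filter; do
  git checkout . 2>/dev/null || true           # discard any leaked patch state
  # dbt run --full-refresh                     # rebuild clean schemas (omitted)
  run_case "$scenario"
done
```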

Results

| Scenario | Ch | Baseline | With-Plugin | Delta |
|----------|----|----------|-------------|-------|
| ch1-healthy-audit | 1 | 2/4 | 2/4 | 0 |
| ch1-null-amounts | 1 | 9/9 | 9/9 | 0 |
| ch2-amount-misscale | 2 | 11/12 | 12/12 | +1 |
| ch2-silent-filter | 2 | 11/12 | 11/12 | 0 |
| ch3-count-distinct | 3 | 12/12 | 12/12 | 0 |
| ch3-join-shift | 3 | 9/12 | 9/12 | 0 |
| ch3-phantom-filter | 3 | 8/12 | 11/12 | +3 |
| Total | | 62/73 | 66/73 | +4 |

Total cost: $7.14 | 14 runs | 0 failures

Key Findings

  1. ch3-phantom-filter +3: Plugin's impact_analysis with embedded guidance (Option E) correctly classifies blast radius. Baseline misses downstream models.

  2. ch2-amount-misscale +1: Plugin achieves perfect 12/12 vs baseline 11/12.

  3. ch1-healthy-audit 2/4 (both): Improved from 0/4 after prompt fix (distinguish pipeline bugs from data patterns). Remaining 2 failures likely due to agents still flagging seed data patterns as issues.

  4. ch3-count-distinct: profiles.yml target mismatch during with-plugin setup (dev-local not found), but agent still scored 12/12 by working around it.

Fixes in This Batch

  • Scenario isolation: git checkout . && dbt run --full-refresh between every run (fixes patch leakage from v1/v2)
  • ch1-healthy-audit prompt: Explicitly distinguishes pipeline bugs from inherent data patterns
  • false_positive_keywords: Expanded to include "corrupted", "data loss", "regression"

iamcxa and others added 3 commits March 25, 2026 22:11
Append Recce Review workflow instructions to with-plugin prompts when
MCP tools are available. Simulates real /recce-review usage path:
agent uses impact_analysis as verification layer, not just generic
dbt CLI analysis.

Only injected when --mcp-config is set (scenarios without MCP tools
like ch1-null-amounts are unaffected).

Wording: "After completing your own analysis, use these tools to verify"
— enriches rather than replaces the agent's primary workflow.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
stdio MCP server holds DuckDB write lock during session, blocking
dbt run/test. Scenarios with skip_context=true need the agent to
run dbt itself — skip MCP config and workflow prompt injection for
these scenarios.

Fixes ch1-null-amounts regression (9/9 baseline → 7/9 with-plugin
was caused by DuckDB lock conflict, not plugin quality).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Aligns ch1 with the Recce plugin's actual value proposition — data
impact review, not code fixing. Agent now uses MCP tools to analyze
impact scope instead of running dbt run/test itself.

Changes:
- Remove skip_context: true (setup now injects dbt results)
- Prompt: "review the data impact" instead of "fix the root cause"
- Ground truth: all_tests_pass=false (COALESCE removal causes failures)
- Remove fix_applied from expected output

Note: ch1's bug causes test failures, which is a weaker test of Recce
value compared to ch2/ch3 where all tests pass (silent bugs). Consider
redesigning ch1 for a "tests pass but data is wrong" scenario.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@iamcxa
Contributor Author

iamcxa commented Mar 26, 2026

Eval Results v6 — MCP Field Naming + Workflow Prompt + View Fix

Batch ID: 20260325-v6 | Model: Sonnet | Isolation: --bare + full DuckDB reset | Adapter: DuckDB

Results

| Scenario | Ch | Baseline | With-Plugin | Delta |
|----------|----|----------|-------------|-------|
| ch1-null-amounts | 1 | 9/9 | 8/9 | -1 |
| ch2-amount-misscale | 2 | 11/12 | 11/12 | 0 |
| ch2-silent-filter | 2 | 11/12 | 11/12 | 0 |
| ch3-count-distinct | 3 | 12/12 | 12/12 | 0 |
| ch3-join-shift | 3 | 9/12 | 9/12 | 0 |
| ch3-phantom-filter | 3 | 7/12 | 12/12 | +5 |
| Total | | 59/66 | 63/66 | +4 |

Total cost: $5.93 | 14 runs (incl ch1-healthy-audit) | 0 failures

Headline: ch3-phantom-filter +5 delta

Baseline (7/12) failed on:

  • Missing 4 downstream models (orders, orders_daily_summary, customer_segments, customer_order_pattern)
  • Wrong affected_row_count (2838 vs 2326)

Plugin (12/12) succeeded because:

  • confirmed_impacted_models copied directly from DAG classification — no override
  • total_affected_row_count = 2326 (from view row_count delta) — copied directly

This is the plugin's core value: baseline can't track DAG propagation accurately; plugin's lineage-based classification is authoritative.

MCP Optimizations in This Run

  1. confirmed_impacted_models — "confirmed" prefix signals authority, agent copies instead of overriding
  2. total_affected_row_count — pre-computed aggregate, agent uses directly
  3. View row_count fix — views now included in row_count_diff (stg_payments -2326 visible)
  4. Workflow prompt — "After completing your analysis, use MCP tools to verify" (with-plugin only)
  5. Narrow selector — state:modified.body+ state:modified.macros+ state:modified.contract+

Evolution

| Version | ch3-phantom-filter Plugin Score | Total Delta |
|---------|--------------------------------|-------------|
| v1 (no guidance) | 8/12 | +0.4 |
| v2 (Option E) | 10-11/12 | +2.7 |
| v3 (confirmed_ fields) | 11/12 | +3 |
| v6 (all optimizations) | 12/12 | +5 |

Known Issues

  • ch1-null-amounts: with-plugin 8/9 (-1) — affected_row_count: 0 because NULL→value change not detected by PK Join value_diff. MCP server issue, not field naming.
  • ch1-healthy-audit: scorer doesn't output PASS_COUNT for no_problem case type — needs scorer fix.
  • ch3-join-shift: 9/12 both — agent misses downstream models even with plugin. May need stronger prompt or n>1 for statistical significance.

@iamcxa iamcxa changed the title feat(recce): stdio MCP transport + Option E eval improvements feat(recce-dev): eval v2 framework + MCP impact_analysis improvements Mar 27, 2026
iamcxa and others added 6 commits March 30, 2026 23:55
Phase 1 of eval v2 redesign (jaffle-shop-simulator environment):
- eval-config.yaml: pins to commit 021e2d3, DuckDB adapter, dashboard columns
- prompts/review.md: shared prompt template with stakeholder context + dashboard awareness
- render-prompt.py: template variable substitution from scenario YAML + CLI overrides
- R1 (tax-calculation-drift): NaN tax_rate from div-by-zero on 4,155 subtotal=0 orders
- R2 (cogs-miscalculation): order_cost misses drink supply costs, 643,875 orders affected

Both scenarios follow the "code review would approve" design principle —
bugs are only detectable through data-level comparison.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…chema comparison

Score-deterministic: add dashboard_impact check (v2), make all_tests_pass
optional (v1 compat), use has() to avoid jq // false gotcha on booleans.

Run-case: build base state in prod schema before applying patch so Recce
value_diff compares dev (buggy) vs prod (clean) instead of dev vs dev.

R1 scenario: changed from adding new tax_rate column to modifying existing
subtotal column (subtotal - tax_paid). Existing column means value_diff
detects real data changes. 654,502 affected rows, dashboard_impact=true.

R2 scenario: removed gross_profit column addition, kept only is_food_item
filter change to order_cost. Single-hunk patch modifying existing column.

Both patches stored as buggy→clean (v1 convention) for git apply --reverse.
Reduced not_impacted_models to ambiguous cases only (2-3 models vs 10+).

Validated: with-plugin now gets exact affected_row_count via MCP value_diff
(was 0 before dual-schema fix). Baseline scores 100% due to unrestricted
DuckDB SQL access — structural limitation, not a scenario design issue.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Mode A (tool-only): with-plugin gets MCP tools only — no plugin-dir,
no hooks/skills/agents, no workflow prompt. Tests raw tool value.

Mode C (real-world): with-plugin gets full plugin experience including
plugin-dir, hooks, and reviewer workflow prompt (default, backwards
compatible).

Baseline variant ignores mode — always runs vanilla Claude Code.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds v2 deterministic checks (dashboard_impact, ±20% tolerance) and
v2 LLM judge dimensions (Evidence Quality 0-3, Self-Verification 0-3).
Preserves v1 section for backwards compatibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Mode C (real-world) should rely on plugin hooks/skills to naturally
guide the agent, not inject a [Recce Review Context] prompt that
overrides the agent's judgment with "Trust MCP over code-reading".

The injected prompt was causing false positives — agent over-included
downstream models based on DAG classification instead of actual data
impact. Without it, R2 Mode C improved from 85.7% to 100%.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…scenario

- Commission Spacedock workflow for eval scenario management
  (draft → ground-truth → eval-run → validated → integrated)
- Add 3 seed scenarios from jaffle-shop-simulator issues #8, #2, #6
- Complete R8 scenario: stg_orders filters on subtotal instead of
  order_total — spec deviation bug (difficulty: hard)
- Add assignee field to schema for multi-person coordination

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@iamcxa iamcxa requested a review from kentwelcome March 30, 2026 23:46
Member

@kentwelcome kentwelcome left a comment


LGTM.
By the way, please add a link to a Notion page describing how to add a new v2 eval scenario.

@kentwelcome kentwelcome merged commit 31721c0 into main Mar 31, 2026
@kentwelcome kentwelcome deleted the feat/plugin-rename-and-recce-dev branch March 31, 2026 04:00