feat(recce-dev): eval v2 framework + MCP impact_analysis improvements #18
kentwelcome merged 12 commits into main
Conversation
- Switch .mcp.json from SSE to stdio transport via run-mcp-stdio.sh wrapper (venv auto-detection, no external server to manage)
- Remove IMPACT_RULE from SessionStart hook — guidance now embedded in impact_analysis tool description + response (Option E)
- Simplify session-start.sh: remove SSE server lifecycle management, replace with MCP_READY readiness check
- Add agent constraint: prohibit Python/curl bypass of MCP tools

Eval validation (ch3-phantom-filter, n=3): bare + Option E v2 delta: +2.7 (vs clean-profile +3.0)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Speed up iterative eval by allowing:
- --skip-setup: reuse pre-applied patch + dbt state (no re-setup)
- --skip-teardown: keep state for subsequent parallel runs
- --model: choose model (e.g., sonnet) for faster/cheaper runs

Enables: setup once → parallel claude sessions → teardown once. Reduces n=3 batch time from ~30-60 min to ~10-15 min.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Eval Results — Full Suite (7 scenarios, Sonnet, bare mode)

Batch ID:

Results
Total: 13/14 runs successful, $5.07 total cost

Key Findings
Configuration
Companion PR
…a patterns

Agent was hallucinating data quality issues (e.g., "310 completed orders with $0") because the prompt asked "any data quality issues?" without defining what constitutes an actual pipeline bug vs inherent data patterns.

Changes:
- Prompt now explicitly asks for PIPELINE BUGS, not general data patterns
- Lists known expected patterns (placed orders, partial payments, etc.)
- Scoring keywords expanded: added "corrupted", "data loss", "regression"
- Judge criteria updated: evaluate bug vs pattern distinction

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Eval Results v3 — Full Suite with Proper Isolation

Batch ID:
Results
Total cost: $7.14 | 14 runs | 0 failures

Key Findings
Fixes in This Batch
Append Recce Review workflow instructions to with-plugin prompts when MCP tools are available. Simulates the real /recce-review usage path: the agent uses impact_analysis as a verification layer, not just generic dbt CLI analysis.

Only injected when --mcp-config is set (scenarios without MCP tools, like ch1-null-amounts, are unaffected).

Wording: "After completing your own analysis, use these tools to verify" — this enriches rather than replaces the agent's primary workflow.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The stdio MCP server holds the DuckDB write lock during the session, blocking dbt run/test. Scenarios with skip_context=true need the agent to run dbt itself, so skip MCP config and workflow prompt injection for these scenarios.

Fixes the ch1-null-amounts regression (9/9 baseline → 7/9 with-plugin was caused by the DuckDB lock conflict, not plugin quality).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Aligns ch1 with the Recce plugin's actual value proposition — data impact review, not code fixing. The agent now uses MCP tools to analyze impact scope instead of running dbt run/test itself.

Changes:
- Remove skip_context: true (setup now injects dbt results)
- Prompt: "review the data impact" instead of "fix the root cause"
- Ground truth: all_tests_pass=false (COALESCE removal causes failures)
- Remove fix_applied from expected output

Note: ch1's bug causes test failures, which is a weaker test of Recce value compared to ch2/ch3 where all tests pass (silent bugs). Consider redesigning ch1 as a "tests pass but data is wrong" scenario.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Eval Results v6 — MCP Field Naming + Workflow Prompt + View Fix

Batch ID:

Results
Total cost: $5.93 | 14 runs (incl. ch1-healthy-audit) | 0 failures

Headline: ch3-phantom-filter +5 delta

Baseline (7/12) failed on:
Plugin (12/12) succeeded because:
This is the plugin's core value: the baseline can't track DAG propagation accurately, while the plugin's lineage-based classification is authoritative.

MCP Optimizations in This Run
Evolution
Known Issues
Phase 1 of eval v2 redesign (jaffle-shop-simulator environment): - eval-config.yaml: pins to commit 021e2d3, DuckDB adapter, dashboard columns - prompts/review.md: shared prompt template with stakeholder context + dashboard awareness - render-prompt.py: template variable substitution from scenario YAML + CLI overrides - R1 (tax-calculation-drift): NaN tax_rate from div-by-zero on 4,155 subtotal=0 orders - R2 (cogs-miscalculation): order_cost misses drink supply costs, 643,875 orders affected Both scenarios follow the "code review would approve" design principle — bugs are only detectable through data-level comparison. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
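render-prompt.py's template substitution can be pictured with a minimal sed stand-in — the template text and variable names below are illustrative, not the real prompts/review.md:

```shell
# Minimal stand-in for render-prompt.py's {{var}} substitution
# (variable names and template text are hypothetical).
cat > review.tmpl <<'EOF'
You are reviewing {{model}} in scenario {{scenario}} before the dashboard refresh.
EOF
sed -e 's/{{model}}/stg_orders/' \
    -e 's/{{scenario}}/tax-calculation-drift/' review.tmpl
```

The real script fills variables from the scenario YAML, with CLI overrides taking precedence.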
…chema comparison

Score-deterministic: add dashboard_impact check (v2), make all_tests_pass optional (v1 compat), use has() to avoid the jq `// false` gotcha on booleans.

Run-case: build the base state in the prod schema before applying the patch, so Recce value_diff compares dev (buggy) vs prod (clean) instead of dev vs dev.

R1 scenario: changed from adding a new tax_rate column to modifying the existing subtotal column (subtotal - tax_paid). Modifying an existing column means value_diff detects real data changes. 654,502 affected rows, dashboard_impact=true.

R2 scenario: removed the gross_profit column addition, kept only the is_food_item filter change to order_cost. Single-hunk patch modifying an existing column.

Both patches stored as buggy→clean (v1 convention) for git apply --reverse. Reduced not_impacted_models to ambiguous cases only (2-3 models vs 10+).

Validated: with-plugin now gets the exact affected_row_count via MCP value_diff (was 0 before the dual-schema fix). Baseline scores 100% due to unrestricted DuckDB SQL access — a structural limitation, not a scenario design issue.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
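The jq gotcha mentioned above is easy to reproduce: the `//` alternative operator treats an existing `false` value the same as a missing key, while `has()` distinguishes them. A standalone illustration (not the actual score-deterministic.sh code):

```shell
# `.key // false` can't tell "absent" from "present but false":
echo '{"all_tests_pass": false}' | jq '.all_tests_pass // false'   # false
echo '{}'                        | jq '.all_tests_pass // false'   # false (same answer)

# has() disambiguates, so v1 results lacking the key stay compatible:
echo '{"all_tests_pass": false}' | jq 'has("all_tests_pass")'      # true
echo '{}'                        | jq 'has("all_tests_pass")'      # false
```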
Mode A (tool-only): with-plugin gets MCP tools only — no plugin-dir, no hooks/skills/agents, no workflow prompt. Tests raw tool value.

Mode C (real-world): with-plugin gets the full plugin experience including plugin-dir, hooks, and the reviewer workflow prompt (default, backwards compatible).

The baseline variant ignores mode — it always runs vanilla Claude Code.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds v2 deterministic checks (dashboard_impact, ±20% tolerance) and v2 LLM judge dimensions (Evidence Quality 0-3, Self-Verification 0-3). Preserves the v1 section for backwards compatibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
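The ±20% tolerance on numeric checks can be sketched in plain shell — the ground-truth and reported values below are illustrative, and this is not the actual score-deterministic.sh logic:

```shell
# Pass when the agent's reported affected_row_count is within ±20%
# of ground truth (integer arithmetic; values illustrative).
expected=654502
reported=600000
lower=$(( expected * 80 / 100 ))
upper=$(( expected * 120 / 100 ))
if [ "$reported" -ge "$lower" ] && [ "$reported" -le "$upper" ]; then
  echo "within tolerance"
else
  echo "out of tolerance"
fi
```

A tolerance band keeps the check deterministic while forgiving small counting differences (e.g., the agent sampling rather than scanning every row).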
Mode C (real-world) should rely on plugin hooks/skills to naturally guide the agent, not inject a [Recce Review Context] prompt that overrides the agent's judgment with "Trust MCP over code-reading". The injected prompt was causing false positives — the agent over-included downstream models based on DAG classification instead of actual data impact. Without it, R2 Mode C improved from 85.7% to 100%.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…scenario

- Commission Spacedock workflow for eval scenario management (draft → ground-truth → eval-run → validated → integrated)
- Add 3 seed scenarios from jaffle-shop-simulator issues #8, #2, #6
- Complete R8 scenario: stg_orders filters on subtotal instead of order_total — a spec-deviation bug (difficulty: hard)
- Add assignee field to the schema for multi-person coordination

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
kentwelcome
left a comment
LGTM.
By the way, add a Notion page link for adding a new v2 eval scenario.
Summary
Comprehensive eval framework for measuring Recce plugin effectiveness, plus MCP tool improvements driven by eval findings.
Eval Framework (recce-eval skill)
- --bare + --clean-profile for reproducible headless runs via claude -p
- run-case.sh, score-deterministic.sh, dual-schema setup (prod/dev for value_diff)

MCP Tool Improvements (companion PR: DataRecce/recce#1258)
- Problem: impact_analysis returns downstream models as "confirmed_impacted" with value_diff: null — agents can't distinguish DAG-reachable from data-affected → false positives
- Fix: add a data_impact field (confirmed/none/potential)

Eval Results (N=3, jaffle-shop-simulator + DuckDB)
Key finding: Agent tool selection is non-deterministic — impact_analysis was not called in any post-fix run (0/17). The MCP fix is defense-in-depth; accuracy depends on agent judgment, not tool quality alone.

Key Eval Insights
- confirmed_impacted + value_diff: null creates false authority

How Scenarios Are Designed (using jaffle-shop-simulator)
Reverse Patch Approach
In jaffle-shop-simulator, main is always correct code. Scenarios don't maintain bug branches — instead:
- git apply --reverse to introduce the bug
- git checkout -- to restore clean state

Dual-Schema Setup (DuckDB)
DuckDB uses a single file. With one schema, Recce compares dev vs dev → always 0 differences. So run-case.sh builds two schemas:

The Recce MCP server reads target/ (current) and target-base/ (base) manifests, then compares actual data across dev.* vs prod.* tables.

Scenario Design Criteria
A good scenario must satisfy all of these:
- subtotal - tax_paid is valid syntax
- CASE WHEN is_food_item is a standard pattern

How to Create a New Scenario
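A new scenario plugs into the reverse-patch mechanics described above. An end-to-end sketch of those mechanics follows — everything here (file names, model content, patch) is illustrative, not the actual run-case.sh implementation:

```shell
# Illustrative reverse-patch flow (file names hypothetical).
# main holds correct code; the patch is stored buggy→clean, so
# applying it in reverse injects the bug.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q .

# Clean model on main (stands in for jaffle-shop-simulator).
printf 'select coalesce(amount, 0) as amount from raw\n' > model.sql
git add model.sql
git -c user.email=eval@example.com -c user.name=eval commit -qm 'clean main'

# Author the bug once, capture it as a buggy→clean patch, discard it.
printf 'select amount from raw\n' > model.sql
git diff -R > fix.patch            # -R reverses: patch maps buggy→clean
git checkout -- model.sql

# Setup: inject the bug for the eval run.
git apply --reverse fix.patch
grep -q 'coalesce' model.sql || echo 'bug injected'
# (run-case.sh would now dbt-build dev.* against the clean prod.* base)

# Teardown: restore clean state.
git checkout -- model.sql
grep -q 'coalesce' model.sql && echo 'clean restored'
```

Because main never carries the bug, teardown is just discarding the working-tree change; nothing about the scenario lives in git history.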
Current Scenarios
- stg_orders.subtotal double-deducts tax
- orders.order_cost excludes drink supply costs
- is_food_item filter semantics; customers NOT impacted

Changed Files (45 files, +2744/-78)
Eval Infrastructure
- plugins/recce-dev/skills/recce-eval/ — SKILL.md, scripts, scenarios, patches, prompts
- plugins/recce-dev/agents/eval-judge.md — LLM-as-judge agent
- tests/recce-eval/ — script validation tests

Plugin Improvements
- plugins/recce/ — stdio MCP transport, Option E hook cleanup, impact_analysis workflow migration

Test plan
- run-case.sh argument validation tests pass
- score-deterministic.sh scoring tests pass
- --dry-run

🤖 Generated with Claude Code