
feat(eval): Add multi-agent evaluation framework for plugin testing#2

Closed
saadshahd wants to merge 3 commits into main from feat/evaluation-framework

Conversation

@saadshahd
Owner

Summary

  • Introduces multi-agent evaluation system for testing moo.md plugins across model releases
  • Follows Anthropic's pattern of specialized evaluator agents (plugin-validator, skill-reviewer)
  • Phase 1 of 4-phase implementation plan

What's Included

Evaluator Agents

  • trigger-evaluator: Validates skill auto-triggering accuracy
  • orchestrator: Coordinates parallel evaluation runs

Test Cases (5)

  • hope:soul - planning tasks
  • hope:gate - completion claims
  • hope:trace - debugging/root cause
  • product:product - PRD requests
  • wordsmith:writing - edit requests

Infrastructure

  • /eval command entry point
  • run-eval.sh - Full evaluation runner
  • run-spike.sh - Architecture validation
  • Results directory with .gitignore

Initial Test Results

| Test | Skill | Verdict |
| --- | --- | --- |
| hope-trace | hope:trace | ✅ PASS |
| hope-gate | hope:gate | ✅ PASS |
| hope-soul | hope:soul | ❌ FAIL |
| wordsmith-edit | wordsmith:writing | ❌ FAIL |
| product-prd | product:product | ❌ FAIL |

Pass Rate: 40% (2/5)

This provides valuable signal for improving skill trigger descriptions.

Test plan

  • Spike validation passed (3/3 criteria)
  • Full evaluation runs all 5 test cases
  • Evaluator agents return structured JSON
  • Report generation works
  • Manual review of failed test outputs

Next Phases

  • Phase 2: Compliance + Quality evaluators
  • Phase 3: Regression detection with baselines
  • Phase 4: CI/CD integration (GitHub Actions)

Introduces a multi-agent evaluation system for testing moo.md plugins
across model releases, following Anthropic's pattern of specialized
evaluator agents.

Phase 1 delivers:
- Trigger evaluator agent for skill auto-triggering validation
- Orchestrator agent for coordinating evaluation runs
- 5 skill trigger test cases (all plugins)
- /eval command entry point
- Shell scripts for CI/CD integration

Initial test results show 40% pass rate (2/5 skills trigger reliably),
providing valuable signal for improving skill descriptions.

Structure:
  eval/
  ├── agents/         # Evaluator agents
  ├── cases/          # Test case definitions (YAML)
  ├── scripts/        # Shell runners
  └── results/        # Run outputs (gitignored)
@saadshahd
Owner Author

@claude review

@claude

claude bot commented Dec 15, 2025

Claude encountered an error (View job)


I'll analyze this and get back to you.

@claude

claude bot commented Dec 15, 2025

Code Review: Evaluation Framework

Overview

This PR introduces a well-architected multi-agent evaluation system for testing moo.md plugins. Overall solid work with some areas for improvement.


Strengths

  1. Strong Architecture - Multi-agent pattern with clear separation of concerns
  2. Good Documentation - Clear README and comprehensive /eval command docs
  3. YAML Structure - Well-organized test cases with metadata and tagging

Issues and Recommendations

1. CRITICAL: YAML Frontmatter Violation

Location: .claude/commands/eval.md, eval/agents/*.md

Issue: Agent frontmatter uses very long single-line descriptions. Per CLAUDE.md line 63, descriptions should be single focused sentences.

Recommendation: Condense descriptions from multi-sentence paragraphs to single focused lines.


2. Shell Script Robustness

File: eval/scripts/run-eval.sh

Issues:

  • Line 155-157: Script appears truncated in diff
  • Lines 66-78: Using grep/sed for YAML parsing is brittle. Consider yq for robust parsing
  • Line 91: Misleading comment about process substitution
  • Error handling: Some commands use || true which defeats set -e

3. Security Concerns

File: eval/scripts/run-eval.sh (Lines 89-95, 136-144)

Issue: Unsanitized YAML content passed directly to claude -p command.

Risk: Command injection if YAML contains shell metacharacters.

Recommendation: Use yq for proper escaping or validate/sanitize inputs.
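The quoting concern above can be sketched in a few lines. This is a minimal illustration, not the actual script: `run_prompt` is a hypothetical stand-in for the real `claude -p` call, and the prompt value is invented to contain shell metacharacters.

```shell
#!/usr/bin/env bash
# Sketch: a prompt read from YAML may contain shell metacharacters.
# Passing it as one quoted argv element keeps it inert; building a
# command string and eval-ing it would execute the $(...) substitution.
set -euo pipefail

PROMPT='summarize this; $(touch /tmp/pwned) `id`'

run_prompt() {
  # Stand-in for `claude -p "$1"`: echoes the argument verbatim.
  printf '%s\n' "$1"
}

# Safe: the metacharacters arrive at the program as literal text.
run_prompt "$PROMPT"

# Unsafe (do NOT do this): eval "claude -p $PROMPT" would run $(...).
```

With proper double-quoting the expansion happens exactly once, so no sanitization step is strictly required; yq only becomes necessary once values are interpolated into larger strings.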


4. Missing CHANGELOG Entry

Per CLAUDE.md:100-116: User-facing changes need CHANGELOG.md entries.

Issue: /eval command added but no CHANGELOG.md entry visible.

Recommendation: Add under [Unreleased] section.


5. File Naming Convention

Issue: eval/agents/*.md files not following standard plugin structure.

Recommendation: Either create eval/.claude-plugin/plugin.json or document as framework exception.


6. Test Coverage Gaps

Current: 5 test cases, 40% pass rate, no framework tests

Missing:

  • Meta-tests for evaluator accuracy
  • Negative tests (skills that should NOT trigger)
  • Edge cases (malformed YAML, failures)
  • Integration tests

Minor Issues

  1. Token Efficiency - Agent files verbose (155 lines). CLAUDE.md emphasizes brevity.
  2. Hardcoded Paths - Consider using git rev-parse --show-toplevel
  3. Magic Numbers - Timeout values (120000ms) not configurable
  4. Incomplete Script - run-spike.sh appears truncated at line 105

Performance Notes

  • Good parallel execution but hardcoded concurrency
  • Excellent cost tracking - consider budget warnings
  • No caching for identical prompts

Actionable Next Steps

Before Merge:

  1. Fix YAML frontmatter (condense descriptions)
  2. Add CHANGELOG.md entry
  3. Verify scripts are complete (not truncated)
  4. Consider yq for YAML parsing

Nice to Have:

  5. Add negative test cases
  6. Document eval framework structure
  7. Add cost budget config
  8. Meta-tests for evaluators


Overall Assessment

Code Quality: 7/10 - Good architecture, scripts need hardening
Security: 6/10 - Command injection risk
Test Coverage: 5/10 - Good start, missing meta-tests

Recommendation: Approve with requested changes

Address YAML frontmatter and CHANGELOG before merge. Shell improvements can be follow-up.


Questions

  1. Why not use yq for YAML parsing?
  2. Should eval be a formal plugin?
  3. Target pass rate for Phase 1?
  4. Plans for regression baselines (Phase 3)?

Great work! This framework will be valuable for plugin quality.

- Compress skill descriptions for cleaner triggering:
  - wordsmith:writing: Simplified to keyword-based triggers
  - product:product: Use specific keywords (PRD, OKRs, etc.)

- Add deferral logic to hope:soul for domain-specific skills:
  - Soul now defers to wordsmith, product, career, founder when appropriate

- Fix trigger-evaluator to detect permission_denials:
  - Permission denial = skill triggered (test infrastructure limitation)
  - Add special case for hope:soul (injected via hook, no Skill call)

- Fix test case YAML parsing (multi-line → single line prompt)

Results: 4-5/5 tests now passing (wordsmith occasionally has flaky evaluator)
@claude

claude bot commented Dec 15, 2025

PR Review: Multi-Agent Evaluation Framework

Summary

This PR introduces a well-architected evaluation system for testing moo.md plugins. The implementation follows Anthropic's multi-agent patterns and provides valuable signal for improving skill trigger descriptions. The 40% initial pass rate (2/5 tests) is expected for Phase 1 and highlights areas for improvement.

✅ Strengths

Architecture & Design

  • Clean separation of concerns: Orchestrator coordinates, evaluators specialize, test cases are data-driven
  • Multi-agent pattern: Proper use of specialized agents (trigger-evaluator, orchestrator) with clear interfaces
  • Phased approach: Well-scoped Phase 1 with clear path to Phases 2-4
  • Progressive disclosure: Spike validation → Full eval → CI/CD integration
  • Cost awareness: Haiku for evaluators, cost tracking built-in

Code Quality

  • Robust error handling: Continues on failures, reports errors clearly
  • Shell scripting best practices:
    • set -euo pipefail for safety
    • Process substitution to avoid subshells (line 171)
    • Proper quoting and variable handling
  • JSON structured output: JSONL format for results, proper jq parsing
  • Clear documentation: README, agent descriptions, inline comments

Test Coverage

  • 5 representative test cases covering core skills (hope:soul, hope:gate, hope:trace, product:product, wordsmith:writing)
  • Permission denial handling: Smart detection that denied tool calls = successful triggering
  • Confidence thresholds: Clear 0.70 pass threshold with UNCLEAR state for edge cases

🔧 Issues & Recommendations

Critical Issues

1. CHANGELOG.md Not Updated

  • Impact: High - Per CLAUDE.md guidelines, all user-facing changes require CHANGELOG entry
  • Fix: Add entry under [Unreleased] section:
### Added
- **eval**: Multi-agent evaluation framework for testing plugin auto-triggering
- **eval**: `/eval` command with category filtering and model selection
- **eval**: Trigger-evaluator and orchestrator agents
- **eval**: 5 test cases for core skills (40% initial pass rate)
  • Location: CHANGELOG.md:10-12

2. Multi-line YAML Warning Violation

  • Impact: Medium - CLAUDE.md explicitly prohibits multi-line YAML (| or >)
  • Evidence: Test cases use single-line prompts (good), but agent descriptions might exceed 1024 chars
  • Check: trigger-evaluator.md:2 description is 157 chars ✅, orchestrator.md:2 is 154 chars ✅
  • Status: Actually compliant! But close to limit.

3. Shell YAML Parsing Fragility

  • Impact: Medium - Lines 68-74 use grep/sed/awk for YAML parsing
  • Risk: Breaks with quoted strings containing colons, nested YAML structures
  • Evidence: Line 74 tries to handle multi-line prompts but limited
  • Recommendation: Consider yq for robust YAML parsing or document limitations
  • Location: eval/scripts/run-eval.sh:68-74
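The fragility can be demonstrated with a one-line case. The field name `prompt` is illustrative, not the actual schema; a real fix is a YAML parser such as mikefarah's yq (`yq -r '.prompt' case.yaml`), shown here only in a comment so the sketch stays dependency-free.

```shell
#!/usr/bin/env bash
# Sketch of why naive YAML field extraction breaks on values that
# themselves contain a colon.
set -euo pipefail

LINE='prompt: "Why does this fail: error at line 3?"'

# Naive: strip everything up to the LAST ": " — loses half the value.
NAIVE=$(printf '%s\n' "$LINE" | sed 's/.*: //')

# Slightly safer: split on the FIRST colon only, then trim leading space.
BETTER=$(printf '%s\n' "$LINE" | cut -d: -f2- | sed 's/^ *//')

echo "naive:  $NAIVE"   # drops everything before the inner colon
echo "better: $BETTER"  # keeps the full quoted value
# Robust fix: yq -r '.prompt' case.yaml (handles quoting and nesting)
```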

Security Concerns

4. Command Injection Risk (Low)

  • Impact: Low - Test prompts are passed to claude -p via variable expansion
  • Evidence: Line 89 uses claude -p "$PROMPT" with proper quoting ✅
  • Line 117-144: EVAL_PROMPT uses heredoc properly ✅
  • Status: Safe as-is, but worth noting for future expansion

5. No Input Validation on Test Cases

  • Impact: Low-Medium - Malformed YAML could cause unexpected behavior
  • Current: Skips on parse failure (line 77-80) ✅
  • Recommendation: Add schema validation for test case YAML

Performance & Efficiency

6. Sequential Test Execution

  • Impact: Medium - Tests run sequentially (line 61 while loop)
  • Note: /eval command advertises --parallel N option (line 26) but scripts don't implement it
  • Recommendation: Phase 2 could add GNU Parallel or xargs -P for concurrent test execution
  • Location: eval/scripts/run-eval.sh:61-171, .claude/commands/eval.md:26
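A Phase 2 parallel runner could be as small as the sketch below. `run_case` is a hypothetical stand-in for the real per-test logic, and the case file names are invented; only the `xargs -P` fan-out pattern is the point.

```shell
#!/usr/bin/env bash
# Sketch: run test cases concurrently with xargs -P instead of a
# sequential while loop.
set -euo pipefail

run_case() {
  # Placeholder for the real work, e.g. claude -p "$(yq -r .prompt "$1")"
  echo "ran $1"
}
export -f run_case

# -P 3 runs up to three cases at once; -I{} passes each file name through.
printf '%s\n' case-a.yaml case-b.yaml case-c.yaml \
  | xargs -P 3 -I{} bash -c 'run_case "$@"' _ {}
```

Note that concurrent runs interleave their output, so results should go to per-case files (or JSONL appended under a lock) rather than shared stdout.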

7. No Timeout on Claude CLI Calls

  • Impact: Low - Hung tests could block forever
  • Current: Test cases define timeout_ms: 120000 but not enforced in script
  • Recommendation: Add timeout command wrapper: timeout 120s claude -p ...
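Enforcing the YAML's `timeout_ms` takes one wrapper around coreutils `timeout`. The `sleep` stand-in and the millisecond conversion are assumptions for illustration; the real call would wrap `claude -p ...`.

```shell
#!/usr/bin/env bash
# Sketch: enforce a per-case timeout_ms value with coreutils timeout(1),
# which kills the command and exits 124 when the limit is hit.
set -euo pipefail

run_with_timeout() {
  local ms=$1; shift
  timeout "$(( ms / 1000 ))s" "$@"
}

# 120000 ms -> 120 s; the stand-in finishes well inside the limit.
if run_with_timeout 120000 sleep 0.1; then
  echo "completed"
fi
```

Checking for exit code 124 lets the runner record a distinct `TIMEOUT` verdict instead of a generic failure.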

Code Quality

8. Regex Parsing Brittleness

  • Impact: Low - Lines 161-163 use grep regex to extract JSON fields
  • Better approach: jq parsing (already used elsewhere)
  • Example: Replace line 161 with:
VERDICT=$(echo "$EVAL_RESULT" | jq -r '.verdict // "UNKNOWN"')
  • Location: eval/scripts/run-eval.sh:161-163

9. Inconsistent Error Handling

  • Impact: Low - Some failures use || echo 0, others use || echo ""
  • Lines: 178-181 vs 69
  • Recommendation: Standardize fallback values

10. hope:soul Special Case Hardcoded

  • Impact: Medium - Line 135 has special case logic for hope:soul
  • Issue: Hardcoded in shell script AND not in trigger-evaluator agent
  • Risk: Duplication, maintenance burden
  • Recommendation: Move to test case metadata or agent logic
  • Location: eval/scripts/run-eval.sh:135

Documentation

11. Missing Baseline Directory

  • Impact: Low - README mentions baselines/ dir (line 28) but not created
  • Status: Phase 3 feature, reasonable to defer
  • Recommendation: Add to Phase 3 TODO or remove from README

12. Evaluator Agent Tool Access

  • Impact: Low - trigger-evaluator.md:3 declares tools: Read, Grep but doesn't use them
  • Question: Are these needed? If not, remove for clarity
  • Location: eval/agents/trigger-evaluator.md:3

🎯 Adherence to CLAUDE.md Conventions

✅ Followed Correctly

  • Kebab-case file naming (trigger-evaluator.md, hope-soul-planning.yaml)
  • Single-line frontmatter descriptions (no | or >)
  • Forward slashes only (no backslashes)
  • Reference organization (agents/ not nested)
  • Token efficiency (bullet points, clear terminology)

⚠️ Missing

  • CHANGELOG.md entry (critical requirement per CLAUDE.md:94-102)

🔒 Security Assessment

Overall Risk: LOW

  • No external input sources
  • Proper shell quoting throughout
  • No credential handling
  • Results directory properly gitignored
  • Test isolation (each test independent)

Concerns:

  • Trust boundary: Test case YAML files should be code-reviewed (low risk in controlled repo)
  • Cost control: No max budget enforcement (acceptable for Max plan)

🚀 Performance Considerations

Current:

  • 5 tests × ~10s each = ~50s total (sequential)
  • Haiku for evaluators = low cost (~$0.01 per full suite)

Future Optimizations (Phase 2):

  • Parallel test execution: 5x speedup potential
  • Caching: Reuse test outputs for multiple evaluators
  • Incremental runs: Only changed plugins

📋 Test Coverage Assessment

Current: 5 test cases

  • ✅ Core skills represented (soul, gate, trace)
  • ✅ Domain plugins sampled (product, wordsmith)
  • ❌ Missing: career, founder plugins
  • ❌ Missing: Command evaluation (only skills tested)
  • ❌ Missing: Edge cases (multi-skill scenarios, skill refusal)

Recommendation: Phase 2 should add:

  • 2-3 test cases for career/founder
  • Negative cases (skill should NOT trigger)
  • Multi-turn conversations

🐛 Potential Bugs

1. Empty CASES Variable Edge Case

  • Line 47: CASE_COUNT=$(echo "$CASES" | wc -l | tr -d ' ')
  • Issue: If $CASES is empty, wc -l returns 1 (not 0)
  • Impact: Low - Line 42 exits early on empty
  • Fix: Add | grep -c . instead of wc -l
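The counting bug reproduces in two lines: `echo` always emits a trailing newline, so `wc -l` reports 1 even for an empty variable, while `grep -c .` counts only non-empty lines.

```shell
#!/usr/bin/env bash
# Reproduce the empty-CASES edge case from the review.
set -euo pipefail

CASES=""

# echo "" still prints one newline, so wc -l sees one line.
WC_COUNT=$(echo "$CASES" | wc -l | tr -d ' ')

# grep -c . counts lines with at least one character; exits 1 on zero
# matches, hence the || true under set -e.
GREP_COUNT=$(echo "$CASES" | grep -c . || true)

echo "wc:   $WC_COUNT"    # 1 — wrong
echo "grep: $GREP_COUNT"  # 0 — right
```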

2. TRIGGERED Boolean vs String

  • Line 163: Extracts triggered as string "unknown"
  • Line 168: Uses $TRIGGERED in JSON without quotes
  • Result: {"triggered":unknown} (invalid JSON)
  • Impact: Low - Only affects error cases
  • Fix: Add quotes or use true/false/null
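The invalid-JSON bug and one possible fix can be sketched with jq. The variable name and the string-to-JSON mapping are assumptions; the point is letting jq encode the value instead of splicing a bare string into a template.

```shell
#!/usr/bin/env bash
# Sketch: interpolating an unquoted fallback string produces invalid
# JSON; jq -n builds the object with a proper boolean/null instead.
set -euo pipefail

TRIGGERED="unknown"

# Buggy: {"triggered":unknown} — `unknown` is not a JSON token.
BAD="{\"triggered\":$TRIGGERED}"
echo "$BAD" | jq . >/dev/null 2>&1 && echo "bad parses" || echo "bad rejected"

# Fixed: map the string onto true/false/null and let jq serialize it.
GOOD=$(jq -cn --arg t "$TRIGGERED" \
  '{triggered: (if $t == "true" then true elif $t == "false" then false else null end)}')
echo "$GOOD"
```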

3. bc Dependency

  • Line 125: Uses bc for float addition
  • Issue: bc may not be installed on all systems
  • Impact: Low - Fallback to "N/A" works
  • Fix: Use awk for portable math: awk "BEGIN {print $COST + $EVAL_COST}"
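The portability fix is a one-liner; the cost values here are invented placeholders.

```shell
#!/usr/bin/env bash
# Sketch: awk instead of bc for float addition. awk is specified by
# POSIX and handles floating point natively, so no extra dependency.
set -euo pipefail

COST="0.0042"
EVAL_COST="0.0013"

TOTAL=$(awk -v a="$COST" -v b="$EVAL_COST" 'BEGIN { printf "%.4f", a + b }')
echo "$TOTAL"
```

Passing the values with `-v` also avoids re-quoting them inside the awk program, which is where the `BEGIN {print $COST + ...}` form from the review can go wrong if a variable is empty.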

✨ Excellent Patterns Worth Highlighting

  1. Permission denial = success (trigger-evaluator.md:42-47): Clever insight that blocked tool calls prove triggering
  2. Confidence hierarchies (trigger-evaluator.md:50-58): Clear evidence → confidence mapping
  3. Skill-specific markers (trigger-evaluator.md:97-123): Maintainable marker library
  4. JSONL for results (line 57): Streamable, parseable, robust
  5. Spike validation script: Excellent Phase 0 validation approach
  6. Verdict thresholds (trigger-evaluator.md:91-95): UNCLEAR state prevents false confidence

🎓 Suggested Improvements (Future)

  1. Test case generator: Script to bootstrap new test cases from skill descriptions
  2. Diff reporting: Show what changed between model versions
  3. Flake detection: Retry failed tests to catch non-determinism
  4. HTML reports: Visual dashboards for CI/CD results
  5. Auto-fix suggestions: Failing tests suggest description improvements

Final Verdict

🟢 APPROVE with minor fixes

Confidence: 85%

This is a solid Phase 1 implementation that provides immediate value while setting up infrastructure for Phases 2-4. The architecture is sound, code quality is good, and the evaluation approach is thoughtful.

Required before merge:

  • Add CHANGELOG.md entry

Recommended before merge:

  • Fix JSON boolean bug (line 168)
  • Consider yq for YAML parsing or document limitations

Can defer to Phase 2:

  • Parallel test execution
  • Additional test coverage
  • Timeout enforcement
  • bc → awk portability fix

Key Assumption: Test case YAML files will be code-reviewed (no adversarial inputs)

Alternative Approach: Instead of bash scripts, could use Python + pytest for better YAML parsing, parallel execution, and test framework features (fixtures, parametrization, coverage). Trade-off: Added dependency vs better tooling.

Reversibility: Type 2A - Easy to remove eval/ directory and /eval command without affecting core plugins.

Complexity: 6 story points (medium) - Multi-agent coordination + shell scripting + JSON parsing + test infrastructure


Great work on this eval framework! The phased approach and early validation (spike) demonstrate solid engineering judgment. The 40% pass rate gives excellent signal for improving skill descriptions.

- Replace grep/sed with jq for JSON parsing (more robust)
- Fix triggered boolean output (was creating invalid JSON)
- Add timeout enforcement (120s for tests, 60s for evaluator)
- Remove undocumented --parallel flag from /eval command docs
@claude

claude bot commented Dec 15, 2025

PR Review: Multi-Agent Evaluation Framework

Overview

This is a well-architected addition that introduces sophisticated testing infrastructure for validating moo.md plugin behavior. The multi-agent evaluation pattern is sound and follows good separation of concerns.

Strengths

  1. Architecture: Clean separation between orchestrator and evaluator agents
  2. Code Quality: Bash scripts follow best practices, proper error handling, good use of jq
  3. Documentation: Comprehensive README, clear command docs, follows CLAUDE.md conventions
  4. Conventions: Proper naming (kebab-case), frontmatter, forward slashes, gitignore

Critical Issues

1. CHANGELOG.md Missing (REQUIRED per CLAUDE.md)

Add entry under [Unreleased] section

2. Command Injection Risk (HIGH PRIORITY)

Location: eval/scripts/run-eval.sh:117-144
Issue: Unquoted variable expansion in EVAL_PROMPT could allow command injection
Recommendation: Use proper escaping or safer prompt construction

3. YAML Parsing Fragility (MEDIUM)

Location: eval/scripts/run-eval.sh:68-75
Issue: grep/sed/awk approach won't handle all multi-line YAML correctly
Recommendation: Consider using yq for robust YAML parsing

4. Silent Error Handling (MEDIUM)

Location: eval/scripts/run-eval.sh:161-163
Issue: jq errors suppressed with 2>/dev/null
Recommendation: Log errors before falling back
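One way to log before falling back, sketched with a hypothetical `extract_verdict` helper (the function name and warning text are invented; only the capture-stderr-then-warn pattern is the suggestion):

```shell
#!/usr/bin/env bash
# Sketch: capture jq's error message and surface it on stderr before
# falling back, instead of discarding it with 2>/dev/null.
set -euo pipefail

extract_verdict() {
  local json=$1 err
  if ! err=$(printf '%s' "$json" | jq -r '.verdict // "UNKNOWN"' 2>&1); then
    echo "warn: evaluator output was not JSON: $err" >&2
    echo "UNKNOWN"
    return 0
  fi
  echo "$err"
}

extract_verdict '{"verdict":"PASS"}'   # PASS
extract_verdict 'not json at all'      # UNKNOWN (warning on stderr)
```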

Test Coverage

Current: 5 test cases for skill triggering (appropriate for Phase 1)
Future: Workflow compliance, output quality, negative cases, edge cases

Performance

Sequential execution (~10min for 5 tests) acceptable for Phase 1. Consider parallel execution in Phase 2.

Recommended Actions Before Merge

MUST FIX:

  1. Add CHANGELOG.md entry
  2. Fix command injection in run-eval.sh:117-144

SHOULD FIX:

  3. Improve YAML parsing robustness
  4. Better error logging

NICE TO HAVE:

  5. Test case validation script
  6. Cost estimation in docs

Final Verdict

Approve with required changes. Solid foundational work with sound architecture and high code quality.

Confidence: 85% reliable after fixing command injection
Complexity: 8 story points

Great work! This framework will be valuable for maintaining plugin quality.

@saadshahd
Owner Author

Superseded

Valuable parts extracted to separate PRs:

| PR | Content |
| --- | --- |
| #4 | Skill description improvements (cherry-picked) |
| archive/eval-test-cases branch | Test cases preserved for future promptfoo migration |

What was removed

  • Shell scripts (run-eval.sh, run-spike.sh) - overly complex, duplicates promptfoo
  • Evaluator agents - coupled to shell infrastructure
  • /eval command - depends on removed scripts

The test cases are preserved in archive/eval-test-cases branch for future conversion to promptfoo format.

@saadshahd saadshahd closed this Dec 16, 2025
