
feat(eval): Add multi-agent evaluation framework for plugin testing#2

Closed
saadshahd wants to merge 3 commits into main from feat/evaluation-framework

Conversation

@saadshahd
Owner

Summary

  • Introduces multi-agent evaluation system for testing moo.md plugins across model releases
  • Follows Anthropic's pattern of specialized evaluator agents (plugin-validator, skill-reviewer)
  • Phase 1 of 4-phase implementation plan

What's Included

Evaluator Agents

  • trigger-evaluator: Validates skill auto-triggering accuracy
  • orchestrator: Coordinates parallel evaluation runs

Test Cases (5)

  • hope:soul - planning tasks
  • hope:gate - completion claims
  • hope:trace - debugging/root cause
  • product:product - PRD requests
  • wordsmith:writing - edit requests

Infrastructure

  • /eval command entry point
  • run-eval.sh - Full evaluation runner
  • run-spike.sh - Architecture validation
  • Results directory with .gitignore

Initial Test Results

| Test | Skill | Verdict |
| --- | --- | --- |
| hope-trace | hope:trace | ✅ PASS |
| hope-gate | hope:gate | ✅ PASS |
| hope-soul | hope:soul | ❌ FAIL |
| wordsmith-edit | wordsmith:writing | ❌ FAIL |
| product-prd | product:product | ❌ FAIL |

Pass Rate: 40% (2/5)

This provides valuable signal for improving skill trigger descriptions.

Test plan

  • Spike validation passed (3/3 criteria)
  • Full evaluation runs all 5 test cases
  • Evaluator agents return structured JSON
  • Report generation works
  • Manual review of failed test outputs

Next Phases

  • Phase 2: Compliance + Quality evaluators
  • Phase 3: Regression detection with baselines
  • Phase 4: CI/CD integration (GitHub Actions)

Introduces a multi-agent evaluation system for testing moo.md plugins
across model releases, following Anthropic's pattern of specialized
evaluator agents.

Phase 1 delivers:
- Trigger evaluator agent for skill auto-triggering validation
- Orchestrator agent for coordinating evaluation runs
- 5 skill trigger test cases (all plugins)
- /eval command entry point
- Shell scripts for CI/CD integration

Initial test results show 40% pass rate (2/5 skills trigger reliably),
providing valuable signal for improving skill descriptions.

Structure:
  eval/
  ├── agents/         # Evaluator agents
  ├── cases/          # Test case definitions (YAML)
  ├── scripts/        # Shell runners
  └── results/        # Run outputs (gitignored)
@saadshahd
Owner Author

@claude review

@claude

claude bot commented Dec 15, 2025

Claude encountered an error (View job)


I'll analyze this and get back to you.

@claude

claude bot commented Dec 15, 2025

Code Review: Evaluation Framework

Overview

This PR introduces a well-architected multi-agent evaluation system for testing moo.md plugins. Overall solid work with some areas for improvement.


Strengths

  1. Strong Architecture - Multi-agent pattern with clear separation of concerns
  2. Good Documentation - Clear README and comprehensive /eval command docs
  3. YAML Structure - Well-organized test cases with metadata and tagging

Issues and Recommendations

1. CRITICAL: YAML Frontmatter Violation

Location: .claude/commands/eval.md, eval/agents/*.md

Issue: Agent frontmatter uses very long single-line descriptions. Per CLAUDE.md line 63, descriptions should be single focused sentences.

Recommendation: Condense descriptions from multi-sentence paragraphs to single focused lines.


2. Shell Script Robustness

File: eval/scripts/run-eval.sh

Issues:

  • Line 155-157: Script appears truncated in diff
  • Lines 66-78: Using grep/sed for YAML parsing is brittle. Consider yq for robust parsing
  • Line 91: Misleading comment about process substitution
  • Error handling: Some commands use || true which defeats set -e

3. Security Concerns

File: eval/scripts/run-eval.sh (Lines 89-95, 136-144)

Issue: Unsanitized YAML content passed directly to claude -p command.

Risk: Command injection if YAML contains shell metacharacters.

Recommendation: Use yq for proper escaping or validate/sanitize inputs.
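The quoting concern above can be sketched in a few lines. This is a minimal illustration, not the actual script: `run_prompt` is a hypothetical stand-in for the real `claude -p` call, and the prompt value is invented to contain shell metacharacters.

```shell
#!/usr/bin/env bash
# Sketch: a prompt read from YAML may contain shell metacharacters.
# Passing it as one quoted argv element keeps it inert; building a
# command string and eval-ing it would execute the $(...) substitution.
set -euo pipefail

PROMPT='summarize this; $(touch /tmp/pwned) `id`'

run_prompt() {
  # Stand-in for `claude -p "$1"`: echoes the argument verbatim.
  printf '%s\n' "$1"
}

# Safe: the metacharacters arrive at the program as literal text.
run_prompt "$PROMPT"

# Unsafe (do NOT do this): eval "claude -p $PROMPT" would run $(...).
```

With proper double-quoting the expansion happens exactly once, so no sanitization step is strictly required; yq only becomes necessary once values are interpolated into larger strings.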


4. Missing CHANGELOG Entry

Per CLAUDE.md:100-116: User-facing changes need CHANGELOG.md entries.

Issue: /eval command added but no CHANGELOG.md entry visible.

Recommendation: Add under [Unreleased] section.


5. File Naming Convention

Issue: eval/agents/*.md files not following standard plugin structure.

Recommendation: Either create eval/.claude-plugin/plugin.json or document as framework exception.


6. Test Coverage Gaps

Current: 5 test cases, 40% pass rate, no framework tests

Missing:

  • Meta-tests for evaluator accuracy
  • Negative tests (skills that should NOT trigger)
  • Edge cases (malformed YAML, failures)
  • Integration tests

Minor Issues

  1. Token Efficiency - Agent files verbose (155 lines). CLAUDE.md emphasizes brevity.
  2. Hardcoded Paths - Consider using git rev-parse --show-toplevel
  3. Magic Numbers - Timeout values (120000ms) not configurable
  4. Incomplete Script - run-spike.sh appears truncated at line 105

Performance Notes

  • Good parallel execution but hardcoded concurrency
  • Excellent cost tracking - consider budget warnings
  • No caching for identical prompts

Actionable Next Steps

Before Merge:

  1. Fix YAML frontmatter (condense descriptions)
  2. Add CHANGELOG.md entry
  3. Verify scripts are complete (not truncated)
  4. Consider yq for YAML parsing

Nice to Have:

  5. Add negative test cases
  6. Document eval framework structure
  7. Add cost budget config
  8. Meta-tests for evaluators


Overall Assessment

Code Quality: 7/10 - Good architecture, scripts need hardening
Security: 6/10 - Command injection risk
Test Coverage: 5/10 - Good start, missing meta-tests

Recommendation: Approve with requested changes

Address YAML frontmatter and CHANGELOG before merge. Shell improvements can be follow-up.


Questions

  1. Why not use yq for YAML parsing?
  2. Should eval be a formal plugin?
  3. Target pass rate for Phase 1?
  4. Plans for regression baselines (Phase 3)?

Great work! This framework will be valuable for plugin quality.

- Compress skill descriptions for cleaner triggering:
  - wordsmith:writing: Simplified to keyword-based triggers
  - product:product: Use specific keywords (PRD, OKRs, etc.)

- Add deferral logic to hope:soul for domain-specific skills:
  - Soul now defers to wordsmith, product, career, founder when appropriate

- Fix trigger-evaluator to detect permission_denials:
  - Permission denial = skill triggered (test infrastructure limitation)
  - Add special case for hope:soul (injected via hook, no Skill call)

- Fix test case YAML parsing (multi-line → single line prompt)

Results: 4-5/5 tests now passing (wordsmith occasionally has flaky evaluator)
@claude

claude bot commented Dec 15, 2025

PR Review: Multi-Agent Evaluation Framework

Summary

This PR introduces a well-architected evaluation system for testing moo.md plugins. The implementation follows Anthropic's multi-agent patterns and provides valuable signal for improving skill trigger descriptions. The 40% initial pass rate (2/5 tests) is expected for Phase 1 and highlights areas for improvement.

✅ Strengths

Architecture & Design

  • Clean separation of concerns: Orchestrator coordinates, evaluators specialize, test cases are data-driven
  • Multi-agent pattern: Proper use of specialized agents (trigger-evaluator, orchestrator) with clear interfaces
  • Phased approach: Well-scoped Phase 1 with clear path to Phases 2-4
  • Progressive disclosure: Spike validation → Full eval → CI/CD integration
  • Cost awareness: Haiku for evaluators, cost tracking built-in

Code Quality

  • Robust error handling: Continues on failures, reports errors clearly
  • Shell scripting best practices:
    • set -euo pipefail for safety
    • Process substitution to avoid subshells (line 171)
    • Proper quoting and variable handling
  • JSON structured output: JSONL format for results, proper jq parsing
  • Clear documentation: README, agent descriptions, inline comments

Test Coverage

  • 5 representative test cases covering core skills (hope:soul, hope:gate, hope:trace, product:product, wordsmith:writing)
  • Permission denial handling: Smart detection that denied tool calls = successful triggering
  • Confidence thresholds: Clear 0.70 pass threshold with UNCLEAR state for edge cases

🔧 Issues & Recommendations

Critical Issues

1. CHANGELOG.md Not Updated

  • Impact: High - Per CLAUDE.md guidelines, all user-facing changes require CHANGELOG entry
  • Fix: Add entry under [Unreleased] section:
### Added
- **eval**: Multi-agent evaluation framework for testing plugin auto-triggering
- **eval**: `/eval` command with category filtering and model selection
- **eval**: Trigger-evaluator and orchestrator agents
- **eval**: 5 test cases for core skills (40% initial pass rate)
  • Location: CHANGELOG.md:10-12

2. Multi-line YAML Warning Violation

  • Impact: Medium - CLAUDE.md explicitly prohibits multi-line YAML (| or >)
  • Evidence: Test cases use single-line prompts (good), but agent descriptions might exceed 1024 chars
  • Check: trigger-evaluator.md:2 description is 157 chars ✅, orchestrator.md:2 is 154 chars ✅
  • Status: Actually compliant! But close to limit.

3. Shell YAML Parsing Fragility

  • Impact: Medium - Lines 68-74 use grep/sed/awk for YAML parsing
  • Risk: Breaks with quoted strings containing colons, nested YAML structures
  • Evidence: Line 74 tries to handle multi-line prompts but limited
  • Recommendation: Consider yq for robust YAML parsing or document limitations
  • Location: eval/scripts/run-eval.sh:68-74
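The fragility can be demonstrated with a one-line case. The field name `prompt` is illustrative, not the actual schema; a real fix is a YAML parser such as mikefarah's yq (`yq -r '.prompt' case.yaml`), shown here only in a comment so the sketch stays dependency-free.

```shell
#!/usr/bin/env bash
# Sketch of why naive YAML field extraction breaks on values that
# themselves contain a colon.
set -euo pipefail

LINE='prompt: "Why does this fail: error at line 3?"'

# Naive: strip everything up to the LAST ": " — loses half the value.
NAIVE=$(printf '%s\n' "$LINE" | sed 's/.*: //')

# Slightly safer: split on the FIRST colon only, then trim leading space.
BETTER=$(printf '%s\n' "$LINE" | cut -d: -f2- | sed 's/^ *//')

echo "naive:  $NAIVE"   # drops everything before the inner colon
echo "better: $BETTER"  # keeps the full quoted value
# Robust fix: yq -r '.prompt' case.yaml (handles quoting and nesting)
```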

Security Concerns

4. Command Injection Risk (Low)

  • Impact: Low - Test prompts are passed to claude -p via variable expansion
  • Evidence: Line 89 uses claude -p "$PROMPT" with proper quoting ✅
  • Line 117-144: EVAL_PROMPT uses heredoc properly ✅
  • Status: Safe as-is, but worth noting for future expansion

5. No Input Validation on Test Cases

  • Impact: Low-Medium - Malformed YAML could cause unexpected behavior
  • Current: Skips on parse failure (line 77-80) ✅
  • Recommendation: Add schema validation for test case YAML

Performance & Efficiency

6. Sequential Test Execution

  • Impact: Medium - Tests run sequentially (line 61 while loop)
  • Note: /eval command advertises --parallel N option (line 26) but scripts don't implement it
  • Recommendation: Phase 2 could add GNU Parallel or xargs -P for concurrent test execution
  • Location: eval/scripts/run-eval.sh:61-171, .claude/commands/eval.md:26
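A Phase 2 parallel runner could be as small as the sketch below. `run_case` is a hypothetical stand-in for the real per-test logic, and the case file names are invented; only the `xargs -P` fan-out pattern is the point.

```shell
#!/usr/bin/env bash
# Sketch: run test cases concurrently with xargs -P instead of a
# sequential while loop.
set -euo pipefail

run_case() {
  # Placeholder for the real work, e.g. claude -p "$(yq -r .prompt "$1")"
  echo "ran $1"
}
export -f run_case

# -P 3 runs up to three cases at once; -I{} passes each file name through.
printf '%s\n' case-a.yaml case-b.yaml case-c.yaml \
  | xargs -P 3 -I{} bash -c 'run_case "$@"' _ {}
```

Note that concurrent runs interleave their output, so results should go to per-case files (or JSONL appended under a lock) rather than shared stdout.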

7. No Timeout on Claude CLI Calls

  • Impact: Low - Hung tests could block forever
  • Current: Test cases define timeout_ms: 120000 but not enforced in script
  • Recommendation: Add timeout command wrapper: timeout 120s claude -p ...
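Enforcing the YAML's `timeout_ms` takes one wrapper around coreutils `timeout`. The `sleep` stand-in and the millisecond conversion are assumptions for illustration; the real call would wrap `claude -p ...`.

```shell
#!/usr/bin/env bash
# Sketch: enforce a per-case timeout_ms value with coreutils timeout(1),
# which kills the command and exits 124 when the limit is hit.
set -euo pipefail

run_with_timeout() {
  local ms=$1; shift
  timeout "$(( ms / 1000 ))s" "$@"
}

# 120000 ms -> 120 s; the stand-in finishes well inside the limit.
if run_with_timeout 120000 sleep 0.1; then
  echo "completed"
fi
```

Checking for exit code 124 lets the runner record a distinct `TIMEOUT` verdict instead of a generic failure.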

Code Quality

8. Regex Parsing Brittleness

  • Impact: Low - Lines 161-163 use grep regex to extract JSON fields
  • Better approach: jq parsing (already used elsewhere)
  • Example: Replace line 161 with:
VERDICT=$(echo "$EVAL_RESULT" | jq -r '.verdict // "UNKNOWN"')
  • Location: eval/scripts/run-eval.sh:161-163

9. Inconsistent Error Handling

  • Impact: Low - Some failures use || echo 0, others use || echo ""
  • Lines: 178-181 vs 69
  • Recommendation: Standardize fallback values

10. hope:soul Special Case Hardcoded

  • Impact: Medium - Line 135 has special case logic for hope:soul
  • Issue: Hardcoded in shell script AND not in trigger-evaluator agent
  • Risk: Duplication, maintenance burden
  • Recommendation: Move to test case metadata or agent logic
  • Location: eval/scripts/run-eval.sh:135

Documentation

11. Missing Baseline Directory

  • Impact: Low - README mentions baselines/ dir (line 28) but not created
  • Status: Phase 3 feature, reasonable to defer
  • Recommendation: Add to Phase 3 TODO or remove from README

12. Evaluator Agent Tool Access

  • Impact: Low - trigger-evaluator.md:3 declares tools: Read, Grep but doesn't use them
  • Question: Are these needed? If not, remove for clarity
  • Location: eval/agents/trigger-evaluator.md:3

🎯 Adherence to CLAUDE.md Conventions

✅ Followed Correctly

  • Kebab-case file naming (trigger-evaluator.md, hope-soul-planning.yaml)
  • Single-line frontmatter descriptions (no | or >)
  • Forward slashes only (no backslashes)
  • Reference organization (agents/ not nested)
  • Token efficiency (bullet points, clear terminology)

⚠️ Missing

  • CHANGELOG.md entry (critical requirement per CLAUDE.md:94-102)

🔒 Security Assessment

Overall Risk: LOW

  • No external input sources
  • Proper shell quoting throughout
  • No credential handling
  • Results directory properly gitignored
  • Test isolation (each test independent)

Concerns:

  • Trust boundary: Test case YAML files should be code-reviewed (low risk in controlled repo)
  • Cost control: No max budget enforcement (acceptable for Max plan)

🚀 Performance Considerations

Current:

  • 5 tests × ~10s each = ~50s total (sequential)
  • Haiku for evaluators = low cost (~$0.01 per full suite)

Future Optimizations (Phase 2):

  • Parallel test execution: 5x speedup potential
  • Caching: Reuse test outputs for multiple evaluators
  • Incremental runs: Only changed plugins

📋 Test Coverage Assessment

Current: 5 test cases

  • ✅ Core skills represented (soul, gate, trace)
  • ✅ Domain plugins sampled (product, wordsmith)
  • ❌ Missing: career, founder plugins
  • ❌ Missing: Command evaluation (only skills tested)
  • ❌ Missing: Edge cases (multi-skill scenarios, skill refusal)

Recommendation: Phase 2 should add:

  • 2-3 test cases for career/founder
  • Negative cases (skill should NOT trigger)
  • Multi-turn conversations

🐛 Potential Bugs

1. Empty CASES Variable Edge Case

  • Line 47: CASE_COUNT=$(echo "$CASES" | wc -l | tr -d ' ')
  • Issue: If $CASES is empty, wc -l returns 1 (not 0)
  • Impact: Low - Line 42 exits early on empty
  • Fix: Add | grep -c . instead of wc -l
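The counting bug reproduces in two lines: `echo` always emits a trailing newline, so `wc -l` reports 1 even for an empty variable, while `grep -c .` counts only non-empty lines.

```shell
#!/usr/bin/env bash
# Reproduce the empty-CASES edge case from the review.
set -euo pipefail

CASES=""

# echo "" still prints one newline, so wc -l sees one line.
WC_COUNT=$(echo "$CASES" | wc -l | tr -d ' ')

# grep -c . counts lines with at least one character; exits 1 on zero
# matches, hence the || true under set -e.
GREP_COUNT=$(echo "$CASES" | grep -c . || true)

echo "wc:   $WC_COUNT"    # 1 — wrong
echo "grep: $GREP_COUNT"  # 0 — right
```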

2. TRIGGERED Boolean vs String

  • Line 163: Extracts triggered as string "unknown"
  • Line 168: Uses $TRIGGERED in JSON without quotes
  • Result: {"triggered":unknown} (invalid JSON)
  • Impact: Low - Only affects error cases
  • Fix: Add quotes or use true/false/null
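The invalid-JSON bug and one possible fix can be sketched with jq. The variable name and the string-to-JSON mapping are assumptions; the point is letting jq encode the value instead of splicing a bare string into a template.

```shell
#!/usr/bin/env bash
# Sketch: interpolating an unquoted fallback string produces invalid
# JSON; jq -n builds the object with a proper boolean/null instead.
set -euo pipefail

TRIGGERED="unknown"

# Buggy: {"triggered":unknown} — `unknown` is not a JSON token.
BAD="{\"triggered\":$TRIGGERED}"
echo "$BAD" | jq . >/dev/null 2>&1 && echo "bad parses" || echo "bad rejected"

# Fixed: map the string onto true/false/null and let jq serialize it.
GOOD=$(jq -cn --arg t "$TRIGGERED" \
  '{triggered: (if $t == "true" then true elif $t == "false" then false else null end)}')
echo "$GOOD"
```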

3. bc Dependency

  • Line 125: Uses bc for float addition
  • Issue: bc may not be installed on all systems
  • Impact: Low - Fallback to "N/A" works
  • Fix: Use awk for portable math: awk "BEGIN {print $COST + $EVAL_COST}"
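The portability fix is a one-liner; the cost values here are invented placeholders.

```shell
#!/usr/bin/env bash
# Sketch: awk instead of bc for float addition. awk is specified by
# POSIX and handles floating point natively, so no extra dependency.
set -euo pipefail

COST="0.0042"
EVAL_COST="0.0013"

TOTAL=$(awk -v a="$COST" -v b="$EVAL_COST" 'BEGIN { printf "%.4f", a + b }')
echo "$TOTAL"
```

Passing the values with `-v` also avoids re-quoting them inside the awk program, which is where the `BEGIN {print $COST + ...}` form from the review can go wrong if a variable is empty.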

✨ Excellent Patterns Worth Highlighting

  1. Permission denial = success (trigger-evaluator.md:42-47): Clever insight that blocked tool calls prove triggering
  2. Confidence hierarchies (trigger-evaluator.md:50-58): Clear evidence → confidence mapping
  3. Skill-specific markers (trigger-evaluator.md:97-123): Maintainable marker library
  4. JSONL for results (line 57): Streamable, parseable, robust
  5. Spike validation script: Excellent Phase 0 validation approach
  6. Verdict thresholds (trigger-evaluator.md:91-95): UNCLEAR state prevents false confidence

🎓 Suggested Improvements (Future)

  1. Test case generator: Script to bootstrap new test cases from skill descriptions
  2. Diff reporting: Show what changed between model versions
  3. Flake detection: Retry failed tests to catch non-determinism
  4. HTML reports: Visual dashboards for CI/CD results
  5. Auto-fix suggestions: Failing tests suggest description improvements

Final Verdict

🟢 APPROVE with minor fixes

Confidence: 85%

This is a solid Phase 1 implementation that provides immediate value while setting up infrastructure for Phases 2-4. The architecture is sound, code quality is good, and the evaluation approach is thoughtful.

Required before merge:

  • Add CHANGELOG.md entry

Recommended before merge:

  • Fix JSON boolean bug (line 168)
  • Consider yq for YAML parsing or document limitations

Can defer to Phase 2:

  • Parallel test execution
  • Additional test coverage
  • Timeout enforcement
  • bc → awk portability fix

Key Assumption: Test case YAML files will be code-reviewed (no adversarial inputs)

Alternative Approach: Instead of bash scripts, could use Python + pytest for better YAML parsing, parallel execution, and test framework features (fixtures, parametrization, coverage). Trade-off: Added dependency vs better tooling.

Reversibility: Type 2A - Easy to remove eval/ directory and /eval command without affecting core plugins.

Complexity: 6 story points (medium) - Multi-agent coordination + shell scripting + JSON parsing + test infrastructure


Great work on this eval framework! The phased approach and early validation (spike) demonstrate solid engineering judgment. The 40% pass rate gives excellent signal for improving skill descriptions.

- Replace grep/sed with jq for JSON parsing (more robust)
- Fix triggered boolean output (was creating invalid JSON)
- Add timeout enforcement (120s for tests, 60s for evaluator)
- Remove undocumented --parallel flag from /eval command docs
@claude

claude bot commented Dec 15, 2025

PR Review: Multi-Agent Evaluation Framework

Overview

This is a well-architected addition that introduces sophisticated testing infrastructure for validating moo.md plugin behavior. The multi-agent evaluation pattern is sound and follows good separation of concerns.

Strengths

  1. Architecture: Clean separation between orchestrator and evaluator agents
  2. Code Quality: Bash scripts follow best practices, proper error handling, good use of jq
  3. Documentation: Comprehensive README, clear command docs, follows CLAUDE.md conventions
  4. Conventions: Proper naming (kebab-case), frontmatter, forward slashes, gitignore

Critical Issues

1. CHANGELOG.md Missing (REQUIRED per CLAUDE.md)

Add entry under [Unreleased] section

2. Command Injection Risk (HIGH PRIORITY)

Location: eval/scripts/run-eval.sh:117-144
Issue: Unquoted variable expansion in EVAL_PROMPT could allow command injection
Recommendation: Use proper escaping or safer prompt construction

3. YAML Parsing Fragility (MEDIUM)

Location: eval/scripts/run-eval.sh:68-75
Issue: grep/sed/awk approach won't handle all multi-line YAML correctly
Recommendation: Consider using yq for robust YAML parsing

4. Silent Error Handling (MEDIUM)

Location: eval/scripts/run-eval.sh:161-163
Issue: jq errors suppressed with 2>/dev/null
Recommendation: Log errors before falling back
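One way to log before falling back, sketched with a hypothetical `extract_verdict` helper (the function name and warning text are invented; only the capture-stderr-then-warn pattern is the suggestion):

```shell
#!/usr/bin/env bash
# Sketch: capture jq's error message and surface it on stderr before
# falling back, instead of discarding it with 2>/dev/null.
set -euo pipefail

extract_verdict() {
  local json=$1 err
  if ! err=$(printf '%s' "$json" | jq -r '.verdict // "UNKNOWN"' 2>&1); then
    echo "warn: evaluator output was not JSON: $err" >&2
    echo "UNKNOWN"
    return 0
  fi
  echo "$err"
}

extract_verdict '{"verdict":"PASS"}'   # PASS
extract_verdict 'not json at all'      # UNKNOWN (warning on stderr)
```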

Test Coverage

Current: 5 test cases for skill triggering (appropriate for Phase 1)
Future: Workflow compliance, output quality, negative cases, edge cases

Performance

Sequential execution (~10min for 5 tests) acceptable for Phase 1. Consider parallel execution in Phase 2.

Recommended Actions Before Merge

MUST FIX:

  1. Add CHANGELOG.md entry
  2. Fix command injection in run-eval.sh:117-144

SHOULD FIX:

  3. Improve YAML parsing robustness
  4. Better error logging

NICE TO HAVE:

  5. Test case validation script
  6. Cost estimation in docs

Final Verdict

Approve with required changes. Solid foundational work with sound architecture and high code quality.

Confidence: 85% reliable after fixing command injection
Complexity: 8 story points

Great work! This framework will be valuable for maintaining plugin quality.

@saadshahd
Owner Author

Superseded

Valuable parts extracted to separate PRs:

| PR | Content |
| --- | --- |
| #4 | Skill description improvements (cherry-picked) |
| archive/eval-test-cases branch | Test cases preserved for future promptfoo migration |

What was removed

  • Shell scripts (run-eval.sh, run-spike.sh) - overly complex, duplicates promptfoo
  • Evaluator agents - coupled to shell infrastructure
  • /eval command - depends on removed scripts

The test cases are preserved in archive/eval-test-cases branch for future conversion to promptfoo format.

@saadshahd saadshahd closed this Dec 16, 2025
