feat(eval): Add multi-agent evaluation framework for plugin testing#2
Conversation
Introduces a multi-agent evaluation system for testing moo.md plugins across model releases, following Anthropic's pattern of specialized evaluator agents.

Phase 1 delivers:
- Trigger evaluator agent for skill auto-triggering validation
- Orchestrator agent for coordinating evaluation runs
- 5 skill trigger test cases (all plugins)
- `/eval` command entry point
- Shell scripts for CI/CD integration

Initial test results show a 40% pass rate (2/5 skills trigger reliably), providing valuable signal for improving skill descriptions.

Structure:

```
eval/
├── agents/    # Evaluator agents
├── cases/     # Test case definitions (YAML)
├── scripts/   # Shell runners
└── results/   # Run outputs (gitignored)
```
@claude review
Claude encountered an error (View job). I'll analyze this and get back to you.
### Code Review: Evaluation Framework

#### Overview

This PR introduces a well-architected multi-agent evaluation system for testing moo.md plugins. Overall solid work, with some areas for improvement.

#### Strengths
#### Issues and Recommendations

**1. CRITICAL: YAML Frontmatter Violation**

Location: `.claude/commands/eval.md`, `eval/agents/*.md`

Issue: Agent frontmatter uses very long single-line descriptions. Per CLAUDE.md line 63, descriptions should be single focused sentences.

Recommendation: Condense descriptions from multi-sentence paragraphs to single focused lines.

**2. Shell Script Robustness**

File: `eval/scripts/run-eval.sh`

Issues:
**3. Security Concerns**

File: `eval/scripts/run-eval.sh` (lines 89-95, 136-144)

Issue: Unsanitized YAML content passed directly to the `claude -p` command.

Risk: Command injection if the YAML contains shell metacharacters.

Recommendation: Use `yq` for proper escaping, or validate/sanitize inputs.

**4. Missing CHANGELOG Entry**

Per CLAUDE.md:100-116, user-facing changes need CHANGELOG.md entries.

Issue: The `/eval` command is added but no CHANGELOG.md entry is visible.

Recommendation: Add one under the `[Unreleased]` section.

**5. File Naming Convention**

Issue: `eval/agents/*.md` files do not follow the standard plugin structure.

Recommendation: Either create `eval/.claude-plugin/plugin.json` or document the directory as a framework exception.

**6. Test Coverage Gaps**

Current: 5 test cases, 40% pass rate, no framework tests.

Missing:
#### Minor Issues
#### Performance Notes
#### Actionable Next Steps

Before merge:
Nice to have:

#### Overall Assessment

Code Quality: 7/10 - good architecture, scripts need hardening.

Recommendation: **Approve with requested changes**. Address the YAML frontmatter and CHANGELOG before merge; shell improvements can be a follow-up.

#### Questions
Great work! This framework will be valuable for plugin quality.
- Compress skill descriptions for cleaner triggering:
  - wordsmith:writing: simplified to keyword-based triggers
  - product:product: use specific keywords (PRD, OKRs, etc.)
- Add deferral logic to hope:soul for domain-specific skills:
  - Soul now defers to wordsmith, product, career, and founder when appropriate
- Fix trigger-evaluator to detect permission_denials:
  - Permission denial = skill triggered (test infrastructure limitation)
  - Add special case for hope:soul (injected via hook, no Skill call)
- Fix test case YAML parsing (multi-line → single-line prompt)

Results: 4-5/5 tests now passing (wordsmith occasionally has a flaky evaluator)
### PR Review: Multi-Agent Evaluation Framework

#### Summary

This PR introduces a well-architected evaluation system for testing moo.md plugins. The implementation follows Anthropic's multi-agent patterns and provides valuable signal for improving skill trigger descriptions. The 40% initial pass rate (2/5 tests) is expected for Phase 1 and highlights areas for improvement.

#### ✅ Strengths

**Architecture & Design**
**Code Quality**
**Test Coverage**
#### 🔧 Issues & Recommendations

**Critical Issues**

**1. CHANGELOG.md Not Updated**

```markdown
### Added
- **eval**: Multi-agent evaluation framework for testing plugin auto-triggering
- **eval**: `/eval` command with category filtering and model selection
- **eval**: Trigger-evaluator and orchestrator agents
- **eval**: 5 test cases for core skills (40% initial pass rate)
```
**2. Multi-line YAML Warning Violation**
**3. Shell YAML Parsing Fragility**
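A minimal sketch of the fragility (the `prompt:` field name is assumed from the test-case descriptions, not taken from the PR): a line-oriented grep only sees the first physical line of a YAML value, so a block-scalar prompt is silently truncated.

```shell
#!/bin/sh
# Assumed field name: `prompt`. With a YAML block scalar, grep captures
# only the indicator line, not the prompt text itself.
cat > /tmp/case.yaml <<'EOF'
prompt: |
  Plan the next sprint for me
  and keep it under one page
EOF
grep '^prompt:' /tmp/case.yaml   # prints "prompt: |" - the prompt body is lost
```

A structure-aware parser such as `yq` avoids this class of bug entirely, which is why the review recommends it below.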
**Security Concerns**

**4. Command Injection Risk (Low)**
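One way to see the risk (the prompt text here is hypothetical): if extracted YAML text is interpolated into a command string, shell metacharacters in it get re-parsed as commands; passing the same text as a single argv element does not.

```shell
#!/bin/sh
# Hypothetical prompt text containing shell metacharacters.
prompt='summarize this; rm -rf ./results; echo'
# Unsafe (do not do this): building a command string re-parses the text,
# so the embedded `; rm -rf ...` would run as its own command:
#   sh -c "claude -p $prompt"
# Safe: hand the text over as one quoted argv element - it is never re-parsed.
printf '%s\n' "$prompt"
```

The safe form prints the string verbatim; nothing inside it is executed.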
**5. No Input Validation on Test Cases**
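A minimal presence check is cheap to add; note the key names (`prompt`, `expected_skill`) are illustrative assumptions, not the framework's actual schema.

```shell
#!/bin/sh
# Fail fast when a test-case YAML lacks required top-level keys.
# Key names here are hypothetical, not the PR's real schema.
case_file=/tmp/bad-case.yaml
printf 'prompt: write a PRD\n' > "$case_file"
for key in prompt expected_skill; do
  grep -q "^${key}:" "$case_file" || echo "missing required key: $key"
done
```

Running this against the deliberately incomplete file above reports the absent `expected_skill` key before any expensive CLI call is made.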
**Performance & Efficiency**

**6. Sequential Test Execution**
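If a later phase picks this up, plain shell job control is enough to fan the cases out; `run_case` below is a stand-in for the real per-test runner, and the case names are illustrative.

```shell
#!/bin/sh
# Launch each case as a background job, then wait for all of them.
# `run_case` is a placeholder for the real single-test runner.
run_case() { sleep 0.2; echo "done: $1"; }
for case_name in soul gate trace product writing; do
  run_case "$case_name" &
done
wait   # all five jobs finish together instead of back to back
```

Output order is nondeterministic, so results should be written to per-case files (as the `results/` layout already suggests) rather than interleaved on stdout.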
**7. No Timeout on Claude CLI Calls**
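coreutils `timeout(1)` is the usual fix here; exit status 124 distinguishes a deadline kill from the wrapped command's own failure. In this sketch a long `sleep` stands in for a hung CLI call.

```shell
#!/bin/sh
# `sleep 5` stands in for a hung `claude -p` call; the 1s deadline fires.
timeout 1 sleep 5
echo "exit=$?"   # 124 means the timeout, not the command, ended the run
```

A wrapper can branch on that status to record the case as TIMEOUT rather than FAIL.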
**Code Quality**

**8. Regex Parsing Brittleness**
```shell
VERDICT=$(echo "$EVAL_RESULT" | jq -r '.verdict // "UNKNOWN"')
```
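The `//` alternative operator is what makes the jq form above robust: when the evaluator's reply is valid JSON but lacks a `verdict` key, the script degrades to `UNKNOWN` instead of an empty string. A quick check:

```shell
#!/bin/sh
# Valid JSON, but no `verdict` key - jq's // operator supplies the default.
EVAL_RESULT='{"triggered": true}'
echo "$EVAL_RESULT" | jq -r '.verdict // "UNKNOWN"'   # prints UNKNOWN
```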
**9. Inconsistent Error Handling**
**10. `hope:soul` Special Case Hardcoded**
**Documentation**

**11. Missing Baseline Directory**
**12. Evaluator Agent Tool Access**
#### 🎯 Adherence to CLAUDE.md Conventions

**✅ Followed Correctly**
- Replace grep/sed with jq for JSON parsing (more robust)
- Fix `triggered` boolean output (was creating invalid JSON)
- Add timeout enforcement (120s for tests, 60s for evaluator)
- Remove undocumented `--parallel` flag from /eval command docs
### PR Review: Multi-Agent Evaluation Framework

#### Overview

This is a well-architected addition that introduces sophisticated testing infrastructure for validating moo.md plugin behavior. The multi-agent evaluation pattern is sound and follows good separation of concerns.

#### Strengths
#### Critical Issues

**1. CHANGELOG.md Missing (REQUIRED per CLAUDE.md)**

Add an entry under the `[Unreleased]` section.

**2. Command Injection Risk (HIGH PRIORITY)**

Location: `eval/scripts/run-eval.sh:117-144`

**3. YAML Parsing Fragility (MEDIUM)**

Location: `eval/scripts/run-eval.sh:68-75`

**4. Silent Error Handling (MEDIUM)**

Location: `eval/scripts/run-eval.sh:161-163`

#### Test Coverage

Current: 5 test cases for skill triggering (appropriate for Phase 1).

#### Performance

Sequential execution (~10 min for 5 tests) is acceptable for Phase 1. Consider parallel execution in Phase 2.

#### Recommended Actions Before Merge

MUST FIX:
SHOULD FIX:

NICE TO HAVE:

#### Final Verdict

**Approve with required changes.** Solid foundational work with sound architecture and high code quality.

Confidence: 85% reliable after fixing command injection.

Great work! This framework will be valuable for maintaining plugin quality.
Force-pushed from `a4df3e7` to `a2b03eb` (Compare)
### Superseded

Valuable parts extracted to separate PRs:
**What was removed**
The test cases are preserved in
Summary
What's Included
Evaluator Agents
Test Cases (5)
- `hope:soul` - planning tasks
- `hope:gate` - completion claims
- `hope:trace` - debugging/root cause
- `product:product` - PRD requests
- `wordsmith:writing` - edit requests

Infrastructure
- `/eval` command entry point
- `run-eval.sh` - full evaluation runner
- `run-spike.sh` - architecture validation

Initial Test Results
Pass Rate: 40% (2/5)
This provides valuable signal for improving skill trigger descriptions.
Test plan
Next Phases