Closed
13 changes: 12 additions & 1 deletion .github/workflows/eval.yml
@@ -46,14 +46,25 @@ jobs:
3. Process the prompt as if a user sent it (let skills auto-trigger naturally)
4. Self-evaluate your response against expected behaviors

## CRITICAL
- The skill MUST actually auto-trigger via the Skill tool
- Do NOT simulate or roleplay the skill behavior
- If the skill does not trigger, report VERDICT: FAIL

## Required Output Format
You MUST end your response with exactly one of these lines:
- `VERDICT: PASS` - if skill triggered AND all expected behaviors observed
- `VERDICT: PARTIAL` - if skill triggered but some behaviors missing
- `VERDICT: FAIL` - if skill did not trigger or wrong skill triggered

Before the verdict, explain your reasoning briefly.
claude_args: '--plugin-dir ./hope --plugin-dir ./product --plugin-dir ./wordsmith --plugin-dir ./founder --plugin-dir ./career'
plugin_marketplaces: ./
plugins: |
hope@moo.md
product@moo.md
wordsmith@moo.md
founder@moo.md
career@moo.md
show_full_output: true

- name: Check Verdict
5 changes: 5 additions & 0 deletions eval/README.md
@@ -65,3 +65,8 @@ Then add to CI matrix in `.github/workflows/eval.yml`.
## Schema

See `eval/schema.json` for the structured output format.

## CI Notes

- **First-time workflow**: When adding the eval workflow via PR, it won't run on that PR due to GitHub security validation. It starts working on subsequent PRs after merge.
- **Workflow changes**: The same applies when modifying `.github/workflows/eval.yml`: changes take effect only after merge.
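
The workflow prompt requires the response to end with exactly one `VERDICT: PASS`, `VERDICT: PARTIAL`, or `VERDICT: FAIL` line. The sketch below shows one way such a contract could be checked. It is illustrative only: `extract_verdict` is a hypothetical helper, not the code in the `Check Verdict` step, and it assumes the verdict must be the last non-empty line of the output.

```python
import re
from typing import Optional

# Hypothetical helper for illustration; not taken from the workflow itself.
# Accepts only the three verdicts named in the prompt's output format.
VERDICT_RE = re.compile(r"VERDICT: (PASS|PARTIAL|FAIL)")


def extract_verdict(output: str) -> Optional[str]:
    """Return the verdict from the last non-empty line, or None if absent."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    if not lines:
        return None
    # Assumption: the verdict must be the final line, matching "end your response with".
    match = VERDICT_RE.fullmatch(lines[-1])
    return match.group(1) if match else None


if __name__ == "__main__":
    sample = "The skill triggered and all behaviors were observed.\nVERDICT: PASS\n"
    print(extract_verdict(sample))
```

Treating a missing or malformed verdict as `None` lets the caller decide how strict to be, for example by mapping it to a failure.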