diff --git a/.github/workflows/eval.yml b/.github/workflows/eval.yml
index 7aaf3b1..4c7fd35 100644
--- a/.github/workflows/eval.yml
+++ b/.github/workflows/eval.yml
@@ -46,6 +46,11 @@ jobs:
             3. Process the prompt as if a user sent it (let skills auto-trigger naturally)
             4. Self-evaluate your response against expected behaviors
 
+            ## CRITICAL
+            - The skill MUST actually auto-trigger via the Skill tool
+            - Do NOT simulate or roleplay the skill behavior
+            - If the skill does not trigger, report VERDICT: FAIL
+
             ## Required Output Format
             You MUST end your response with exactly one of these lines:
             - `VERDICT: PASS` - if skill triggered AND all expected behaviors observed
@@ -53,7 +58,13 @@ jobs:
             - `VERDICT: FAIL` - if skill did not trigger or wrong skill triggered
 
             Before the verdict, explain your reasoning briefly.
-          claude_args: '--plugin-dir ./hope --plugin-dir ./product --plugin-dir ./wordsmith --plugin-dir ./founder --plugin-dir ./career'
+          plugin_marketplaces: ./
+          plugins: |
+            hope@moo.md
+            product@moo.md
+            wordsmith@moo.md
+            founder@moo.md
+            career@moo.md
           show_full_output: true
 
       - name: Check Verdict
diff --git a/eval/README.md b/eval/README.md
index 879c2e4..1122d6d 100644
--- a/eval/README.md
+++ b/eval/README.md
@@ -65,3 +65,8 @@ Then add to CI matrix in `.github/workflows/eval.yml`.
 ## Schema
 
 See `eval/schema.json` for the structured output format.
+
+## CI Notes
+
+- **First-time workflow**: When adding the eval workflow via PR, it won't run on that PR due to GitHub security validation. It starts working on subsequent PRs after merge.
+- **Workflow changes**: Same applies when modifying `.github/workflows/eval.yml` - changes only take effect after merge.