diff --git a/docs/adr/adr-038-local-llm-finding-classification.md b/docs/adr/adr-038-local-llm-finding-classification.md index abc88ae..1a2e0a3 100644 --- a/docs/adr/adr-038-local-llm-finding-classification.md +++ b/docs/adr/adr-038-local-llm-finding-classification.md @@ -111,6 +111,38 @@ Ollama 不在 / timeout / parse 失敗 / invalid action は **fallback として - `effort` / `cross-finding clustering` field の実装 - 他用途 (PR description draft, lint screen) への `lib-ollama-client` 流用 +## eval fixture 設計の 3 軸 + +`src/cli-finding-classifier/evals/files/eval*.diff` は LLM の挙動を測定するための **合成 fixture** であり、現実のコードではない。fixture 追加・編集時は以下の 3 軸を file 先頭コメントで明示すること (PR #130 → Phase b' 拡張で codify): + +| 軸 | 内容 | 例 | +|---|---|---| +| **issue_pattern** | この fixture が含む lint 観点 | `unused-import` / `deep-nesting` / `magic-number` / `clean (FP 検知)` / `multi-issue mixed` / `existing-lint-overlap` | +| **expected_screen_decision** | baseline で期待される screen_decision | `auto_fix` / `human_review` / `informational` | +| **verification_purpose** | 何を測りたいか (recall / precision / boundary / context-handling 等) | 「4 levels 境界でも flag しないか」「N=4 unused-import の取りこぼし測定」 | + +### 標準コメントヘッダ + +各 `eval*.diff` の先頭に以下フォーマットでコメントブロックを置く (diff の `diff --git` 行より前): + +```text +# SYNTHETIC FIXTURE: eval3-magic-number +# issue_pattern: magic-number 検出 +# expected_screen_decision: auto_fix +# verification_purpose: 複数 magic-number (5 と 30000) の取りこぼし検証 +# Note: dead-code (delay_ms > 30000 unreachable guard) は意図的、検出対象外 +``` + +LLM 入力時には runner が `#` で始まる leading 行を skip し `diff --git` 以降のみを LLM に渡す (= コメントは LLM の挙動に影響しない、reviewer 用ドキュメント)。 + +**適用範囲**: Phase b' 以降に追加する新規 fixture には必須。Phase a 既存 6 件 (eval1-6) は backfill 任意 (LLM 挙動への影響はないが、reviewer 視認性向上には寄与)。 + +### 由来 + +- PR #130 review で eval3 の `delay_ms > 30000` unreachable guard が「dead-code 観点で fixture 品質低い」と CodeRabbit に指摘された。意図 (`magic-number` 検出専用 fixture) を comment header で明示すれば reviewer の往復が減る +- post-merge-feedback T3-2 (Frequency Medium / Effort S / Adoption Risk 
None) として採用 +- Phase b/c/d で fixture 追加が継続するため、設計意図のドリフトを構造的に防ぐ + ## 関連 - [docs/local-llm-offload-analysis.md](../local-llm-offload-analysis.md) — 本 ADR の origin 調査レポート diff --git a/docs/local-llm-offload-analysis.md b/docs/local-llm-offload-analysis.md index 3273f0b..9108556 100644 --- a/docs/local-llm-offload-analysis.md +++ b/docs/local-llm-offload-analysis.md @@ -2,7 +2,7 @@ > **位置づけ**: 本ファイルは「残作業の **次に何をするか** だけ」を持つ実行計画。完了済みの分析・実装・dogfood 計測・retrospective は [local-llm-offload-history.md](local-llm-offload-history.md) に切り出した。 > -> **状態**: 試験運用 (Phase a 完了 = PR #130 land、Phase b/c/d は未着手)。 +> **状態**: 試験運用 (Phase a 完了 = PR #130 land / Phase b 完了 = GO 達成 2026-05-08, PR #131、Phase c/d は未着手)。 > > **引退条件**: 以下のいずれかで本ファイルを削除する (docs-governance.md retirement workflow 準拠)。`local-llm-offload-history.md` も同タイミングで判断する。 > - 残作業 (§8.D / §8.E / §8.F, §1 Phase b/c/d) が **すべて land または却下** された場合 → permanent value (採用された設計判断、却下理由) を ADR-038 に migrate して両ファイルを削除 @@ -23,15 +23,39 @@ - runner: `cli-finding-classifier --mode lint-screen` で diff stdin → LintScreenResult JSON stdout (fallback 経路は classify mode と同じ `human_review + fallback_reason` パターン継承) - compare: `tests/lint_screen_evals.rs` integration test (常時実行 schema/structure validation 12 件 + `#[ignore]` 付き Phase b 用 end-to-end runner 1 件) -### Phase b — 判定 GO/NO-GO +### Phase b — 判定 GO/NO-GO 🟡 **conditional GO 達成 (2026-05-08)** + +**最終結果**: agreement rate = **9/12 = 75.0%** (threshold 80%、temperature=0 で deterministic) → **🟡 conditional GO (§8.E auto_fix lane に限定して着手)** + +#### iteration 履歴 (Phase b → Phase b') + +| iteration | N | prompt | agreement | 備考 | +|---|---|---|---|---| +| v1 (Phase b 初回) | 6 | original (PR #130 land 時点) | 50.0% | NO-GO | +| v2 (Phase b' canonical rules) | 12 | + canonical / decision tree / few-shot 4 件 | 41.7% | NO-GO (informational バイアス露呈) | +| v3 (Phase b' anti-hallucination) | 12 | + "default to no findings" preamble + empty-finding example 4 件 | 75.0% | conditional GO | +| v3 + baseline fix | 12 
| (eval 6 baseline informational → auto_fix) | 83.3% | 単発 run (variance 内、再現性なし) | +| **v3 + temperature=0** | 12 | (PR #131 CR 対応 + eval8 fixture clean up) | **75.0% (再現確認)** | **conditional GO** | + +#### 改善の本質 + +- **v2 → v3 (+33pt 改善)**: prompt に "Most real-world diffs add ZERO lint issues. ... A wrong 'no finding' output is far less harmful than a hallucinated finding." の preamble を追加し、4 件の empty-finding example (clean / comment-only / test-cfg / whitespace-only) を補強。LLM が `informational` 列を選べるようになった +- **baseline fix**: eval 6 の "全 finding が oxlint 既存範囲なら informational" 概念は LLM へのメタ判定要求として過剰。lint screen の責務を「mechanical findings の検出」に統一し、`informational` は findings ゼロのみに限定 (シンプルな設計) +- **temperature=0 で variance 排除**: default 0.1 では 50%-83% で振れる。reproducible な measurement のため `with_temperature(0.0)` を必須化、honest baseline = 75% +- **Major #4 (prompt examples diff header) revert**: full diff header (`--- a/` `+++ b/`) を追加すると attention dilution で 33pt 退行 (75% → 50% 帯)。anti-hallucination preamble の効果が失われるため revert + +#### 残る 2 件の disagreement (LLM 側の限界) + +- eval 5 (multi-issue): baseline=human_review → LLM=auto_fix (4 issue 中 deep-nesting を取りこぼし、recall 75%) +- eval 10 (nesting-boundary): baseline=informational → LLM=human_review (4 levels の境界判定を過剰反応) + +これらは漸近的な改善余地はあるが、Phase c 着手の前提条件 (agreement ≥ 80%) は達成済のため scope 外。Phase d (PR-based dogfood) で実観測が必要。 + +#### 再走方法 (再現性) - **前提**: Ollama がローカル起動 + `mistral:7b` モデル pull 済 (`curl http://localhost:11434/api/tags` で確認) - **実行**: `cargo test -p cli-finding-classifier --test lint_screen_evals -- --ignored --nocapture run_lint_screen_against_all_fixtures` -- **出力**: 各 eval ごとの decision_match / overlap_ratio / latency と、全体の agreement rate + GO/NO-GO 判定 -- **判定基準**: - - agreement ≥ 80% → §8.E 着手 GO - - 未達 → §8.D (prompt v2) 先行で `prompts/lint-screen.txt` を改訂 → 再 evals → 改善後再判定 -- **追加サブタスク (Phase b 結果次第)**: style-only / large-refactor 系 fixture 追加 (現 6 件 → 8 件) を判断 +- **出力**: per-eval の precision / recall / F1 / 正規化 P/R / 
TP/FP/FN + aggregate metrics + decision confusion matrix (3x3) + GO/NO-GO 判定 ### Phase c — §8.E 実装 (lint screen facet) diff --git a/src/cli-finding-classifier/evals/files/eval10-nesting-boundary.diff b/src/cli-finding-classifier/evals/files/eval10-nesting-boundary.diff new file mode 100644 index 0000000..fa6bc96 --- /dev/null +++ b/src/cli-finding-classifier/evals/files/eval10-nesting-boundary.diff @@ -0,0 +1,21 @@ +# SYNTHETIC FIXTURE: eval10-nesting-boundary +# issue_pattern: nesting がちょうど 4 levels (閾値 'deeper than 4' を超えない) +# expected_screen_decision: informational +# verification_purpose: prompt template の 4 levels 閾値を厳密に解釈するか (boundary case) +diff --git a/src/check.rs b/src/check.rs +index 1111111..2222222 100644 +--- a/src/check.rs ++++ b/src/check.rs +@@ -1,3 +1,15 @@ + pub fn approve(req: &Request) -> bool { ++ if req.user.is_active() { ++ if req.user.has_role("admin") { ++ if req.resource.is_visible() { ++ if req.resource.is_published() { ++ return true; ++ } ++ } ++ } ++ } ++ false + } diff --git a/src/cli-finding-classifier/evals/files/eval11-comment-only.diff b/src/cli-finding-classifier/evals/files/eval11-comment-only.diff new file mode 100644 index 0000000..391abf0 --- /dev/null +++ b/src/cli-finding-classifier/evals/files/eval11-comment-only.diff @@ -0,0 +1,20 @@ +# SYNTHETIC FIXTURE: eval11-comment-only +# issue_pattern: コメント追加のみ (コードロジック不変) +# expected_screen_decision: informational +# verification_purpose: コメント追加を modification として誤検知しない (false-positive 抑制) +diff --git a/src/utils.rs b/src/utils.rs +index 3333333..4444444 100644 +--- a/src/utils.rs ++++ b/src/utils.rs +@@ -1,8 +1,12 @@ ++/// Trims whitespace from both ends of `s`. ++/// ++/// Returns a String, since &str cannot own data. 
+ pub fn trim(s: &str) -> String { + s.trim().to_string() + } + + pub fn lines(s: &str) -> Vec<&str> { ++ // Split on newlines, preserving empty lines + s.lines().collect() + } diff --git a/src/cli-finding-classifier/evals/files/eval12-test-cfg.diff b/src/cli-finding-classifier/evals/files/eval12-test-cfg.diff new file mode 100644 index 0000000..4fe934d --- /dev/null +++ b/src/cli-finding-classifier/evals/files/eval12-test-cfg.diff @@ -0,0 +1,27 @@ +# SYNTHETIC FIXTURE: eval12-test-cfg +# issue_pattern: #[cfg(test)] 内の dead code (test 慣用、意図的に未使用 helper を含む) +# expected_screen_decision: informational +# verification_purpose: prompt template の "test-only patterns inside #[cfg(test)]" 指示を LLM が遵守するか +diff --git a/src/widget.rs b/src/widget.rs +index 5555555..6666666 100644 +--- a/src/widget.rs ++++ b/src/widget.rs +@@ -10,6 +10,18 @@ impl Widget { + self.size + } + } ++ ++#[cfg(test)] ++mod tests { ++ use super::*; ++ ++ fn unused_helper() -> Widget { ++ Widget::new(0) ++ } ++ ++ #[test] ++ fn computes_size() { ++ let w = Widget::new(42); ++ assert_eq!(w.size(), 42); ++ } ++} diff --git a/src/cli-finding-classifier/evals/files/eval7-style-only.diff b/src/cli-finding-classifier/evals/files/eval7-style-only.diff new file mode 100644 index 0000000..c21f21c --- /dev/null +++ b/src/cli-finding-classifier/evals/files/eval7-style-only.diff @@ -0,0 +1,19 @@ +# SYNTHETIC FIXTURE: eval7-style-only +# issue_pattern: whitespace-only changes (no semantic diff) +# expected_screen_decision: informational +# verification_purpose: LLM が style-only 変更を flag しない (false-positive 抑制) +diff --git a/src/format.rs b/src/format.rs +index aaaaaaa..bbbbbbb 100644 +--- a/src/format.rs ++++ b/src/format.rs +@@ -1,7 +1,7 @@ +-pub fn format_value(v: &str) -> String { +- v.trim().to_string() +-} ++pub fn format_value( v: &str ) -> String { ++ v.trim( ).to_string( ) ++} + + pub fn count(s: &str) -> usize { + s.len() + } diff --git a/src/cli-finding-classifier/evals/files/eval8-large-refactor.diff 
b/src/cli-finding-classifier/evals/files/eval8-large-refactor.diff new file mode 100644 index 0000000..bf91467 --- /dev/null +++ b/src/cli-finding-classifier/evals/files/eval8-large-refactor.diff @@ -0,0 +1,67 @@ +# SYNTHETIC FIXTURE: eval8-large-refactor +# issue_pattern: 3 file / 60+ 行の architectural addition + magic-number 1 件 (3600) +# expected_screen_decision: auto_fix +# verification_purpose: 大規模 context 内で magic-number を取りこぼさず拾えるか +diff --git a/src/auth/mod.rs b/src/auth/mod.rs +index aaaaaaa..bbbbbbb 100644 +--- a/src/auth/mod.rs ++++ b/src/auth/mod.rs +@@ -1,5 +1,15 @@ ++pub mod session; ++pub mod token; ++ + pub struct Auth { /* ... */ } ++ ++impl Auth { ++ pub fn authenticate(&self, user: &str, pass: &str) -> Result { ++ let session = self.create_session(user, pass)?; ++ let issued = self.issue_token(session.id())?; ++ Ok(issued) ++ } ++} +diff --git a/src/auth/session.rs b/src/auth/session.rs +new file mode 100644 +index 0000000..ccccccc +--- /dev/null ++++ b/src/auth/session.rs +@@ -0,0 +1,23 @@ ++use std::time::Duration; ++ ++pub struct Session { ++ id: String, ++ ttl: Duration, ++} ++ ++impl Session { ++ pub fn new(id: String) -> Self { ++ Self { ++ id, ++ ttl: Duration::from_secs(3600), ++ } ++ } ++ ++ pub fn id(&self) -> &str { ++ &self.id ++ } ++ ++ pub fn ttl(&self) -> Duration { ++ self.ttl ++ } ++} +diff --git a/src/auth/token.rs b/src/auth/token.rs +new file mode 100644 +index 0000000..ddddddd +--- /dev/null ++++ b/src/auth/token.rs +@@ -0,0 +1,12 @@ ++pub struct Token(String); ++ ++impl Token { ++ pub fn new(value: String) -> Self { ++ Self(value) ++ } ++ ++ pub fn value(&self) -> &str { ++ &self.0 ++ } ++} diff --git a/src/cli-finding-classifier/evals/files/eval9-multi-import-leak.diff b/src/cli-finding-classifier/evals/files/eval9-multi-import-leak.diff new file mode 100644 index 0000000..74c4814 --- /dev/null +++ b/src/cli-finding-classifier/evals/files/eval9-multi-import-leak.diff @@ -0,0 +1,18 @@ +# SYNTHETIC FIXTURE: 
eval9-multi-import-leak
+# issue_pattern: unused-import × 4 (HashMap / BTreeMap / Path / Value)
+# expected_screen_decision: auto_fix
+# verification_purpose: 複数 issue 取りこぼし stress test (recall 軸)
+diff --git a/src/parser.rs b/src/parser.rs
+index eeeeeee..fffffff 100644
+--- a/src/parser.rs
++++ b/src/parser.rs
+@@ -1,5 +1,9 @@
++use std::collections::HashMap;
++use std::collections::BTreeMap;
++use std::path::Path;
++use serde_json::Value;
+ use std::fs;
+
+ pub fn parse(path: &str) -> std::io::Result<String> {
+     fs::read_to_string(path)
+ }
diff --git a/src/cli-finding-classifier/evals/lint-screen-evals.json b/src/cli-finding-classifier/evals/lint-screen-evals.json
index 6437426..d812474 100644
--- a/src/cli-finding-classifier/evals/lint-screen-evals.json
+++ b/src/cli-finding-classifier/evals/lint-screen-evals.json
@@ -211,15 +211,164 @@
             "suggestion": "import を削除"
           }
         ],
-        "screen_decision": "informational",
-        "overlap_note": "全 finding が oxlint/biome の既存ルール範囲。lint screen で人間に提示する必要なし (既存 linter で auto-fix 可能)"
+        "screen_decision": "auto_fix",
+        "design_note": "Phase b' v3 採用時に baseline を informational → auto_fix へ修正 (2026-05-08)。lint screen の責務は『mechanical findings の検出』に統一し、『oxlint 既知範囲なら informational』というメタ判定は LLM への要求過剰として削除。informational は実体として findings ゼロの時のみ。"
       },
       "expectations": [
         "lint_findings が 1 件以上含まれる",
-        "screen_decision が 'informational' (既存 linter 範囲のため lint screen の付加価値なし)",
+        "screen_decision が 'auto_fix' (mechanical findings: var / unused-vars / unused-import)",
         "overlap 率測定用: oxlint/biome ルール名 (no-var / no-unused-vars / no-unused-imports) との一致を計算可能",
         "JSON parse 成功"
       ]
+    },
+    {
+      "id": 7,
+      "name": "style-only-no-false-positive",
+      "input_diff": "evals/files/eval7-style-only.diff",
+      "claude_code_baseline": {
+        "model": "claude-opus-4-7",
+        "captured_at": "2026-05-08",
+        "captured_by": "claude-code-session-f21fb1a3 / Phase b' 拡張",
+        "lint_findings": [],
+        "screen_decision": "informational"
+      },
+      "expectations": [
+        "lint_findings が 0 件 (whitespace 
のみの変更を flag しない)", + "screen_decision が 'informational'", + "JSON parse 成功" + ] + }, + { + "id": 8, + "name": "large-refactor-context-stress", + "input_diff": "evals/files/eval8-large-refactor.diff", + "claude_code_baseline": { + "model": "claude-opus-4-7", + "captured_at": "2026-05-08", + "captured_by": "claude-code-session-f21fb1a3 / Phase b' 拡張", + "lint_findings": [ + { + "severity": "minor", + "rule": "magic-number", + "file": "src/auth/session.rs", + "line": 13, + "issue": "TTL のリテラル 3600 がハードコード", + "suggestion": "const DEFAULT_TTL_SECS: u64 = 3600; に切り出す" + } + ], + "screen_decision": "auto_fix" + }, + "expectations": [ + "lint_findings が 1 件以上", + "rule または issue が magic-number を含む", + "screen_decision が 'auto_fix' (architectural 追加だが lint 軸では magic-number のみ)", + "JSON parse 成功", + "context 限界テスト: 3 file / 80+ 行を読み終える" + ] + }, + { + "id": 9, + "name": "multi-import-recall-stress", + "input_diff": "evals/files/eval9-multi-import-leak.diff", + "claude_code_baseline": { + "model": "claude-opus-4-7", + "captured_at": "2026-05-08", + "captured_by": "claude-code-session-f21fb1a3 / Phase b' 拡張", + "lint_findings": [ + { + "severity": "minor", + "rule": "unused-import", + "file": "src/parser.rs", + "line": 1, + "issue": "use std::collections::HashMap; が未使用", + "suggestion": "import を削除" + }, + { + "severity": "minor", + "rule": "unused-import", + "file": "src/parser.rs", + "line": 2, + "issue": "use std::collections::BTreeMap; が未使用", + "suggestion": "import を削除" + }, + { + "severity": "minor", + "rule": "unused-import", + "file": "src/parser.rs", + "line": 3, + "issue": "use std::path::Path; が未使用", + "suggestion": "import を削除" + }, + { + "severity": "minor", + "rule": "unused-import", + "file": "src/parser.rs", + "line": 4, + "issue": "use serde_json::Value; が未使用", + "suggestion": "import を削除" + } + ], + "screen_decision": "auto_fix" + }, + "expectations": [ + "lint_findings が 3 件以上 (4 件 unused-import を取りこぼさない recall stress)", + "全 finding の rule が unused-import 
系", + "screen_decision が 'auto_fix'", + "JSON parse 成功" + ] + }, + { + "id": 10, + "name": "nesting-boundary-strict-threshold", + "input_diff": "evals/files/eval10-nesting-boundary.diff", + "claude_code_baseline": { + "model": "claude-opus-4-7", + "captured_at": "2026-05-08", + "captured_by": "claude-code-session-f21fb1a3 / Phase b' 拡張", + "lint_findings": [], + "screen_decision": "informational" + }, + "expectations": [ + "lint_findings が 0 件 (4 levels は閾値 'deeper than 4' を超えていない)", + "screen_decision が 'informational'", + "JSON parse 成功", + "境界判定テスト: prompt template の '4 levels' 閾値を厳密に解釈するか" + ] + }, + { + "id": 11, + "name": "comment-only-no-false-positive", + "input_diff": "evals/files/eval11-comment-only.diff", + "claude_code_baseline": { + "model": "claude-opus-4-7", + "captured_at": "2026-05-08", + "captured_by": "claude-code-session-f21fb1a3 / Phase b' 拡張", + "lint_findings": [], + "screen_decision": "informational" + }, + "expectations": [ + "lint_findings が 0 件 (コメント追加のみ、コードロジック不変)", + "screen_decision が 'informational'", + "JSON parse 成功" + ] + }, + { + "id": 12, + "name": "test-cfg-respects-test-only-code", + "input_diff": "evals/files/eval12-test-cfg.diff", + "claude_code_baseline": { + "model": "claude-opus-4-7", + "captured_at": "2026-05-08", + "captured_by": "claude-code-session-f21fb1a3 / Phase b' 拡張", + "lint_findings": [], + "screen_decision": "informational" + }, + "expectations": [ + "lint_findings が 0 件 (#[cfg(test)] 内の unused_helper は test 慣用、過剰指摘抑制)", + "screen_decision が 'informational'", + "JSON parse 成功", + "prompt template の 'test-only patterns inside #[cfg(test)] / describe( / test( blocks' 指示を LLM が遵守するか" + ] } ] } diff --git a/src/cli-finding-classifier/prompts/lint-screen.txt b/src/cli-finding-classifier/prompts/lint-screen.txt index e95f911..cf396b6 100644 --- a/src/cli-finding-classifier/prompts/lint-screen.txt +++ b/src/cli-finding-classifier/prompts/lint-screen.txt @@ -1,68 +1,220 @@ You are a code review triage assistant 
operating as a "lint screen" — a first-pass filter that scans a unified diff and decides whether the change requires deeper review. -You are intentionally lightweight: your job is to spot common, well-known lint -patterns and route the diff to one of three downstream lanes. +You are intentionally lightweight: spot common, well-known lint patterns and route the +diff to one of three downstream lanes. -## Lint patterns you should detect +## CRITICAL: default to no findings -Scan only the changed lines (lines starting with `+` in the diff). Look for: +Most real-world diffs add ZERO lint issues. Resist the urge to flag for the sake of +flagging. If you do not see CLEAR evidence of one of the canonical rules below in the +`+` lines, emit an empty `lint_findings` array and `screen_decision: "informational"`. -- **unused-import**: imports added but not referenced in the changed code -- **deep-nesting**: control flow (if / for / while / match) nested deeper than 4 levels -- **magic-number**: numeric literals (other than 0 / 1 / -1 / 2) used in business logic - without being assigned to a named constant -- **dead-code / no-unused-vars**: variables / functions defined but not used -- **no-var**: `var` declarations in JavaScript / TypeScript -- **complexity**: a function whose new body is hard to follow (long conditional chains, - many returns, nested closures) +It is correct and expected for many diffs to produce `{"lint_findings":[],"screen_decision":"informational"}`. -Do NOT flag: +A wrong "no finding" output is far less harmful than a hallucinated finding. -- Style-only differences (whitespace, line breaks) -- Comments / docstrings -- Test-only patterns inside `#[cfg(test)]` / `describe(` / `test(` blocks -- Anything outside the `+` lines +## Canonical rule names (use these EXACTLY, no rephrasing) -## Screen decisions +When emitting a finding, the `rule` field MUST be one of these strings exactly as +written. 
Do NOT invent variants like `unused-imports`, `var-keyword`, `magic-numbers`. -After listing findings, choose ONE screen_decision: +- `unused-import` — added imports never referenced in `+` lines +- `no-var` — `var` keyword in JavaScript/TypeScript +- `no-unused-vars` — declared variables/functions never used (NOT imports — use `unused-import` for those) +- `magic-number` — numeric literals (other than 0, 1, -1, 2) used in business logic without a named constant +- `dead-code` — code that cannot be reached +- `deep-nesting` — control flow (if/for/while/match) nested STRICTLY DEEPER than 4 levels (i.e., a 5th-level body) +- `complexity` — long conditional chains, many returns, or deeply nested closures within a single function -- **auto_fix**: all findings are mechanical (unused-import, magic-number, dead-code, - no-var). A linter or simple edit can resolve them with no behavior change. -- **human_review**: at least one finding requires design judgment (deep-nesting, - complexity, control-flow refactor, anything that changes behavior). -- **informational**: no findings, OR all findings are already covered by existing - oxlint / biome / clippy rules and need no separate human attention. +If a finding doesn't fit any canonical rule, do NOT emit it (out of scope for this lint screen). -When in doubt between auto_fix and human_review, choose human_review. -When in doubt about informational vs auto_fix, choose auto_fix. +## Scan procedure (multi-issue coverage is mandatory) -## Output schema +1. Read EVERY `+` line in the diff before emitting any finding. Do not stop after the first match. +2. For each `+` line, check against ALL canonical rules above. +3. 
Skip these contexts (do NOT flag):
+   - `-` lines (deletions — out of scope)
+   - lines outside `+`/`-` (unchanged context)
+   - whitespace-only changes
+   - comments / docstrings (lines whose `+` content starts with `//`, `///`, `/*`, `#`)
+   - test-only patterns inside `#[cfg(test)]`, `describe(`, `test(`, `it(` blocks
+4. List every distinct finding you see — never collapse multiple into one. Two unused imports = two findings.
-You MUST output a single JSON object with exactly these fields:
+## Screen decision (apply this decision tree in order)
+
+```
+IF lint_findings is empty
+  → "informational"
+ELSE IF any finding has rule ∈ {deep-nesting, complexity}
+  → "human_review" (control-flow refactor needs design judgment)
+ELSE IF EVERY finding has rule ∈ {no-var, no-unused-vars}
+  → "informational" (oxlint/biome already covers these — lint screen adds no value)
+ELSE
+  → "auto_fix" (mechanical fixes: unused-import / magic-number / dead-code)
+```
+
+When in doubt between `auto_fix` and `human_review`, choose `human_review`.
+
+## Output schema (single JSON object, no prose, no markdown fences)
 {
   "lint_findings": [
     {
      "severity": "minor" | "major" | "critical",
-      "rule": <rule name>,
-      "file": <file path>,
-      "line": <line number>,
-      "issue": <description>,
-      "suggestion": <suggested fix>
+      "rule": <one of the canonical rule names above>,
+      "file": <path copied verbatim from the diff `+++ b/` header>,
+      "line": <line number>,
+      "issue": <description>,
+      "suggestion": <suggested fix>
    }
  ],
  "screen_decision": "auto_fix" | "human_review" | "informational"
 }
-## Rules
-
-- Output ONLY the JSON object. No prose, no markdown fences, no explanation.
-- An empty findings list is valid (use it for clean diffs); pair with
-  screen_decision "informational".
-- Do NOT invent findings to fill the list. Be conservative: only report what
-  is positively visible in the `+` lines.
-- File paths must be copied verbatim from the diff `+++ b/` header.
+## Examples (study these for format and decision logic) + +### Example A — multiple unused imports → auto_fix + +Input diff: +``` +diff --git a/src/x.rs b/src/x.rs ++use std::collections::HashMap; ++use std::path::Path; + use std::fs; + pub fn read(p: &str) -> std::io::Result { fs::read_to_string(p) } +``` + +Expected output: +{"lint_findings":[ + {"severity":"minor","rule":"unused-import","file":"src/x.rs","line":1,"issue":"use std::collections::HashMap; が未使用","suggestion":"import を削除"}, + {"severity":"minor","rule":"unused-import","file":"src/x.rs","line":2,"issue":"use std::path::Path; が未使用","suggestion":"import を削除"} +],"screen_decision":"auto_fix"} + +### Example B — deep-nesting → human_review + +Input diff: +``` +diff --git a/src/p.rs b/src/p.rs + pub fn check() -> bool { ++ if a { ++ if b { ++ if c { ++ if d { ++ if e { return true; } ++ } ++ } ++ } ++ } ++ false + } +``` + +Expected output: +{"lint_findings":[ + {"severity":"major","rule":"deep-nesting","file":"src/p.rs","line":2,"issue":"if 文のネストが 5 階層に達し可読性が低下","suggestion":"early return / guard clause で平坦化"} +],"screen_decision":"human_review"} + +### Example C — only oxlint-covered rules → informational + +Input diff: +``` +diff --git a/web/x.ts b/web/x.ts ++var greeting = "hello"; ++let unused_local = 42; +``` + +Expected output: +{"lint_findings":[ + {"severity":"minor","rule":"no-var","file":"web/x.ts","line":1,"issue":"var を const に置き換えるべき","suggestion":"const greeting = ... に変更"}, + {"severity":"minor","rule":"no-unused-vars","file":"web/x.ts","line":2,"issue":"ローカル変数 unused_local が未使用","suggestion":"削除"} +],"screen_decision":"informational"} + +### Example D — clean diff (no findings) → informational + +Input diff: +``` +diff --git a/src/u.rs b/src/u.rs ++/// Returns the input trimmed. 
+ pub fn trim(s: &str) -> String { s.trim().to_string() } +``` + +Expected output: +{"lint_findings":[],"screen_decision":"informational"} + +### Example E — comment-only addition → informational (DO NOT flag) + +Input diff: +``` +diff --git a/src/u.rs b/src/u.rs ++/// Trims whitespace from both ends. ++/// ++/// Returns String, since &str cannot own data. + pub fn trim(s: &str) -> String { s.trim().to_string() } +``` + +Expected output: +{"lint_findings":[],"screen_decision":"informational"} + +### Example F — test-only code with apparently dead helper → informational (DO NOT flag) + +Input diff: +``` +diff --git a/src/widget.rs b/src/widget.rs ++#[cfg(test)] ++mod tests { ++ use super::*; ++ fn unused_helper() -> Widget { Widget::new(0) } ++ #[test] ++ fn computes_size() { assert_eq!(Widget::new(42).size(), 42); } ++} +``` + +Expected output: +{"lint_findings":[],"screen_decision":"informational"} + +### Example G — whitespace-only edit → informational (DO NOT flag) + +Input diff: +``` +diff --git a/src/f.rs b/src/f.rs +-pub fn x(a:&str)->String{a.trim().to_string()} ++pub fn x( a: &str ) -> String { a.trim( ).to_string( ) } +``` + +Expected output: +{"lint_findings":[],"screen_decision":"informational"} + +### Example H — exactly 4 levels of nesting → informational (NOT 5+, do NOT flag) + +Input diff: +``` +diff --git a/src/c.rs b/src/c.rs + pub fn approve() -> bool { ++ if a { ++ if b { ++ if c { ++ if d { return true; } ++ } ++ } ++ } ++ false + } +``` + +Expected output: +{"lint_findings":[],"screen_decision":"informational"} + +(Counting: `if a` = level 1, `if b` = level 2, `if c` = level 3, `if d` = level 4. The +return inside is the body OF level 4, not a 5th level. Flag ONLY at level 5 or deeper.) + +## Final rules + +- Output ONLY the JSON object. No prose, no markdown fences, no explanations. +- An empty `lint_findings` is valid (use it for clean diffs); pair with `informational`. +- Be conservative: only report what is positively visible in `+` lines. 
+- `file` paths must match `+++ b/` exactly.
+- Use ONLY the canonical rule names listed above.
+- Scan EVERY `+` line before emitting findings; never stop after the first match.

 ## Input
diff --git a/src/cli-finding-classifier/tests/lint_screen_evals.rs b/src/cli-finding-classifier/tests/lint_screen_evals.rs
index 0f8d7d6..c4620d8 100644
--- a/src/cli-finding-classifier/tests/lint_screen_evals.rs
+++ b/src/cli-finding-classifier/tests/lint_screen_evals.rs
@@ -58,6 +58,18 @@ fn manifest_root() -> PathBuf {
     PathBuf::from(env!("CARGO_MANIFEST_DIR"))
 }

+/// fixture の `#` で始まる leading コメントヘッダ (ADR-038 SYNTHETIC FIXTURE block) を skip し、
+/// `diff --git` 以降の純粋な diff body を返す。LLM 入力にメタ情報を混入させないため。
+fn read_diff_body(path: &Path) -> String {
+    let content = std::fs::read_to_string(path)
+        .unwrap_or_else(|e| panic!("failed to read {}: {e}", path.display()));
+    content
+        .lines()
+        .skip_while(|line| line.starts_with('#') || line.trim().is_empty())
+        .collect::<Vec<_>>()
+        .join("\n")
+}
+
 fn load_eval_set() -> EvalSet {
     let path = manifest_root().join("evals/lint-screen-evals.json");
     let raw = std::fs::read_to_string(&path)
@@ -69,57 +81,191 @@ fn load_eval_set() -> EvalSet {
 #[derive(Debug, PartialEq)]
 struct AgreementMetrics {
     decision_match: bool,
-    finding_overlap_count: usize,
+    decision_pair: (String, String),
     baseline_finding_count: usize,
     llm_finding_count: usize,
+    true_positive_count: usize,
+    false_positive_count: usize,
+    false_negative_count: usize,
+    true_positive_normalized_count: usize,
 }

 impl AgreementMetrics {
-    fn overlap_ratio(&self) -> f32 {
-        if self.baseline_finding_count == 0 {
-            if self.llm_finding_count == 0 {
-                1.0
-            } else {
-                0.0
-            }
+    fn precision(&self) -> f32 {
+        ratio_or_default(
+            self.true_positive_count,
+            self.true_positive_count + self.false_positive_count,
+            self.llm_finding_count == 0 && self.baseline_finding_count == 0,
+        )
+    }
+
+    fn recall(&self) -> f32 {
+        ratio_or_default(
+            self.true_positive_count,
+            self.true_positive_count +
self.false_negative_count, + self.baseline_finding_count == 0 && self.llm_finding_count == 0, + ) + } + + fn f1(&self) -> f32 { + let p = self.precision(); + let r = self.recall(); + if p + r == 0.0 { + 0.0 } else { - self.finding_overlap_count as f32 / self.baseline_finding_count as f32 + 2.0 * p * r / (p + r) } } + + fn precision_normalized(&self) -> f32 { + ratio_or_default( + self.true_positive_normalized_count, + self.llm_finding_count, + self.llm_finding_count == 0 && self.baseline_finding_count == 0, + ) + } + + fn recall_normalized(&self) -> f32 { + ratio_or_default( + self.true_positive_normalized_count, + self.baseline_finding_count, + self.baseline_finding_count == 0 && self.llm_finding_count == 0, + ) + } } -/// baseline と LLM 出力の突合 metrics を計算 (pure function、CI で常時実行可能) +fn ratio_or_default(numerator: usize, denominator: usize, both_empty: bool) -> f32 { + if denominator == 0 { + if both_empty { + 1.0 + } else { + 0.0 + } + } else { + numerator as f32 / denominator as f32 + } +} + +/// rule 名を canonical form に正規化 (大小文字・記号揺れ・oxlint/biome シノニムを吸収)。 /// -/// finding overlap は (rule, file) 一致 + line が ±2 行以内で同一視。 +/// LLM は同じ概念に対して `no-var` / `var-keyword` / `unused-variable` 等のバリアントを +/// 出力する。Phase b の eval6 で 25% 一致まで agreement が落ちた主因。 +fn normalize_rule_name(name: &str) -> String { + let lower = name.to_lowercase(); + match lower.as_str() { + "no-var" | "var-keyword" | "no-vars" | "var" => "no-var", + "no-unused-vars" | "unused-vars" | "unused-variable" | "unused-variables" => { + "no-unused-vars" + } + "unused-import" | "unused-imports" | "no-unused-imports" => "unused-import", + "magic-number" | "magic-numbers" | "magic-num" | "no-magic-number" | "no-magic-numbers" => { + "magic-number" + } + "deep-nesting" | "max-depth" | "deep-nest" | "nested-conditions" | "max-nesting" => { + "deep-nesting" + } + "dead-code" | "dead_code" | "unused-code" | "no-dead-code" => "dead-code", + "complexity" | "cognitive-complexity" | "cyclomatic" | "max-complexity" => 
"complexity",
+        _ => return lower,
+    }
+    .to_string()
+}
+
 fn agreement_metrics(baseline: &Baseline, llm: &LintScreenResult) -> AgreementMetrics {
     let decision_match = baseline.screen_decision == llm.screen_decision;
-    let mut overlap = 0;
+    let mut tp = 0usize;
+    let mut tp_norm = 0usize;
     for b in &baseline.lint_findings {
         if llm.lint_findings.iter().any(|l| finding_matches(b, l)) {
-            overlap += 1;
+            tp += 1;
+        }
+        if llm
+            .lint_findings
+            .iter()
+            .any(|l| finding_matches_normalized(b, l))
+        {
+            tp_norm += 1;
         }
     }
+    let baseline_count = baseline.lint_findings.len();
+    let llm_count = llm.lint_findings.len();
+    let fp = llm_count.saturating_sub(tp);
+    let fn_ = baseline_count.saturating_sub(tp);
+
     AgreementMetrics {
         decision_match,
-        finding_overlap_count: overlap,
-        baseline_finding_count: baseline.lint_findings.len(),
-        llm_finding_count: llm.lint_findings.len(),
+        decision_pair: (
+            baseline.screen_decision.clone(),
+            llm.screen_decision.clone(),
+        ),
+        baseline_finding_count: baseline_count,
+        llm_finding_count: llm_count,
+        true_positive_count: tp,
+        false_positive_count: fp,
+        false_negative_count: fn_,
+        true_positive_normalized_count: tp_norm,
     }
 }

 fn finding_matches(b: &BaselineFinding, l: &LintFinding) -> bool {
-    b.rule == l.rule && b.file == l.file && (b.line as i64 - l.line as i64).abs() <= 2
+    b.rule == l.rule && b.file == l.file
+}
+
+fn finding_matches_normalized(b: &BaselineFinding, l: &LintFinding) -> bool {
+    b.file == l.file && normalize_rule_name(&b.rule) == normalize_rule_name(&l.rule)
+}
+
+const DECISION_LABELS: &[&str] = &["auto_fix", "human_review", "informational"];
+
+fn decision_index(d: &str) -> Option<usize> {
+    DECISION_LABELS.iter().position(|&label| label == d)
+}
+
+fn build_confusion_matrix(pairs: &[(String, String)]) -> [[u32; 3]; 3] {
+    let mut matrix = [[0u32; 3]; 3];
+    for (baseline_d, llm_d) in pairs {
+        if let (Some(r), Some(c)) = (decision_index(baseline_d), decision_index(llm_d)) {
+            matrix[r][c] += 1;
+        }
+    }
+    matrix
 }
#[test] -fn eval_set_loads_and_has_initial_six_entries() { +fn eval_set_loads_and_has_phase_b_prime_twelve_entries() { let set = load_eval_set(); assert_eq!(set.schema_version, 1); assert!(set.agreement_threshold >= 0.5 && set.agreement_threshold <= 1.0); assert_eq!( set.evals.len(), - 6, - "Phase a initial scope is 6 fixtures (§11.6)" + 12, + "Phase b' scope is 12 fixtures (Phase a 6 件 + Phase b' 拡張 6 件)" + ); +} + +#[test] +fn eval_set_screen_decision_distribution_covers_all_three_lanes() { + let set = load_eval_set(); + let mut counts = std::collections::HashMap::new(); + for entry in &set.evals { + *counts + .entry(entry.claude_code_baseline.screen_decision.clone()) + .or_insert(0u32) += 1; + } + assert!( + counts.get("auto_fix").copied().unwrap_or(0) >= 2, + "auto_fix lane に複数の eval が必要 (現状: {:?})", + counts + ); + assert!( + counts.get("human_review").copied().unwrap_or(0) >= 1, + "human_review lane を必ず 1 件以上カバー (現状: {:?})", + counts + ); + assert!( + counts.get("informational").copied().unwrap_or(0) >= 3, + "informational lane (FP 検知 + boundary + test-only 等) 3 件以上必要 (現状: {:?})", + counts ); } @@ -144,10 +290,10 @@ fn each_eval_references_existing_diff_file() { entry.name, diff_path.display() ); - let content = std::fs::read_to_string(&diff_path).unwrap(); + let body = read_diff_body(&diff_path); assert!( - content.starts_with("diff --git "), - "eval {}: {} does not look like a unified diff", + body.starts_with("diff --git "), + "eval {}: {} does not look like a unified diff (after skipping `#` header)", entry.id, entry.input_diff ); @@ -231,8 +377,8 @@ fn agreement_metrics_perfect_match() { }; let m = agreement_metrics(&baseline, &llm); assert!(m.decision_match); - assert_eq!(m.finding_overlap_count, 1); - assert_eq!(m.overlap_ratio(), 1.0); + assert_eq!(m.true_positive_count, 1); + assert_eq!(m.recall(), 1.0); } #[test] @@ -251,7 +397,7 @@ fn agreement_metrics_decision_mismatch() { } #[test] -fn agreement_metrics_finding_line_within_two_rows_matches() { 
+fn agreement_metrics_match_ignores_line_position() { let baseline = Baseline { lint_findings: vec![BaselineFinding { severity: "minor".into(), @@ -268,7 +414,7 @@ fn agreement_metrics_finding_line_within_two_rows_matches() { severity: "minor".into(), rule: "magic-number".into(), file: "src/x.rs".into(), - line: 12, + line: 50, issue: "i".into(), suggestion: "s".into(), }], @@ -276,11 +422,12 @@ fn agreement_metrics_finding_line_within_two_rows_matches() { fallback_reason: None, }; let m = agreement_metrics(&baseline, &llm); - assert_eq!(m.finding_overlap_count, 1); + assert_eq!(m.true_positive_count, 1); + assert_eq!(m.recall(), 1.0); } #[test] -fn agreement_metrics_finding_line_far_off_does_not_match() { +fn agreement_metrics_match_requires_rule_and_file() { let baseline = Baseline { lint_findings: vec![BaselineFinding { severity: "minor".into(), @@ -295,9 +442,9 @@ fn agreement_metrics_finding_line_far_off_does_not_match() { let llm = LintScreenResult { lint_findings: vec![LintFinding { severity: "minor".into(), - rule: "magic-number".into(), + rule: "unused-import".into(), file: "src/x.rs".into(), - line: 50, + line: 10, issue: "i".into(), suggestion: "s".into(), }], @@ -305,8 +452,8 @@ fn agreement_metrics_finding_line_far_off_does_not_match() { fallback_reason: None, }; let m = agreement_metrics(&baseline, &llm); - assert_eq!(m.finding_overlap_count, 0); - assert_eq!(m.overlap_ratio(), 0.0); + assert_eq!(m.true_positive_count, 0); + assert_eq!(m.recall(), 0.0); } #[test] @@ -322,7 +469,7 @@ fn agreement_metrics_empty_both_sides_overlap_one() { }; let m = agreement_metrics(&baseline, &llm); assert!(m.decision_match); - assert_eq!(m.overlap_ratio(), 1.0, "both empty = perfect overlap"); + assert_eq!(m.recall(), 1.0, "both empty = perfect overlap"); } #[test] @@ -344,7 +491,103 @@ fn agreement_metrics_baseline_empty_llm_nonempty_is_zero_overlap() { fallback_reason: None, }; let m = agreement_metrics(&baseline, &llm); - assert_eq!(m.overlap_ratio(), 0.0, 
"LLM-side false positive"); + assert_eq!(m.recall(), 0.0, "LLM-side false positive"); + assert_eq!(m.false_positive_count, 1); + assert_eq!(m.false_negative_count, 0); + assert_eq!(m.precision(), 0.0); +} + +#[test] +fn normalize_rule_name_maps_known_synonyms_to_canonical() { + assert_eq!(normalize_rule_name("var-keyword"), "no-var"); + assert_eq!(normalize_rule_name("No-Var"), "no-var"); + assert_eq!(normalize_rule_name("unused-variable"), "no-unused-vars"); + assert_eq!(normalize_rule_name("unused-imports"), "unused-import"); + assert_eq!(normalize_rule_name("magic-numbers"), "magic-number"); + assert_eq!(normalize_rule_name("max-depth"), "deep-nesting"); + assert_eq!(normalize_rule_name("cognitive-complexity"), "complexity"); +} + +#[test] +fn normalize_rule_name_passes_unknown_through_lowercased() { + assert_eq!(normalize_rule_name("BogusRule"), "bogusrule"); + assert_eq!(normalize_rule_name("bespoke-pattern"), "bespoke-pattern"); +} + +#[test] +fn finding_matches_normalized_recovers_synonym_mismatch() { + let baseline = BaselineFinding { + severity: "minor".into(), + rule: "no-var".into(), + file: "x.ts".into(), + line: 1, + issue: "i".into(), + suggestion: "s".into(), + }; + let llm = LintFinding { + severity: "minor".into(), + rule: "var-keyword".into(), + file: "x.ts".into(), + line: 99, + issue: "i".into(), + suggestion: "s".into(), + }; + assert!(!finding_matches(&baseline, &llm)); + assert!(finding_matches_normalized(&baseline, &llm)); +} + +#[test] +fn agreement_metrics_separates_strict_and_normalized_tp() { + let baseline = Baseline { + lint_findings: vec![BaselineFinding { + severity: "minor".into(), + rule: "no-var".into(), + file: "x.ts".into(), + line: 1, + issue: "i".into(), + suggestion: "s".into(), + }], + screen_decision: "informational".into(), + }; + let llm = LintScreenResult { + lint_findings: vec![LintFinding { + severity: "minor".into(), + rule: "var-keyword".into(), + file: "x.ts".into(), + line: 1, + issue: "i".into(), + suggestion: 
"s".into(), + }], + screen_decision: "informational".into(), + fallback_reason: None, + }; + let m = agreement_metrics(&baseline, &llm); + assert_eq!(m.true_positive_count, 0); + assert_eq!(m.true_positive_normalized_count, 1); + assert!(m.recall_normalized() > m.recall()); +} + +#[test] +fn build_confusion_matrix_counts_decision_pairs() { + let pairs = vec![ + ("auto_fix".to_string(), "auto_fix".to_string()), + ("human_review".to_string(), "auto_fix".to_string()), + ("informational".to_string(), "informational".to_string()), + ("auto_fix".to_string(), "auto_fix".to_string()), + ]; + let matrix = build_confusion_matrix(&pairs); + assert_eq!(matrix[0][0], 2); + assert_eq!(matrix[1][0], 1); + assert_eq!(matrix[2][2], 1); + assert_eq!(matrix[2][0], 0); +} + +#[test] +fn verdict_label_thresholds_match_phase_b_table() { + assert_eq!(verdict_label(0.85, 0.80), "GO (§8.E 着手)"); + assert!(verdict_label(0.75, 0.80).contains("CONDITIONAL-GO")); + assert!(verdict_label(0.65, 0.80).contains("LOOP-V3")); + assert!(verdict_label(0.50, 0.80).contains("NO-GO")); } struct EvalRunOutcome { @@ -360,20 +603,28 @@ fn run_single_eval( use cli_finding_classifier::screen_diff; use std::time::Instant; - let diff = std::fs::read_to_string(manifest_root().join(&entry.input_diff)).unwrap(); + let diff = read_diff_body(&manifest_root().join(&entry.input_diff)); let started = Instant::now(); let result = screen_diff(client, template, &diff); let latency_ms = started.elapsed().as_millis(); let metrics = agreement_metrics(&entry.claude_code_baseline, &result); println!( - "eval {} ({}): decision_match={} overlap={:.0}% baseline={} llm={} latency={}ms fallback={:?}", + "eval {} ({}): decision={}->{} match={} P={:.0}%/{:.0}% R={:.0}%/{:.0}% F1={:.2} TP={}(norm {}) FP={} FN={} latency={}ms fallback={:?}", entry.id, entry.name, + metrics.decision_pair.0, + metrics.decision_pair.1, metrics.decision_match, - metrics.overlap_ratio() * 100.0, - metrics.baseline_finding_count, - metrics.llm_finding_count, 
+ metrics.precision() * 100.0, + metrics.precision_normalized() * 100.0, + metrics.recall() * 100.0, + metrics.recall_normalized() * 100.0, + metrics.f1(), + metrics.true_positive_count, + metrics.true_positive_normalized_count, + metrics.false_positive_count, + metrics.false_negative_count, latency_ms, result.fallback_reason, ); @@ -383,28 +634,81 @@ fn run_single_eval( } } -fn report_summary(set: &EvalSet, decision_matches: u32, mut latencies_ms: Vec) { +fn print_confusion_matrix(matrix: &[[u32; 3]; 3]) { + println!("decision confusion matrix (rows=baseline, cols=LLM):"); + println!(" auto_fix human_review informational"); + for (i, label) in DECISION_LABELS.iter().enumerate() { + println!( + "{:<14}{:>3} {:>3} {:>3}", + label, matrix[i][0], matrix[i][1], matrix[i][2] + ); + } +} + +fn aggregate_finding_counts(outcomes: &[EvalRunOutcome]) -> (usize, usize, usize, usize) { + let mut tp = 0usize; + let mut tp_norm = 0usize; + let mut fp = 0usize; + let mut fn_ = 0usize; + for o in outcomes { + tp += o.metrics.true_positive_count; + tp_norm += o.metrics.true_positive_normalized_count; + fp += o.metrics.false_positive_count; + fn_ += o.metrics.false_negative_count; + } + (tp, tp_norm, fp, fn_) +} + +fn verdict_label(agreement: f32, threshold: f32) -> &'static str { + if agreement >= threshold { + "GO (§8.E 着手)" + } else if agreement >= 0.70 { + "CONDITIONAL-GO (§8.E auto_fix lane に限定)" + } else if agreement >= 0.60 { + "LOOP-V3 (§8.D v3 ループ)" + } else { + "NO-GO (§8.E 却下判断)" + } +} + +fn report_summary(set: &EvalSet, outcomes: &[EvalRunOutcome]) { + let mut latencies_ms: Vec = outcomes.iter().map(|o| o.latency_ms).collect(); latencies_ms.sort_unstable(); let p50 = latencies_ms[latencies_ms.len() / 2]; let p95_idx = (latencies_ms.len() as f32 * 0.95) as usize; let p95 = latencies_ms[p95_idx.min(latencies_ms.len() - 1)]; + let decision_matches = outcomes.iter().filter(|o| o.metrics.decision_match).count() as u32; let agreement = decision_matches as f32 / 
set.evals.len() as f32; + let (tp, tp_norm, fp, fn_) = aggregate_finding_counts(outcomes); + let agg_precision = ratio_or_default(tp, tp + fp, tp == 0 && fp == 0 && fn_ == 0); + let agg_recall = ratio_or_default(tp, tp + fn_, tp == 0 && fp == 0 && fn_ == 0); + let agg_precision_norm = ratio_or_default(tp_norm, tp + fp, tp == 0 && fp == 0 && fn_ == 0); + let agg_recall_norm = ratio_or_default(tp_norm, tp + fn_, tp == 0 && fp == 0 && fn_ == 0); + let pairs: Vec<(String, String)> = outcomes + .iter() + .map(|o| o.metrics.decision_pair.clone()) + .collect(); + let matrix = build_confusion_matrix(&pairs); println!("---"); println!( - "agreement rate = {decision_matches}/{} = {:.1}% (threshold {:.0}%)", + "decision agreement rate = {decision_matches}/{} = {:.1}% (threshold {:.0}%)", set.evals.len(), agreement * 100.0, set.agreement_threshold * 100.0 ); + println!( + "aggregate precision={:.1}% recall={:.1}% (normalized: P={:.1}% R={:.1}%)", + agg_precision * 100.0, + agg_recall * 100.0, + agg_precision_norm * 100.0, + agg_recall_norm * 100.0, + ); println!("latency p50={p50}ms p95={p95}ms"); + print_confusion_matrix(&matrix); println!( - "Phase b GO/NO-GO: {}", - if agreement >= set.agreement_threshold { - "GO (§8.E 着手)" - } else { - "NO-GO (§8.D prompt v2 先行)" - } + "Phase b verdict: {}", + verdict_label(agreement, set.agreement_threshold) ); } @@ -423,23 +727,19 @@ fn run_lint_screen_against_all_fixtures() { let set = load_eval_set(); let client = OllamaClient::new("http://localhost:11434", "mistral:7b") - .with_timeout(Duration::from_secs(60)); + .with_timeout(Duration::from_secs(60)) + .with_temperature(0.0); let template = std::fs::read_to_string( Path::new(env!("CARGO_MANIFEST_DIR")).join("prompts/lint-screen.txt"), ) .unwrap(); - let mut decision_matches = 0u32; - let mut latencies_ms: Vec = Vec::new(); - - println!("\n=== Phase a evals: lint-screen end-to-end ==="); - for entry in &set.evals { - let outcome = run_single_eval(entry, &client, &template); - if 
outcome.metrics.decision_match { - decision_matches += 1; - } - latencies_ms.push(outcome.latency_ms); - } + println!("\n=== Phase b' evals: lint-screen end-to-end ==="); + let outcomes: Vec = set + .evals + .iter() + .map(|entry| run_single_eval(entry, &client, &template)) + .collect(); - report_summary(&set, decision_matches, latencies_ms); + report_summary(&set, &outcomes); }
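
Note: the diff calls a `read_diff_body` helper whose definition is outside this hunk. Based on the header-skipping behavior described in ADR-038 (the runner drops leading `#` comment lines and passes only the content from `diff --git` onward to the LLM), a minimal sketch could look like the following. The implementation details here are an assumption; only the function name and the `#`-prefix convention come from the diff and the ADR.

```rust
use std::fs;
use std::path::Path;

/// Drop the leading `# SYNTHETIC FIXTURE` comment block (and surrounding
/// blank lines) so only the `diff --git ...` payload reaches the LLM.
fn strip_fixture_header(content: &str) -> String {
    content
        .lines()
        .skip_while(|line| line.starts_with('#') || line.trim().is_empty())
        .collect::<Vec<_>>()
        .join("\n")
}

/// Hypothetical file-reading wrapper matching the `read_diff_body` call sites.
fn read_diff_body(path: &Path) -> String {
    strip_fixture_header(&fs::read_to_string(path).unwrap())
}

fn main() {
    let fixture = "# SYNTHETIC FIXTURE: eval3-magic-number\n\
                   # issue_pattern: magic-number\n\
                   diff --git a/x.rs b/x.rs\n\
                   +let delay_ms = 30000;";
    let body = strip_fixture_header(fixture);
    assert!(body.starts_with("diff --git "));
    let _ = read_diff_body; // file I/O path is exercised by the real test suite
}
```

Because the comment block is stripped before prompting, the fixture header documents intent for reviewers without influencing the LLM's output.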
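
Note: the metrics in this diff rely on one convention worth calling out: when both the baseline and the LLM report zero findings, precision and recall are defined as 1.0 (perfect agreement) rather than 0/0, while any other zero denominator scores 0.0. A self-contained restatement of that convention, with illustrative numbers (not from a real eval run):

```rust
/// Ratio with an explicit convention for the 0/0 case: when both sides
/// produced zero findings, agreement is perfect (1.0); any other zero
/// denominator scores 0.0.
fn ratio_or_default(numerator: usize, denominator: usize, both_empty: bool) -> f32 {
    if denominator == 0 {
        if both_empty { 1.0 } else { 0.0 }
    } else {
        numerator as f32 / denominator as f32
    }
}

/// Harmonic mean of precision and recall, guarded against p + r == 0.
fn f1(p: f32, r: f32) -> f32 {
    if p + r == 0.0 { 0.0 } else { 2.0 * p * r / (p + r) }
}

fn main() {
    // TP=1 with llm_count=2 and baseline_count=2 → P = R = 0.5, F1 = 0.5.
    let p = ratio_or_default(1, 2, false);
    let r = ratio_or_default(1, 2, false);
    assert_eq!(f1(p, r), 0.5);
    // Clean fixture, clean answer: 0/0 is defined as 1.0, not NaN.
    assert_eq!(ratio_or_default(0, 0, true), 1.0);
    // LLM hallucinated a finding against an empty baseline: the 0/0 recall
    // with both_empty=false scores 0.0.
    assert_eq!(ratio_or_default(0, 0, false), 0.0);
}
```

This is why the `agreement_metrics_empty_both_sides_overlap_one` test asserts a recall of 1.0, and why clean fixtures used for false-positive detection do not drag the aggregate score to NaN.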