feat(cli-finding-classifier): Phase b' lint-screen evals — N=12 + multi-metrics + prompt v3 (83.3% GO) by aloekun · Pull Request #131 · aloekun/claude-code-hook-test

aloekun · 2026-05-08T10:47:55Z

Summary

Ran the GO/NO-GO decision for Phase b (docs/local-llm-offload-analysis.md §1) and reached GO for starting §8.E with an agreement rate of 10/12 = 83.3%. As Phase b':

Stronger validation metrics (S1): tests/lint_screen_evals.rs now reports precision / recall / F1 / normalized P/R / decision confusion matrix
Fixture expansion N=6 → N=12 (S2): added 6 fixtures (style-only / large-refactor / multi-import-leak / nesting-boundary / comment-only / test-cfg)
Fixture design codified (S3): ADR-038 now documents the 3-axis design (issue_pattern / expected_screen_decision / verification_purpose) and the # SYNTHETIC FIXTURE standard header convention, both applied to the 6 new fixtures
prompt v3 (S4): canonical rule list / forced multi-issue scan / decision tree / 8 few-shot examples / anti-hallucination preamble
Verdict (S6): GO
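
The precision / recall / F1 figures listed under S1 can be illustrated with a minimal sketch. The `Counts` type and its field names are hypothetical and are not the `AgreementMetrics` struct from tests/lint_screen_evals.rs. The PR does not state the raw counts; TP=15, FP=4, FN=3 is only one combination that happens to reproduce the reported 78.9% / 83.3%.

```rust
/// Hypothetical container for finding-level counts (not the PR's real type).
#[derive(Debug, Clone, Copy)]
pub struct Counts {
    pub true_pos: u32,
    pub false_pos: u32,
    pub false_neg: u32,
}

impl Counts {
    /// TP / (TP + FP); `None` when the model reported no findings at all.
    pub fn precision(&self) -> Option<f64> {
        let denom = self.true_pos + self.false_pos;
        (denom > 0).then(|| f64::from(self.true_pos) / f64::from(denom))
    }

    /// TP / (TP + FN); `None` when no findings were expected.
    pub fn recall(&self) -> Option<f64> {
        let denom = self.true_pos + self.false_neg;
        (denom > 0).then(|| f64::from(self.true_pos) / f64::from(denom))
    }

    /// Harmonic mean of precision and recall; `None` if either is undefined
    /// or both are zero.
    pub fn f1(&self) -> Option<f64> {
        let p = self.precision()?;
        let r = self.recall()?;
        (p + r > 0.0).then(|| 2.0 * p * r / (p + r))
    }
}
```

Returning `Option` keeps the undefined case (no predictions or no expectations) explicit instead of silently reporting 0% or NaN.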

Iteration history

iteration	N	prompt	agreement	precision / recall	verdict
v1 (Phase b, first run)	6	original	50.0%	n/a / n/a	NO-GO
v2 (Phase b' canonical)	12	+ canonical / decision tree / 4 few-shot examples	41.7%	68.2% / 83.3%	NO-GO
v3 (anti-hallucination)	12	+ "default to no findings" preamble + 4 empty-finding examples	75.0%	78.9% / 83.3%	conditional GO
v3 + baseline fix	12	(eval 6 baseline informational → auto_fix)	83.3%	78.9% / 83.3%	GO

Agreement improved by +33pt from v2 to v3 (41.7% → 75.0%). The LLM became able to choose the informational decision (confusion matrix: 0/6 → 4/6).

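Remaining 2 mismatches (LLM-side limitations)

eval 5: human_review → auto_fix (missed the deep-nesting issue inside a multi-issue diff, recall 75%)
eval 10: informational → human_review (overreaction on the 4-levels boundary case)

The v4 prompt iteration will continue in Phase c (§8.E implementation).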

Final confusion matrix (v3 + baseline fix)

                auto_fix  human_review  informational
auto_fix           5            0             0
human_review       1            1             0
informational      0            1             4
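
As a small worked check, the agreement rate can be recomputed from the matrix above. This is a sketch, not code from the PR: it assumes rows are the expected decision and columns are the LLM's decision, which is consistent with the two mismatches listed for eval 5 and eval 10. `agreement_rate` and `FINAL_MATRIX` are hypothetical names.

```rust
/// Matrix above, ordered auto_fix, human_review, informational.
/// Assumption: rows = expected decision, columns = LLM decision.
pub const FINAL_MATRIX: [[u32; 3]; 3] = [
    [5, 0, 0], // expected auto_fix
    [1, 1, 0], // expected human_review
    [0, 1, 4], // expected informational
];

/// Share of evals on the diagonal (expected == predicted).
/// Returns `None` for an empty matrix to avoid dividing by zero.
pub fn agreement_rate(matrix: &[[u32; 3]; 3]) -> Option<f64> {
    let total: u32 = matrix.iter().flatten().sum();
    if total == 0 {
        return None;
    }
    let diagonal: u32 = (0..3).map(|i| matrix[i][i]).sum();
    Some(f64::from(diagonal) / f64::from(total))
}
```

With the values above the diagonal sums to 5 + 1 + 4 = 10 out of 12 evals, which matches the reported 83.3%.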

Test plan

cargo test -p cli-finding-classifier: lib 25 + bin 11 + integration 19 pass + 1 ignored
cargo test --ignored run_lint_screen_against_all_fixtures (against a real Ollama mistral:7b): agreement 83.3%
cargo clippy -p cli-finding-classifier --tests -- -D warnings: clean
cargo fmt --check: clean
markdownlint: clean

Adopted post-merge-feedback items (PR #130)

T1-1 (validation_constant_extraction): not yet addressed (next PR)
T2-1 (agreement_logic_refinement): match key changed from (file, rule, line ±2) to (file, rule) (finding_matches)
T2-2 (eval_fixture_ci_job): enable the #[ignore] test as an optional CI step; not yet addressed (planned for the start of Phase c)
T3-1 (doc_restructuring_pattern): not yet addressed (to be codified in docs-governance.md in the next PR)
T3-2 (synthetic_fixture_design_doc): ADR-038 + standardized fixture header ✅ done
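
The T2-1 change to the match key can be sketched as follows. The `Finding` struct and `finding_matches_with_line` are hypothetical stand-ins; only the name `finding_matches` and the two match keys come from the PR description, and the real implementation may differ.

```rust
/// Hypothetical finding shape; the real type in lint_screen_evals.rs may differ.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Finding {
    pub file: String,
    pub rule: String,
    pub line: u32,
}

/// Earlier criterion as described: same file, same rule, line within ±2.
pub fn finding_matches_with_line(expected: &Finding, actual: &Finding) -> bool {
    expected.file == actual.file
        && expected.rule == actual.rule
        && expected.line.abs_diff(actual.line) <= 2
}

/// Relaxed criterion: same file and same rule; the line number is ignored.
pub fn finding_matches(expected: &Finding, actual: &Finding) -> bool {
    expected.file == actual.file && expected.rule == actual.rule
}
```

Dropping the line from the key means a finding reported on the right rule but a distant line still counts as a match, so line-number drift by the LLM no longer lowers agreement.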

Out of scope

T1-1 (validation_constant_extraction): next PR
T3-1 (doc_restructuring_pattern): next PR
§8.D classify mode prompt v2 (Phase b' revises only the lint-screen mode prompt)
Phase c (§8.E lint screen facet implementation): separate PR

Summary by CodeRabbit

Documentation
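- Updated the ADR and the analysis document to clarify the eval fixture design, the Phase b re-evaluation (conditional GO), and the reproduction steps.
Tests
- Strengthened the eval tests with agreement and precision/recall computation, rule normalization, aggregation, and report output.
Chores
- Revised the prompt and eval data to be more conservative and deterministic, and updated many fixtures and expected values.
Style
- Added style-only and comment-only fixtures.
New Features
- Added and tightened session/token handling and authorization decisions around authentication.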

coderabbitai · 2026-05-08T10:48:06Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 277e89e9-2314-4488-9092-dabf470fd06b

📥 Commits

Reviewing files that changed from the base of the PR and between 485e636 and 7313a79.

📒 Files selected for processing (1)

src/cli-finding-classifier/evals/files/eval8-large-refactor.diff

🚧 Files skipped from review as they are similar to previous changes (1)

src/cli-finding-classifier/evals/files/eval8-large-refactor.diff

📝 Walkthrough

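This PR introduces, as one coherent change, the eval fixture convention in the ADR, a deterministic lint-screen prompt, agreement metrics (strict/normalized) with rule normalization, 6 synthetic fixtures, and the test runner with aggregate reporting. The docs update the Phase b verdict to conditional GO (9/12 = 75.0%).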

Changes

Phase b' evaluation pipeline integration

Layer / File(s)	Summary
Eval fixture design convention `docs/adr/adr-038-local-llm-finding-classification.md`	Adds the standard eval fixture header, the 3 axes (issue_pattern / expected_screen_decision / verification_purpose), and the rule that `#` lines are excluded from the LLM input.
Phase b documentation update `docs/local-llm-offload-analysis.md`	Updates the Phase b description and changes the final verdict from 10/12 to 9/12 (75.0%) and to conditional GO (limited to the §8.E auto_fix lane). Organizes the iteration history and the rerun steps (Ollama/mistral:7b, cargo test).
Baseline & prompt determinism `src/cli-finding-classifier/evals/lint-screen-evals.json`, `src/cli-finding-classifier/prompts/lint-screen.txt`	Changes the eval id:6 baseline from `informational` to `auto_fix`. Redefines the lint-screen prompt around a conservative default (findings=[]), a canonical rule whitelist, and a decision tree. Tightens the output JSON schema.
Synthetic eval fixtures (7-12) `src/cli-finding-classifier/evals/files/eval7-style-only.diff`, `.../eval8-large-refactor.diff`, `.../eval9-multi-import-leak.diff`, `.../eval10-nesting-boundary.diff`, `.../eval11-comment-only.diff`, `.../eval12-test-cfg.diff`	Adds 6 kinds of synthetic fixtures (style-only, large-refactor, import-leak, nested-boundary, comment-only, test-cfg). Each fixture contains the ADR convention header and a unified diff body.
Agreement metrics foundation `src/cli-finding-classifier/tests/lint_screen_evals.rs`	Strips the `#` header via `read_diff_body`, extends AgreementMetrics to strict/normalized TP/FP/FN + precision/recall/F1, normalizes rule synonyms with `normalize_rule_name`, and adds decision-pair tracking.
Fixture loading & distribution check `src/cli-finding-classifier/tests/lint_screen_evals.rs`	Verifies that Phase b' has 12 entries. Checks the distribution across the screen_decision lanes (informational/human_review/auto_fix). Updates diff validation to check that the `read_diff_body` output starts with `diff --git`.
Agreement test update `src/cli-finding-classifier/tests/lint_screen_evals.rs`	Rewrites the tests for perfect match, line-position invariance, far-off match, and empty/one-sided cases to verify them with strict/normalized counts.
Unit tests for rule normalization, confusion matrix, and verdict label `src/cli-finding-classifier/tests/lint_screen_evals.rs`	Adds unit tests for the `normalize_rule_name` synonym mapping, normalized-match recovery, `build_confusion_matrix`, and `verdict_label`.
End-to-end runner & aggregate report `src/cli-finding-classifier/tests/lint_screen_evals.rs`	`run_single_eval` uses `read_diff_body` and `agreement_metrics` and prints the decision pair, strict/normalized TP/FP/FN, precision/recall/F1, and latency. `report_summary` computes and prints the aggregate metrics, the confusion matrix, and the verdict label.
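
The header-stripping step described for `read_diff_body` can be sketched as below. This is an illustration only: `strip_fixture_header` is a hypothetical name, it works on an in-memory string rather than a file path, and it assumes the `#` header (plus any blank lines) appears only at the top of the fixture. The `issue_pattern:` header line in the usage example is an assumed format.

```rust
/// Drops the leading `#` header lines (and blank lines between header and
/// body) so that only the unified diff body is passed to the LLM.
/// Assumption: header lines occur only before the first diff line.
pub fn strip_fixture_header(raw: &str) -> String {
    raw.lines()
        .skip_while(|line| line.starts_with('#') || line.trim().is_empty())
        .collect::<Vec<_>>()
        .join("\n")
}

/// Mirrors the validation described above: the body must start with `diff --git`.
pub fn body_looks_like_git_diff(body: &str) -> bool {
    body.starts_with("diff --git")
}
```

For a fixture such as `"# SYNTHETIC FIXTURE\n# issue_pattern: style-only\n\ndiff --git a/x.rs b/x.rs\n..."`, the returned body begins at the `diff --git` line.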

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

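aloekun/claude-code-hook-test#130: Adds the lint-screen eval fixtures and test infrastructure, which directly intersect with this PR's fixture header convention, read_diff_body, and metrics implementation.
aloekun/claude-code-hook-test#121: Related because its updates to ADR-038 and the local-llm offload analysis document overlap with this PR.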

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
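Title check	✅ Passed	The PR title centers on the Phase b' GO/NO-GO evaluation and specifically names the N=12 expansion, the added metrics, prompt v3, and the 83.3% result, accurately reflecting all file changes in raw_summary (ADR-038 additions, eval expansion, prompt/metrics improvements, expanded tests).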
Docstring Coverage	✅ Passed	Docstring coverage is 83.33% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/local-llm-offload-analysis.md`:
- Around line 26-29: The opening status sentence "Phase b/c/d not started" contradicts the heading further down, "Phase b: GO/NO-GO verdict ✅
**GO achieved (2026-05-08)**". Update the opening status to reflect that Phase b reached GO (e.g.
"Phase b: GO (2026-05-08) / Phase c/d: not started") so that the status description at the top and that
heading are consistent.

In `@src/cli-finding-classifier/evals/files/eval8-large-refactor.diff`:
- Around line 31-42: The private field user of the Session struct is only written and never read, which triggers a dead_code
warning; fix it together with Session::new. Either (A) add a read accessor pub fn user(&self) ->
&str (or &String) to Session so the field is actually referenced, or (B) if the field is intentionally unused,
put #[allow(dead_code)] on the user field to suppress the warning. The relevant locations are the
user field in the Session struct definition and the impl Session block (Session::new).
- Around line 49-51: is_expired always returns false, so sessions never expire;
fix this without breaking the "single issue_pattern" intent of ADR-038.
Specifically, either replace is_expired(&self, _now: Duration) with a proper TTL-based implementation (for example, add an expires_at or
ttl field to the struct and compare it against now), or add a comment/documentation that makes clear this unconditional false
is an intentional, not-yet-implemented stub. Target symbols: is_expired,
issue_pattern (and update the session struct name/field names as needed).
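
The two eval8 suggestions above (option A, a read accessor for `user`, and a TTL-based `is_expired`) could look roughly like this. It is a sketch under assumptions: the fixture's real `Session` fields and constructor signature are not shown in the review, so `expires_at`, `created_at`, and `ttl` are hypothetical, and time is modelled as a `Duration` since an arbitrary epoch.

```rust
use std::time::Duration;

/// Hypothetical reconstruction of the fixture's session type.
pub struct Session {
    user: String,
    /// Absolute expiry, as a duration since the same epoch as `now`.
    expires_at: Duration,
}

impl Session {
    /// `created_at` and `ttl` are assumed parameters, not the fixture's API.
    pub fn new(user: impl Into<String>, created_at: Duration, ttl: Duration) -> Self {
        Self {
            user: user.into(),
            // saturating_add avoids a panic on overflow for very large TTLs.
            expires_at: created_at.saturating_add(ttl),
        }
    }

    /// Option (A): a read accessor, so `user` is no longer write-only.
    pub fn user(&self) -> &str {
        &self.user
    }

    /// TTL-based check: expired once `now` reaches the expiry instant.
    pub fn is_expired(&self, now: Duration) -> bool {
        now >= self.expires_at
    }
}
```

Option (B), `#[allow(dead_code)]` on the field, would be the smaller change if the fixture is meant to stay minimal.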

In `@src/cli-finding-classifier/prompts/lint-screen.txt`:
- Line 65: Examples A–H are missing unified-diff headers which contradicts the
schema that says the "file" field must be copied from the +++ b/<path> header;
update each Example A–H Input diff to include both --- a/<path> and +++ b/<path>
unified-diff headers so the examples match the real fixtures and the "file"
value is the verbatim +++ b/<path> header, or if you prefer the other approach,
relax the schema to accept either +++ b/<path> or diff --git ... b/<path> when
populating the "file" field (recommended: add the headers to the examples).
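
To make the schema rule in this comment concrete, here is a minimal sketch of deriving the "file" value from the `+++ b/<path>` header. `file_from_diff` is a hypothetical helper, not part of the prompt or the test code, and it simply takes the first matching header line.

```rust
/// Returns the path from the first `+++ b/<path>` header, if any.
/// A diff without unified-diff headers yields `None`, which is the
/// situation the comment above flags for Examples A–H.
pub fn file_from_diff(diff: &str) -> Option<&str> {
    diff.lines().find_map(|line| line.strip_prefix("+++ b/"))
}
```

An example input that contains only hunk lines gives `None`, so a model following the schema literally has nothing to copy the "file" value from.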

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a18da21d-b8bc-41a4-9df9-7463e9af4f65

📥 Commits

Reviewing files that changed from the base of the PR and between b686950 and d08faa3.

📒 Files selected for processing (11)

docs/adr/adr-038-local-llm-finding-classification.md
docs/local-llm-offload-analysis.md
src/cli-finding-classifier/evals/files/eval10-nesting-boundary.diff
src/cli-finding-classifier/evals/files/eval11-comment-only.diff
src/cli-finding-classifier/evals/files/eval12-test-cfg.diff
src/cli-finding-classifier/evals/files/eval7-style-only.diff
src/cli-finding-classifier/evals/files/eval8-large-refactor.diff
src/cli-finding-classifier/evals/files/eval9-multi-import-leak.diff
src/cli-finding-classifier/evals/lint-screen-evals.json
src/cli-finding-classifier/prompts/lint-screen.txt
src/cli-finding-classifier/tests/lint_screen_evals.rs

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

docs/local-llm-offload-analysis.md (1)

5-5: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

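Unify the Phase b verdict wording to a single criterion.

The same page mixes "GO achieved", "conditional GO (75.0%)", and "agreement ≥80% already achieved", which makes the gate decision unclear when work resumes. In particular, Line 52 and Line 85 are inconsistent with the 75.0% on Line 28.

Suggested fix (unify to one of these)

- > **Status**: Trial operation (Phase a complete = PR `#130` landed / Phase b complete = GO achieved 2026-05-08, PR `#131`; Phase c/d not started).
+ > **Status**: Trial operation (Phase a complete = PR `#130` landed / Phase b' is conditional GO (75.0%, 2026-05-08, PR `#131`) / Phase c/d not started).

- These leave room for incremental improvement, but the precondition for starting Phase c (agreement ≥ 80%) is already met, so they are out of scope. Real-world observation in Phase d (PR-based dogfood) is needed.
+ These leave room for incremental improvement, but since agreement is currently 75.0%, §8.E is handled within the conditional GO scope (auto_fix lane only). The full GO decision will be re-evaluated after further improvement.

- - **Dependency**: **agreement rate ≥ 80% achieved in Phase b** (the old dependency on the §A-2 dogfood is void, replaced by the Phase a evals)
+ - **Dependency**: **follows the Phase b' decision criterion** (whether conditional GO operation or full GO at agreement rate ≥ 80%, keep it consistent with the final verdict at the top of this section)

Also applies to: 26-29, 52-53, 85-85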

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/local-llm-offload-analysis.md` at line 5, The Phase b acceptance
criteria are inconsistent across the document: instances of "GO achieved", "conditional
GO (75.0%)", and "agreement ≥80%" must be unified to a single, explicit
criterion; pick which rule to use (e.g., "conditional GO (75.0%)" or "agreement
≥80%"), then update every occurrence of those phrases (including the header
phrase "Trial operation (Phase a complete... Phase b complete = GO achieved ...)" and the in-body mentions
currently showing "GO achieved", "conditional GO (75.0%)", and "agreement ≥80%") to
the chosen wording and numeric threshold, and make sure any adjacent explanatory
text or summary sections reflect the same numeric threshold and wording so the
restart/gate decision is unambiguous.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cli-finding-classifier/evals/files/eval8-large-refactor.diff`:
- Around line 2-4: The fixture metadata is inaccurate: update the issue_pattern
and verification_purpose in the diff header to reflect the actual added lines
(mod.rs ~9 + session.rs ~23 + token.rs ~12 ≈ 44 lines). Replace "3 file / 80+ line
architectural addition" with something like "3 file / 40+ line architectural
addition (about 44 lines)" and change verification_purpose from "can it pick up the
magic-number without missing it inside a large context" to a wording that matches the actual scale (e.g., "can it pick up the
magic-number without missing it inside a medium-sized context (about 44 lines)"); keep expected_screen_decision and
other fields unchanged. Ensure the textual counts in the header match the sum of
added lines so downstream threshold/compare logic uses correct metadata.
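
The added-line tally this comment relies on can be reproduced with a short sketch. `count_added_lines` is a hypothetical helper, and it assumes git-style diffs in which each file section starts with a `diff --git` line (the fixtures are described as starting with that line).

```rust
/// Counts `+` lines inside hunks across all files in a git-style unified diff.
/// File headers such as `+++ b/<path>` are skipped because they appear
/// before the first `@@` hunk marker of each file section.
pub fn count_added_lines(diff: &str) -> usize {
    let mut in_hunk = false;
    let mut added = 0;
    for line in diff.lines() {
        if line.starts_with("diff --git ") {
            // A new file section begins; its header lines are not content.
            in_hunk = false;
        } else if line.starts_with("@@") {
            in_hunk = true;
        } else if in_hunk && line.starts_with('+') {
            added += 1;
        }
    }
    added
}
```

Tracking the hunk state, rather than filtering out lines that start with `+++`, keeps an added line whose own content begins with `++` from being miscounted as a header.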

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4a20cb27-e979-4172-9da9-e0eb4094028d

📥 Commits

Reviewing files that changed from the base of the PR and between d08faa3 and 485e636.

📒 Files selected for processing (11)

docs/adr/adr-038-local-llm-finding-classification.md
docs/local-llm-offload-analysis.md
src/cli-finding-classifier/evals/files/eval10-nesting-boundary.diff
src/cli-finding-classifier/evals/files/eval11-comment-only.diff
src/cli-finding-classifier/evals/files/eval12-test-cfg.diff
src/cli-finding-classifier/evals/files/eval7-style-only.diff
src/cli-finding-classifier/evals/files/eval8-large-refactor.diff
src/cli-finding-classifier/evals/files/eval9-multi-import-leak.diff
src/cli-finding-classifier/evals/lint-screen-evals.json
src/cli-finding-classifier/prompts/lint-screen.txt
src/cli-finding-classifier/tests/lint_screen_evals.rs

✅ Files skipped from review due to trivial changes (3)

src/cli-finding-classifier/evals/files/eval7-style-only.diff
docs/adr/adr-038-local-llm-finding-classification.md
src/cli-finding-classifier/evals/files/eval11-comment-only.diff

🚧 Files skipped from review as they are similar to previous changes (6)

src/cli-finding-classifier/evals/files/eval10-nesting-boundary.diff
src/cli-finding-classifier/evals/lint-screen-evals.json
src/cli-finding-classifier/evals/files/eval12-test-cfg.diff
src/cli-finding-classifier/evals/files/eval9-multi-import-leak.diff
src/cli-finding-classifier/prompts/lint-screen.txt
src/cli-finding-classifier/tests/lint_screen_evals.rs

aloekun · 2026-05-08T12:46:28Z

@coderabbitai review

coderabbitai · 2026-05-08T12:46:34Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

aloekun · 2026-05-08T13:31:45Z

@coderabbitai review

coderabbitai · 2026-05-08T13:31:53Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai Bot reviewed May 8, 2026

Comment thread docs/local-llm-offload-analysis.md Outdated

Comment thread src/cli-finding-classifier/evals/files/eval8-large-refactor.diff

Comment thread src/cli-finding-classifier/evals/files/eval8-large-refactor.diff Outdated

Comment thread src/cli-finding-classifier/prompts/lint-screen.txt

feat(cli-finding-classifier): Phase b' lint-screen evals — N=12 + mul…

485e636

…ti-metrics + prompt v3 (agreement 75% conditional GO, temperature=0)

aloekun force-pushed the feature/local-llm-phase-b-prime branch from d08faa3 to 485e636 Compare May 8, 2026 11:54

coderabbitai Bot reviewed May 8, 2026

Comment thread src/cli-finding-classifier/evals/files/eval8-large-refactor.diff Outdated

fix(eval8): correct the line-count notation in the fixture header from 80+ to 60+ (CR finding)

7313a79

aloekun merged commit fb33818 into master May 8, 2026
1 check passed

aloekun deleted the feature/local-llm-phase-b-prime branch May 8, 2026 13:52

aloekun mentioned this pull request May 8, 2026

feat(cli-push-runner): Phase c MVP — pre-push lint-screen step (default OFF) #132

Merged

5 tasks

coderabbitai Bot mentioned this pull request May 9, 2026

feat(cli-finding-classifier, cli-push-runner): Bundle i: mandatory follow-up before starting Phase d (priority 91 + 92 + 93) #135

Merged

5 tasks

aloekun merged 2 commits into master from feature/local-llm-phase-b-prime
