
feat(cli-finding-classifier): Phase a lint-screen evals infrastructure #130

Merged
aloekun merged 4 commits into master from feature/local-llm-phase-a-evals
May 8, 2026

Conversation

@aloekun
Owner

@aloekun aloekun commented May 8, 2026

Summary

Lands Phase a (evals infrastructure setup) as defined in docs/local-llm-offload-analysis.md §11.6. The §A-2 PR-based dogfood exposed structural limits in validating the classifier (three blockers: zero findings / missed review-body extraction / rate limit). In response, this PR lays the groundwork for switching validation to an evals format: fixed diff fixtures + Claude Code baseline + agreement comparison.

  • 2 commits:
    • docs(local-llm-offload): §11 retrospective + Phase a-d phasing
    • feat(cli-finding-classifier): Phase a implementation (lint-screen mode + fixtures + baseline + integration test)

Deliverables

| Category | Location | Contents |
|---|---|---|
| fixtures | src/cli-finding-classifier/evals/files/ | 6 fixtures: unused-import / deep-nesting / magic-number / clean (false-positive detection) / multi-issue / existing-lint-overlap |
| baseline | src/cli-finding-classifier/evals/lint-screen-evals.json | Claude Code baseline + expectations + agreement_threshold = 0.8 |
| prompt | src/cli-finding-classifier/prompts/lint-screen.txt | Output contract = { lint_findings, screen_decision } |
| runner | src/cli-finding-classifier/src/{lib,main}.rs | Adds --mode lint-screen and the public screen_diff() API; fallback inherits the same human_review + fallback_reason pattern as classify mode |
| compare | src/cli-finding-classifier/tests/lint_screen_evals.rs | 12 always-run schema/structure validation tests + 1 #[ignore]d end-to-end runner for Phase b |
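
The fallback pattern named in the runner row can be pictured roughly as follows. This is a minimal sketch, not the crate's actual code: `ScreenOutcome`, `finalize_decision`, and `fallback` are hypothetical names introduced for illustration, and only the three decision values are taken from the constants quoted later in the review.

```rust
/// Allowed decisions, as quoted in the review of lib.rs.
const VALID_SCREEN_DECISIONS: &[&str] = &["auto_fix", "human_review", "informational"];

/// Hypothetical, simplified result type; the real LintScreenResult may differ.
#[derive(Debug, PartialEq)]
pub struct ScreenOutcome {
    pub screen_decision: String,
    pub fallback_reason: Option<String>,
}

/// Takes the decision parsed from the model output, or an error describing
/// why no decision is available (transport error, malformed JSON, ...).
pub fn finalize_decision(parsed: Result<String, String>) -> ScreenOutcome {
    match parsed {
        // A known decision passes through with no fallback reason.
        Ok(d) if VALID_SCREEN_DECISIONS.contains(&d.as_str()) => ScreenOutcome {
            screen_decision: d,
            fallback_reason: None,
        },
        // Contract violation: the model returned a value outside the contract.
        Ok(d) => fallback(format!("unknown screen_decision: {d}")),
        // Upstream failure: no usable output at all.
        Err(e) => fallback(e),
    }
}

/// Every failure path degrades to human_review and records why.
fn fallback(reason: String) -> ScreenOutcome {
    ScreenOutcome {
        screen_decision: "human_review".to_string(),
        fallback_reason: Some(reason),
    }
}
```

The point of the pattern is that a failure never yields auto_fix: any error ends up in front of a human, with the reason attached.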

Design notes

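  • Adds --mode lint-screen to the existing binary (no new crate needed; reuses the ADR-038 crate)
  • The prompt template lives in a separate file (prompts/lint-screen.txt) and is bundled via include_str!
  • The compare logic is a Rust integration test (runnable with cargo test, easy to integrate into CI)
  • The Phase b GO/NO-GO decision (real Ollama calls) is isolated in a separate #[ignore]d test
  • The ADR-038 body is a permanent artifact and is left unchanged; Phase progress is recorded only in the ephemeral §11.6 (consistent with the docs-governance rule that permanent docs must not reference ephemeral ones)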
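
As a rough illustration of adding a second mode to one binary, the sketch below parses a `--mode` flag with the standard library only. `Mode` and `parse_mode` are hypothetical names, and defaulting to classify when the flag is absent is an assumption; the actual main.rs may use a different argument parser and a different default.

```rust
use std::str::FromStr;

/// Hypothetical mode enum; the two variants mirror the mode names in this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Mode {
    Classify,
    LintScreen,
}

impl FromStr for Mode {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "classify" => Ok(Mode::Classify),
            "lint-screen" => Ok(Mode::LintScreen),
            other => Err(format!("unknown --mode value: {other}")),
        }
    }
}

/// Finds `--mode <value>` in an argument list.
/// Assumption: a missing flag falls back to Classify.
pub fn parse_mode(args: &[&str]) -> Result<Mode, String> {
    match args.iter().position(|a| *a == "--mode") {
        None => Ok(Mode::Classify),
        Some(i) => args
            .get(i + 1)
            .ok_or_else(|| "--mode requires a value".to_string())?
            .parse(),
    }
}
```

Keeping the dispatch in one enum means an unknown mode fails at argument parsing rather than deep inside the run.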

Test plan

  • cargo test -p cli-finding-classifier: lib 25 + bin 11 + integration 12 = 48 tests pass + 1 ignored
  • cargo clippy -p cli-finding-classifier --tests -- -D warnings: clean
  • cargo fmt -p cli-finding-classifier --check: clean
  • markdownlint on docs/local-llm-offload-analysis.md: clean
  • Phase b end-to-end (not run in this PR; to be run in a separate session): cargo test -p cli-finding-classifier --test lint_screen_evals -- --ignored --nocapture run_lint_screen_against_all_fixtures

Out of scope

  • Phase b GO/NO-GO decision (real Ollama calls + agreement-rate measurement)
  • Phase c (implementation of the §8.E ollama-lint-screen takt facet)
  • Phase d (PR-based dogfood in a real environment)
  • style-only / large-refactor fixtures (to be decided based on the Phase b results)

Summary by CodeRabbit

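Release notes

  • New features

    • Added a "lint-screen" workflow to the CLI, selected by a mode switch (diff analysis → JSON output).
  • Tests

    • Added benchmarks and an integration test suite for lint-screen evaluation, including agreement-rate calculation and a run procedure covering many validation cases.
  • Documentation

    • Updated the validation procedure and history, centered on the phased execution plan and the decision criteria (such as the agreement-rate threshold).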

@coderabbitai

coderabbitai Bot commented May 8, 2026

Review Change Stack

Warning

Rate limit exceeded

@aloekun has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 20 minutes and 1 second before requesting another review.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 63d93904-3e87-4867-910a-89a493297c88

📥 Commits

Reviewing files that changed from the base of the PR and between 979359c and 7439aa2.

📒 Files selected for processing (1)
  • docs/local-llm-offload-history.md
📝 Walkthrough

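Walkthrough

The PR implements an eval-driven approach for the lint-screen classifier: it adds strategy and phasing documents, six fixed sample diffs with a JSON/prompt contract, LLM-driven screen logic, a per-mode CLI, and Phase a validation tests that include agreement-rate measurement.

Changes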

Lint-Screen Evaluation Workflow

Layer / File(s) Summary
Strategy and phasing
docs/local-llm-offload-analysis.md, docs/local-llm-offload-history.md
Adds and rewrites the remaining-work runbook and the history. States that Phase a is complete, lays out the phasing of Phase b/c/d, and defines the agreement-rate gate (≥80%) and a resume checklist.
Config reference
pr-monitor-config.toml
Updates the classifier comment reference in pr-monitor to point to history.md. The config values themselves are unchanged.
Eval fixtures and contract
src/cli-finding-classifier/evals/files/eval1-unused-import.diff, eval2-deep-nesting.diff, eval3-magic-number.diff, eval4-clean.diff, eval5-multi-issue.diff, eval6-existing-lint-overlap.diff, src/cli-finding-classifier/evals/lint-screen-evals.json
Adds 6 fixed diff fixtures and the evals JSON, defining the baseline findings, screen_decision, and expectations for each case.
Prompt: lint-screen
src/cli-finding-classifier/prompts/lint-screen.txt
Defines inspection of + lines only, the detection patterns (unused-import, deep-nesting, magic-number, etc.), exclusion rules, a single-decision heuristic, and a strict JSON output schema.
Lint-screen core implementation
src/cli-finding-classifier/src/lib.rs
Adds the LintFinding/LintScreenResult types, build_lint_screen_prompt, and screen_diff. Parses the LLM output and validates decision/severity; on Ollama errors or contract violations it returns human_review + fallback_reason.
CLI mode refactoring
src/cli-finding-classifier/src/main.rs
`--mode classify
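Phase a validation tests
src/cli-finding-classifier/tests/lint_screen_evals.rs
Adds schema/fixture consistency tests for the evals JSON, agreement_metrics (decision match plus finding overlap with a ±2-line tolerance), a set of unit tests, and an ignored end-to-end runner (agreement-rate calculation, p50/p95 latency aggregation, GO/NO-GO decision).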
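
To make the agreement idea concrete, here is a minimal sketch of finding overlap with a ±2-line tolerance and a threshold gate. It is an illustration under assumptions, not the test file's actual agreement_metrics: `Finding`, `overlap_count`, and `passes_gate` are hypothetical names, the match key (rule, file, line within tolerance) is assumed, and matching is greedy one-to-one.

```rust
/// Hypothetical minimal finding; the real LintFinding carries more fields.
#[derive(Debug, Clone)]
pub struct Finding {
    pub rule: String,
    pub file: String,
    pub line: u32,
}

/// Tolerance from the walkthrough: lines within ±2 count as the same spot.
const LINE_TOLERANCE: u32 = 2;

fn is_match(a: &Finding, b: &Finding) -> bool {
    a.rule == b.rule && a.file == b.file && a.line.abs_diff(b.line) <= LINE_TOLERANCE
}

/// Counts baseline findings that pair with a distinct candidate finding.
/// Each candidate is consumed at most once (greedy, first match wins).
pub fn overlap_count(baseline: &[Finding], candidate: &[Finding]) -> usize {
    let mut used = vec![false; candidate.len()];
    let mut hits = 0;
    for b in baseline {
        let hit = (0..candidate.len()).find(|&i| !used[i] && is_match(b, &candidate[i]));
        if let Some(i) = hit {
            used[i] = true;
            hits += 1;
        }
    }
    hits
}

/// GO/NO-GO gate on the decision agreement rate; an empty run never passes.
pub fn passes_gate(decision_matches: u32, total: u32, threshold: f64) -> bool {
    total > 0 && f64::from(decision_matches) / f64::from(total) >= threshold
}
```

With six fixtures and agreement_threshold = 0.8, five matching decisions (about 0.83) would pass the gate and four (about 0.67) would not.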

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • aloekun/claude-code-hook-test#120: Strongly related at the code level through LLM output parsing / classifier API changes in src/cli-finding-classifier/src/lib.rs.
  • aloekun/claude-code-hook-test#119: An earlier PR that introduced/extended the existing cli-finding-classifier crate; this PR adds the lint-screen feature and evals on top of it.
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The pull request title accurately describes the main change: implementing Phase a lint-screen evals infrastructure with fixtures, baseline, prompt, runner API, and tests. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (2)
src/cli-finding-classifier/prompts/lint-screen.txt (1)

47-53: ⚡ Quick win

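The line field requires an "absolute line number in the new file"; beware of LLM line-number accuracy

An accurate new-file line number can only be obtained by starting from the unified diff @@ header (+new_start,count) and accumulating an offset for each + line. LLMs sometimes get this calculation wrong and return a diff-relative line number, or the @@ value as is.

If the agreement calculation in tests/lint_screen_evals.rs includes the line field, line-number drift lowers the agreement rate regardless of actual detection accuracy.

Options:

  • Option A: Narrow the matching key for the agreement calculation to (rule, file) and treat line as advisory (minimal change)
  • Option B: Change the prompt to ask for a "0-origin offset from the + start line of the @@ header" to reduce the LLM's calculation burden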
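
For reference, the bookkeeping the prompt asks the model to do can be sketched as below. This is an illustration only: `new_start` and `added_lines` are hypothetical helpers, and the sketch assumes git-style diffs (it resets at each `diff --git` line). Note that context lines advance the new-file counter as well as + lines, while - lines do not.

```rust
/// Parses the new-file start line from a hunk header such as "@@ -10,4 +12,6 @@".
fn new_start(header: &str) -> Option<u32> {
    let plus = header.split_whitespace().find(|t| t.starts_with('+'))?;
    plus[1..].split(',').next()?.parse().ok()
}

/// Returns (new_file_line, text) for every added line in a unified diff.
pub fn added_lines(diff: &str) -> Vec<(u32, String)> {
    let mut out = Vec::new();
    // Next new-file line number; None while outside any hunk, so the
    // "---" / "+++" file headers are never mistaken for diff content.
    let mut next: Option<u32> = None;
    for line in diff.lines() {
        if line.starts_with("diff --git ") {
            next = None; // a new file section begins
        } else if line.starts_with("@@") {
            next = new_start(line);
        } else if let Some(n) = next {
            if line.starts_with('+') {
                out.push((n, line[1..].to_string()));
                next = Some(n + 1);
            } else if line.starts_with('-') || line.starts_with('\\') {
                // Removed lines and "\ No newline at end of file" markers
                // do not exist in the new file, so the counter stays put.
            } else {
                next = Some(n + 1); // context line
            }
        }
    }
    out
}
```

The number of cases in this small loop suggests why a model might return a diff-relative position instead, which is the drift the two options above try to neutralize.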
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cli-finding-classifier/prompts/lint-screen.txt` around lines 47 - 53, The
prompt currently requires "line" as an absolute new-file line number which LLMs
often miscompute; either (A) change the agreement logic in
tests/lint_screen_evals.rs to ignore "line" for matching (use (rule,file) as the
key and treat "line" as advisory) or (B) update
src/cli-finding-classifier/prompts/lint-screen.txt to redefine "line" to be a
0-origin offset from the @@ header's + start (so LLMs return the simpler
offset). Locate the "line" field in lint-screen.txt and the agreement/matching
code in tests/lint_screen_evals.rs and implement one of these two fixes so
agreement rates are not spuriously lowered by LLM line-number errors.
src/cli-finding-classifier/src/lib.rs (1)

198-199: ⚡ Quick win

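Make the validation constants pub to eliminate the duplicate definitions

Because VALID_SCREEN_DECISIONS and VALID_LINT_SEVERITIES are private, tests/lint_screen_evals.rs defines its own copies of these constants (one of them under a different name, VALID_SEVERITIES). With this duplication, if for example a new severity is added to lib but the test side is not updated, the fixture schema validation tests still pass while lib's validation rejects the value at runtime: a silent divergence.

♻️ Suggested fix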
-const VALID_SCREEN_DECISIONS: &[&str] = &["auto_fix", "human_review", "informational"];
-const VALID_LINT_SEVERITIES: &[&str] = &["minor", "major", "critical"];
+pub const VALID_SCREEN_DECISIONS: &[&str] = &["auto_fix", "human_review", "informational"];
+pub const VALID_LINT_SEVERITIES: &[&str] = &["minor", "major", "critical"];

On the tests/lint_screen_evals.rs side, delete the duplicate definitions and replace them with an import:

-use cli_finding_classifier::{LintFinding, LintScreenResult};
+use cli_finding_classifier::{LintFinding, LintScreenResult, VALID_SCREEN_DECISIONS, VALID_LINT_SEVERITIES};

-const VALID_SCREEN_DECISIONS: &[&str] = &["auto_fix", "human_review", "informational"];
-const VALID_SEVERITIES: &[&str] = &["minor", "major", "critical"];

Also, rename the references in lint_screen_evals.rs from VALID_SEVERITIES to VALID_LINT_SEVERITIES for consistency.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cli-finding-classifier/src/lib.rs` around lines 198 - 199, Make the
validation arrays public and remove the duplicate test definitions: change the
constants VALID_SCREEN_DECISIONS and VALID_LINT_SEVERITIES in lib.rs to be pub
so tests can import them, then delete the local constants in
tests/lint_screen_evals.rs, import the pub constants from the crate, and update
any test references using VALID_SEVERITIES to use VALID_LINT_SEVERITIES to keep
names consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/local-llm-offload-analysis.md`:
- Around line 667-683: The §11.1 text claims PR `#125`〜#129 (P-1〜P-5) and shows a
2/5 起動率, but the §A-2 計測ログ P-5 row is still blank; verify whether P-5 (PR `#129`)
has actually landed and then either (A) if landed, populate the §A-2 log P-5 row
with the correct PR number and findings values and ensure the aggregate table
values (classifier 起動率, 阻害要因観測数, agreement rate, 平均 latency) reflect the added
data, or (B) if not landed, update §11.1 to correct the PR range (e.g., PR
`#125`〜#128 or P-1〜P-4) and recompute the aggregate metrics so the narrative and
the §A-2 計測ログ stay consistent; look for the §A-2 計測ログ block and the §11.1
header/summary to make these edits (references: "§A-2 計測ログ", "§11.1", "P-5", "PR
`#125`〜#129", the summary table).

In `@src/cli-finding-classifier/evals/files/eval3-magic-number.diff`:
- Around line 17-20: The guard delay_ms > 30000 is unreachable because delay_ms
is computed as 100 * (1 << attempt) with attempt coming from the small range
0..5 (so max delay 1600); update the code to make the condition reachable by
increasing the attempt upper bound (e.g., use a larger range like 0..10 or
similar) or lower the 30000 threshold to a value reachable with the existing
range; modify the loop that defines attempt (and/or the threshold constant) so
that delay_ms can exceed 30000 — refer to the delay_ms calculation and the for
attempt in 0..5 loop to locate and fix the logic.

In `@src/cli-finding-classifier/tests/lint_screen_evals.rs`:
- Around line 386-409: The report_summary function currently indexes into
latencies_ms without checking for empty, causing panics when
latencies_ms.is_empty(); update report_summary to handle the empty case early
(e.g., if latencies_ms.is_empty() { compute agreement and print latency p50/p95
as "N/A" or 0 and return }) or guard the indexing by using safe access
(latencies_ms.get(...) with a default) and use saturating/subtraction for p95
index computation; modify uses of p50 and p95 accordingly so report_summary, and
variables latencies_ms, p50, p95, decision_matches, and set.agreement_threshold
no longer panic on empty input.
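
The arithmetic behind the eval3-magic-number.diff comment above can be checked with a few lines. This sketch only mirrors the expression quoted in the review, 100 * (1 << attempt); `delay_ms`, `max_delay`, and `first_attempt_exceeding` are hypothetical helper names, not code from the fixture.

```rust
use std::ops::Range;

/// Mirrors the backoff expression quoted in the review: 100 * (1 << attempt).
pub fn delay_ms(attempt: u32) -> u64 {
    100 * (1u64 << attempt)
}

/// Largest delay reachable over a range of attempts; None for an empty range.
pub fn max_delay(attempts: Range<u32>) -> Option<u64> {
    attempts.map(delay_ms).max()
}

/// Smallest attempt whose delay exceeds the limit, searched over 0..32.
pub fn first_attempt_exceeding(limit_ms: u64) -> Option<u32> {
    (0..32).find(|&a| delay_ms(a) > limit_ms)
}
```

Over 0..5 the delays are 100, 200, 400, 800, and 1600 ms, so a guard at 30000 never fires. The first attempt that exceeds it is attempt 9 (51200 ms), which is consistent with the suggested 0..10 range.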


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c18ec927-fc2e-44f1-a0bb-a97c54f52a91

📥 Commits

Reviewing files that changed from the base of the PR and between eb67bde and 1797a7f.

📒 Files selected for processing (12)
  • docs/local-llm-offload-analysis.md
  • src/cli-finding-classifier/evals/files/eval1-unused-import.diff
  • src/cli-finding-classifier/evals/files/eval2-deep-nesting.diff
  • src/cli-finding-classifier/evals/files/eval3-magic-number.diff
  • src/cli-finding-classifier/evals/files/eval4-clean.diff
  • src/cli-finding-classifier/evals/files/eval5-multi-issue.diff
  • src/cli-finding-classifier/evals/files/eval6-existing-lint-overlap.diff
  • src/cli-finding-classifier/evals/lint-screen-evals.json
  • src/cli-finding-classifier/prompts/lint-screen.txt
  • src/cli-finding-classifier/src/lib.rs
  • src/cli-finding-classifier/src/main.rs
  • src/cli-finding-classifier/tests/lint_screen_evals.rs

Comment thread docs/local-llm-offload-analysis.md Outdated
Comment thread src/cli-finding-classifier/evals/files/eval3-magic-number.diff
Comment thread src/cli-finding-classifier/tests/lint_screen_evals.rs
@aloekun aloekun force-pushed the feature/local-llm-phase-a-evals branch from 1797a7f to 979359c Compare May 8, 2026 06:10

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
src/cli-finding-classifier/tests/lint_screen_evals.rs (1)

386-409: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

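report_summary panics when latencies_ms is empty (unresolved since the previous review)

  • Line 388: latencies_ms[latencies_ms.len() / 2] → when empty, len()==0, so the latencies_ms[0] access panics.
  • Line 390: latencies_ms.len() - 1 underflows usize to usize::MAX, and the subsequent index access panics (.min() cannot save it).

Because the #[ignore] test always guarantees 6 evals, there is no practical harm at present. However, this was not addressed when raised in the previous review, so it is repeated here.

🛡️ Suggested defensive guard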
 fn report_summary(set: &EvalSet, decision_matches: u32, mut latencies_ms: Vec<u128>) {
+    if latencies_ms.is_empty() {
+        println!("no evals ran; skipping latency summary");
+        return;
+    }
     latencies_ms.sort_unstable();
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cli-finding-classifier/tests/lint_screen_evals.rs` around lines 386 -
409, The report_summary function can panic when latencies_ms is empty; add a
defensive guard at the top of report_summary to handle latencies_ms.is_empty()
(or len()<1) before sorting/indexing: either return early after printing an
agreement-only summary or compute p50/p95 as "N/A"/0 when empty so you never
index latencies_ms; then only compute p50 and p95 (the current p50 =
latencies_ms[len/2] and p95 using len-1) when latencies_ms.len() > 0, and use
safe min/bounds checks when converting to usize for the p95 index to avoid
underflow. Ensure you modify the report_summary function and its uses
accordingly.
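
One way to make the percentile lookup total over empty input is to return an Option instead of indexing directly. The sketch below is an assumption-laden illustration: `percentile` and `latency_summary` are hypothetical names, the p50 index (len / 2) follows the expression quoted in the comment, and the p95 index formula is assumed because the original is not shown here.

```rust
/// Percentile over an already sorted slice; None when the slice is empty.
/// Index is len * pct / 100, clamped to the last element.
pub fn percentile(sorted: &[u128], pct: u32) -> Option<u128> {
    if sorted.is_empty() {
        return None;
    }
    let last = sorted.len() - 1; // cannot underflow: len >= 1 here
    let idx = (sorted.len() * pct as usize / 100).min(last);
    Some(sorted[idx])
}

/// Formats the latency part of a summary without ever indexing an empty Vec.
pub fn latency_summary(mut latencies_ms: Vec<u128>) -> String {
    latencies_ms.sort_unstable();
    match (percentile(&latencies_ms, 50), percentile(&latencies_ms, 95)) {
        (Some(p50), Some(p95)) => format!("p50={p50}ms p95={p95}ms"),
        _ => "latency: N/A (no evals ran)".to_string(),
    }
}
```

Compared with an early return, this keeps the agreement part of the report printable even when no latency samples exist.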
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/local-llm-offload-history.md`:
- Around line 354-355: The jq filter contains an extra closing parenthesis after
the test(...) call causing a parse error; edit the filter string
'[.classified_findings[] | select(.normalized_issue) | .normalized_issue |
test("[a-zA-Z]{8,}"))]' to remove the stray ')' so the test("[a-zA-Z]{8,}") call
is properly closed (i.e., ensure the expression ends with a single closing
bracket ]), updating the line that builds the jq expression using test(...) /
.normalized_issue.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ee52a6ad-f6cf-4e03-b5ba-19e826880829

📥 Commits

Reviewing files that changed from the base of the PR and between 1797a7f and 979359c.

📒 Files selected for processing (14)
  • docs/local-llm-offload-analysis.md
  • docs/local-llm-offload-history.md
  • pr-monitor-config.toml
  • src/cli-finding-classifier/evals/files/eval1-unused-import.diff
  • src/cli-finding-classifier/evals/files/eval2-deep-nesting.diff
  • src/cli-finding-classifier/evals/files/eval3-magic-number.diff
  • src/cli-finding-classifier/evals/files/eval4-clean.diff
  • src/cli-finding-classifier/evals/files/eval5-multi-issue.diff
  • src/cli-finding-classifier/evals/files/eval6-existing-lint-overlap.diff
  • src/cli-finding-classifier/evals/lint-screen-evals.json
  • src/cli-finding-classifier/prompts/lint-screen.txt
  • src/cli-finding-classifier/src/lib.rs
  • src/cli-finding-classifier/src/main.rs
  • src/cli-finding-classifier/tests/lint_screen_evals.rs
✅ Files skipped from review due to trivial changes (4)
  • pr-monitor-config.toml
  • src/cli-finding-classifier/evals/lint-screen-evals.json
  • src/cli-finding-classifier/evals/files/eval3-magic-number.diff
  • src/cli-finding-classifier/evals/files/eval1-unused-import.diff
🚧 Files skipped from review as they are similar to previous changes (5)
  • src/cli-finding-classifier/evals/files/eval4-clean.diff
  • src/cli-finding-classifier/evals/files/eval6-existing-lint-overlap.diff
  • src/cli-finding-classifier/evals/files/eval5-multi-issue.diff
  • docs/local-llm-offload-analysis.md
  • src/cli-finding-classifier/src/lib.rs

Comment thread docs/local-llm-offload-history.md Outdated
@aloekun aloekun merged commit b686950 into master May 8, 2026
1 check passed
@aloekun aloekun deleted the feature/local-llm-phase-a-evals branch May 8, 2026 07:16