feat(cli-finding-classifier): Phase b' lint-screen evals — N=12 + multi-metrics + prompt v3 (83.3% GO)#131

Merged
aloekun merged 2 commits into master from feature/local-llm-phase-b-prime
May 8, 2026

@aloekun aloekun commented May 8, 2026

Summary

Ran the GO/NO-GO decision for Phase b (docs/local-llm-offload-analysis.md §1). The agreement rate reached 10/12 = 83.3%, which clears GO for starting §8.E. Phase b' consists of:

  • Stronger validation metrics (S1): tests/lint_screen_evals.rs now prints precision / recall / F1 / normalized P/R / the decision confusion matrix
  • Fixture expansion N=6 → N=12 (S2): added 6 fixtures (style-only / large-refactor / multi-import-leak / nesting-boundary / comment-only / test-cfg)
  • Codified fixture design (S3): added the 3-axis design (issue_pattern / expected_screen_decision / verification_purpose) and the # SYNTHETIC FIXTURE standard header convention to ADR-038, and applied both to the 6 new fixtures
  • prompt v3 (S4): canonical rule list / forced multi-issue scan / decision tree / 8 few-shot examples / anti-hallucination preamble
  • Verdict (S6): GO
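
The finding-level metrics listed in S1 can be sketched as follows. This is a minimal illustration, not the code in tests/lint_screen_evals.rs: the `Counts` type and its field names are hypothetical, and the PR's own `AgreementMetrics` may be shaped differently.

```rust
/// Finding-level counts for one eval run.
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Counts {
    /// Findings the LLM reported that the baseline also expects.
    pub tp: u32,
    /// Findings the LLM reported that the baseline does not expect.
    pub fp: u32,
    /// Baseline findings the LLM missed.
    pub fn_: u32,
}

impl Counts {
    /// precision = TP / (TP + FP); `None` when the LLM reported nothing.
    pub fn precision(&self) -> Option<f64> {
        let denom = self.tp + self.fp;
        (denom > 0).then(|| f64::from(self.tp) / f64::from(denom))
    }

    /// recall = TP / (TP + FN); `None` when the baseline expects nothing.
    pub fn recall(&self) -> Option<f64> {
        let denom = self.tp + self.fn_;
        (denom > 0).then(|| f64::from(self.tp) / f64::from(denom))
    }

    /// F1 = 2TP / (2TP + FP + FN), the harmonic mean of precision and recall.
    pub fn f1(&self) -> Option<f64> {
        let denom = 2 * self.tp + self.fp + self.fn_;
        (denom > 0).then(|| f64::from(2 * self.tp) / f64::from(denom))
    }
}
```

For example, `Counts { tp: 15, fp: 4, fn_: 3 }` gives precision 15/19 ≈ 0.789 and recall 15/18 ≈ 0.833. These counts are an assumption chosen because they reproduce the 78.9% / 83.3% figures reported for v3; the PR does not state the raw counts.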

Iteration history

| iteration | N | prompt | agreement | precision / recall | verdict |
|---|---|---|---|---|---|
| v1 (initial Phase b) | 6 | original | 50.0% | n/a / n/a | NO-GO |
| v2 (Phase b' canonical) | 12 | + canonical / decision tree / 4 few-shot examples | 41.7% | 68.2% / 83.3% | NO-GO |
| v3 (anti-hallucination) | 12 | + "default to no findings" preamble + 4 empty-finding examples | 75.0% | 78.9% / 83.3% | conditional GO |
| v3 + baseline fix | 12 | (eval 6 baseline informational → auto_fix) | 83.3% | 78.9% / 83.3% | GO |

v2 → v3 improved agreement by +33pt (41.7% → 75.0%). The LLM can now choose the informational lane (confusion matrix 0/6 → 4/6).

Remaining 2 mismatches (LLM-side limitations)

  • eval 5: human_review → auto_fix (missed the deep-nesting issue inside a multi-issue diff; recall 75%)
  • eval 10: informational → human_review (overreaction at the 4-level nesting boundary)

Prompt iteration will continue with v4 in Phase c (§8.E implementation).

Final confusion matrix (v3 + baseline fix)

                auto_fix  human_review  informational
auto_fix           5            0             0
human_review       1            1             0
informational      0            1             4
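
As a cross-check, the agreement rate can be recomputed from this matrix. The sketch below assumes rows are the expected (baseline) decision and columns are the LLM's decision; that orientation is inferred from the two mismatches listed above and is not stated explicitly. The function and constant names are hypothetical.

```rust
/// Lanes in the same order as the matrix printed above.
const LANES: [&str; 3] = ["auto_fix", "human_review", "informational"];

/// The final matrix from this PR description (v3 + baseline fix).
pub const FINAL_MATRIX: [[u32; 3]; 3] = [[5, 0, 0], [1, 1, 0], [0, 1, 4]];

/// Agreement rate = diagonal / total.
/// Assumption: rows = expected decision, columns = LLM decision.
pub fn agreement_rate(matrix: &[[u32; 3]; 3]) -> Option<f64> {
    let total: u32 = matrix.iter().flatten().sum();
    let diagonal: u32 = (0..3).map(|i| matrix[i][i]).sum();
    (total > 0).then(|| f64::from(diagonal) / f64::from(total))
}

/// Lists every non-zero off-diagonal cell as (expected, actual, count).
pub fn mismatches(matrix: &[[u32; 3]; 3]) -> Vec<(&'static str, &'static str, u32)> {
    let mut out = Vec::new();
    for (i, row) in matrix.iter().enumerate() {
        for (j, &count) in row.iter().enumerate() {
            if i != j && count > 0 {
                out.push((LANES[i], LANES[j], count));
            }
        }
    }
    out
}
```

With `FINAL_MATRIX`, the diagonal sums to 10 out of 12 entries, matching the reported 83.3%, and the two off-diagonal cells correspond to eval 5 and eval 10.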

Test plan

  • cargo test -p cli-finding-classifier: lib 25 + bin 11 + integration 19 pass + 1 ignored
  • cargo test -- --ignored run_lint_screen_against_all_fixtures (real Ollama mistral:7b): agreement 83.3%
  • cargo clippy -p cli-finding-classifier --tests -- -D warnings: clean
  • cargo fmt --check: clean
  • markdownlint: clean

Adopted post-merge-feedback items (PR #130)

  • T1-1 (validation_constant_extraction): not yet addressed (next PR)
  • T2-1 (agreement_logic_refinement): match key changed from (file, rule, line ±2) to (file, rule) (finding_matches)
  • T2-2 (eval_fixture_ci_job), enabling the #[ignore] test as an optional CI step: not yet addressed (at the start of Phase c)
  • T3-1 (doc_restructuring_pattern): not yet addressed (to be codified in docs-governance.md in the next PR)
  • T3-2 (synthetic_fixture_design_doc): ADR-038 + standardized fixture header ✅ done
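
The T2-1 change to the match key can be illustrated with a small sketch. The `Finding` struct and the `finding_matches_with_line` name are hypothetical; only `finding_matches` and the two keys are named in this PR, and its real signature may differ.

```rust
/// Minimal finding record. Hypothetical: field names are assumptions.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Finding {
    pub file: String,
    pub rule: String,
    pub line: u32,
}

/// Earlier key as described: same file, same rule, and line within ±2.
pub fn finding_matches_with_line(expected: &Finding, actual: &Finding) -> bool {
    expected.file == actual.file
        && expected.rule == actual.rule
        && expected.line.abs_diff(actual.line) <= 2
}

/// Relaxed key: same file and same rule; the line number is ignored.
pub fn finding_matches(expected: &Finding, actual: &Finding) -> bool {
    expected.file == actual.file && expected.rule == actual.rule
}
```

Under the relaxed key, a finding with the right file and rule still counts as a match when the LLM reports a line far from the baseline's, which is the line-position invariance the agreement tests check.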

Out of scope

  • T1-1 (validation_constant_extraction): next PR
  • T3-1 (doc_restructuring_pattern): next PR
  • §8.D classify mode prompt v2 (Phase b' revises only the lint-screen mode prompt)
  • Phase c (§8.E lint screen facet implementation): separate PR

Summary by CodeRabbit

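  • Documentation

    • Updated the ADR and the analysis document to clarify the eval fixture design, the Phase b re-evaluation (conditional GO), and the reproduction steps.
  • Tests

    • Strengthened the eval tests with agreement and precision/recall calculation, rule normalization, and aggregate report output.
  • Chores

    • Revised the prompt and eval data to be more conservative and deterministic, and updated many fixtures and expected values.
  • Style

    • Added style-only and comment-only fixtures.
  • New Features

    • Added and tightened session/token handling and authorization-decision behavior around authentication.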

coderabbitai Bot commented May 8, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 277e89e9-2314-4488-9092-dabf470fd06b

📥 Commits

Reviewing files that changed from the base of the PR and between 485e636 and 7313a79.

📒 Files selected for processing (1)
  • src/cli-finding-classifier/evals/files/eval8-large-refactor.diff
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/cli-finding-classifier/evals/files/eval8-large-refactor.diff

Walkthrough

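This PR introduces, as one coherent change, an ADR-defined eval fixture convention, a more deterministic lint-screen prompt, agreement metrics (strict/normalized) with rule normalization, six synthetic fixtures, and a test runner with an aggregate report. The docs update the Phase b verdict to conditional GO (9/12 = 75.0%).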

Changes

Phase b' evaluation pipeline integration

| Layer | File(s) | Summary |
|---|---|---|
| Eval fixture design convention | docs/adr/adr-038-local-llm-finding-classification.md | Adds the standard eval fixture header, the 3 axes (issue_pattern / expected_screen_decision / verification_purpose), and the rule that # lines are excluded from LLM input. |
| Phase b documentation update | docs/local-llm-offload-analysis.md | Updates the Phase b description, changing the final verdict from 10/12 to 9/12 (75.0%) and to conditional GO (limited to the §8.E auto_fix lane). Reorganizes the iteration history and the rerun steps (Ollama/mistral:7b, cargo test). |
| Baseline & prompt determinization | src/cli-finding-classifier/evals/lint-screen-evals.json, src/cli-finding-classifier/prompts/lint-screen.txt | Changes the eval id:6 baseline from informational to auto_fix. Redefines the lint-screen prompt around a conservative default (findings=[]), a canonical rule whitelist, and a decision tree. Tightens the output JSON schema. |
| Synthetic eval fixtures (7–12) | src/cli-finding-classifier/evals/files/eval7-style-only.diff, .../eval8-large-refactor.diff, .../eval9-multi-import-leak.diff, .../eval10-nesting-boundary.diff, .../eval11-comment-only.diff, .../eval12-test-cfg.diff | Adds 6 synthetic fixtures (style-only, large-refactor, import-leak, nested-boundary, comment-only, test-cfg). Each fixture contains the ADR convention header and a unified diff body. |
| Agreement metrics foundation | src/cli-finding-classifier/tests/lint_screen_evals.rs | Strips the # header via read_diff_body, extends AgreementMetrics with strict/normalized TP/FP/FN + precision/recall/F1, normalizes rule synonyms via normalize_rule_name, and adds decision-pair tracking. |
| Fixture loading & distribution check | src/cli-finding-classifier/tests/lint_screen_evals.rs | Verifies that Phase b' has 12 entries. Checks the distribution across screen_decision lanes (informational/human_review/auto_fix). Updates diff validation to check that read_diff_body output starts with diff --git. |
| Agreement test update | src/cli-finding-classifier/tests/lint_screen_evals.rs | Rewrites the tests to verify perfect match, line-position invariance, far-off match, and empty/one-sided cases using strict/normalized counts. |
| Unit tests for rule normalization, confusion matrix, and verdict label | src/cli-finding-classifier/tests/lint_screen_evals.rs | Adds unit tests for the normalize_rule_name synonym mapping, normalized-match recovery, build_confusion_matrix, and verdict_label. |
| End-to-end runner & aggregate report | src/cli-finding-classifier/tests/lint_screen_evals.rs | run_single_eval uses read_diff_body and agreement_metrics and prints the decision pair, strict/normalized TP/FP/FN, precision/recall/F1, and latency. report_summary computes and prints the aggregate metrics, the confusion matrix, and the verdict label. |
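
The header-stripping step described for read_diff_body can be sketched as below. This is an illustration under stated assumptions, not the PR's implementation: `strip_fixture_header` is a hypothetical name, it takes the fixture text instead of a path, and it assumes the # header is one contiguous block at the top of the file.

```rust
/// Returns the part of a fixture that would be sent to the LLM, skipping the
/// leading `#` header lines and any blank lines before the diff starts.
/// Assumption: the header is a contiguous block at the top of the fixture.
pub fn strip_fixture_header(fixture: &str) -> String {
    fixture
        .lines()
        .skip_while(|line| line.starts_with('#') || line.trim().is_empty())
        .collect::<Vec<_>>()
        .join("\n")
}

/// Mirrors the validation described above: the body must begin with `diff --git`.
pub fn looks_like_unified_diff(body: &str) -> bool {
    body.starts_with("diff --git ")
}
```

Stripping only the leading block leaves diff content such as `+#[cfg(test)]` untouched, since body lines in a unified diff start with a space, `+`, `-`, or `@` and never with a bare `#`.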

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title centers on the Phase b' GO/NO-GO evaluation and concretely names the N=12 expansion, the added metrics, prompt v3, and the 83.3% result, accurately reflecting all file changes in raw_summary (ADR-038 additions, eval expansion, prompt/metrics improvements, expanded tests). |
| Docstring Coverage | ✅ Passed | Docstring coverage is 83.33% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/local-llm-offload-analysis.md`:
- Around line 26-29: The opening status sentence "Phase b/c/d は未着手" (Phase b/c/d not started)
contradicts the later Phase b GO/NO-GO heading, which is marked "✅ **GO 達成 (2026-05-08)**" (GO reached).
Update the opening status to reflect that Phase b reached GO (for example,
"Phase b: GO(2026-05-08) / Phase c/d: 未着手"), so that the opening status description and that
heading agree.

In `@src/cli-finding-classifier/evals/files/eval8-large-refactor.diff`:
- Around line 31-42: The private field user of the Session struct is only written and never read,
which triggers a dead_code warning; fix it together with Session::new. Either (A) add a read
accessor pub fn user(&self) -> &str (or &String) to Session so the field is actually referenced,
or (B) if the field is intentionally unused, annotate user with #[allow(dead_code)] to suppress
the warning. The relevant locations are the user field in the Session struct definition and the
impl Session block (Session::new).
- Around line 49-51: is_expired always returns false, so a session never expires. Fix this
without breaking ADR-038's "single issue_pattern" intent: either replace
is_expired(&self, _now: Duration) with a proper TTL-based implementation (for example, add an
expires_at or ttl field to the struct and compare it against now), or add a comment/documentation
stating that the unconditional false is an intentional stub that is not yet implemented.
Target symbols: is_expired, issue_pattern (update the session struct and field names as needed).

In `@src/cli-finding-classifier/prompts/lint-screen.txt`:
- Line 65: Examples A–H are missing unified-diff headers which contradicts the
schema that says the "file" field must be copied from the +++ b/<path> header;
update each Example A–H Input diff to include both --- a/<path> and +++ b/<path>
unified-diff headers so the examples match the real fixtures and the "file"
value is the verbatim +++ b/<path> header, or if you prefer the other approach,
relax the schema to accept either +++ b/<path> or diff --git ... b/<path> when
populating the "file" field (recommended: add the headers to the examples).
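
The two eval8 suggestions above (a read accessor for user, and a TTL-based is_expired) could look roughly like this. It is a hypothetical reconstruction: only Session, user, Session::new, is_expired, and the expires_at/ttl idea are named in the review, and the constructor parameters here are assumptions, not the fixture's actual code.

```rust
use std::time::Duration;

/// Hypothetical reconstruction of the fixture's `Session`.
pub struct Session {
    user: String,
    /// Assumed field: expiry as an offset on the same clock as `now`.
    expires_at: Duration,
}

impl Session {
    /// Assumed signature: creation time plus a time-to-live.
    pub fn new(user: String, created_at: Duration, ttl: Duration) -> Self {
        Self {
            user,
            expires_at: created_at.saturating_add(ttl),
        }
    }

    /// Option (A): a read accessor, so `user` is no longer write-only.
    pub fn user(&self) -> &str {
        &self.user
    }

    /// TTL-based check in place of an unconditional `false`.
    pub fn is_expired(&self, now: Duration) -> bool {
        now >= self.expires_at
    }
}
```

Because eval8 is a synthetic fixture with a single intended issue_pattern, any such change would need to keep the diff free of additional lint findings.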

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a18da21d-b8bc-41a4-9df9-7463e9af4f65

📥 Commits

Reviewing files that changed from the base of the PR and between b686950 and d08faa3.

📒 Files selected for processing (11)
  • docs/adr/adr-038-local-llm-finding-classification.md
  • docs/local-llm-offload-analysis.md
  • src/cli-finding-classifier/evals/files/eval10-nesting-boundary.diff
  • src/cli-finding-classifier/evals/files/eval11-comment-only.diff
  • src/cli-finding-classifier/evals/files/eval12-test-cfg.diff
  • src/cli-finding-classifier/evals/files/eval7-style-only.diff
  • src/cli-finding-classifier/evals/files/eval8-large-refactor.diff
  • src/cli-finding-classifier/evals/files/eval9-multi-import-leak.diff
  • src/cli-finding-classifier/evals/lint-screen-evals.json
  • src/cli-finding-classifier/prompts/lint-screen.txt
  • src/cli-finding-classifier/tests/lint_screen_evals.rs

Comment thread docs/local-llm-offload-analysis.md Outdated
Comment thread src/cli-finding-classifier/evals/files/eval8-large-refactor.diff
Comment thread src/cli-finding-classifier/evals/files/eval8-large-refactor.diff Outdated
Comment thread src/cli-finding-classifier/prompts/lint-screen.txt
…ti-metrics + prompt v3 (agreement 75% conditional GO, temperature=0)
@aloekun aloekun force-pushed the feature/local-llm-phase-b-prime branch from d08faa3 to 485e636 Compare May 8, 2026 11:54

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
docs/local-llm-offload-analysis.md (1)

5-5: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

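Unify the Phase b verdict wording under a single criterion.

The same page mixes "GO 達成" (GO reached), "conditional GO (75.0%)", and "agreement ≥80% 達成済" (already reached), so the gate decision is unclear when work resumes. In particular, Line 52 and Line 85 are inconsistent with the 75.0% on Line 28.

Suggested fix (unify on one of these)
- > **Status**: trial operation (Phase a complete = PR `#130` landed / Phase b complete = GO reached 2026-05-08, PR `#131`; Phase c/d not started).
+ > **Status**: trial operation (Phase a complete = PR `#130` landed / Phase b' is conditional GO (75.0%, 2026-05-08, PR `#131`) / Phase c/d not started).

- These leave room for incremental improvement, but they are out of scope because the precondition for starting Phase c (agreement ≥ 80%) is already met. Real-world observation in Phase d (PR-based dogfood) is needed.
+ These leave room for incremental improvement, but agreement is currently 75.0%, so §8.E is handled within the conditional GO scope (auto_fix lane only). The full GO verdict will be re-evaluated after further improvement.

- - **Dependency**: **agreement rate ≥ 80% reached in Phase b** (the old dependency on the §A-2 dogfood is void, replaced by the Phase a evals)
+ - **Dependency**: **follows the Phase b' verdict criteria** (whether conditional GO operation or a full GO at agreement rate ≥ 80%, keep it consistent with the final verdict at the top of this section)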

Also applies to: 26-29, 52-53, 85-85

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/local-llm-offload-analysis.md` at line 5, The Phase b acceptance
criteria are inconsistent across the document—instances of "GO 達成", "conditional
GO (75.0%)", and "agreement ≥80%" must be unified to a single, explicit
criterion; pick which rule to use (e.g., "conditional GO (75.0%)" or "agreement
≥80%"), then update every occurrence of those phrases (including the header
phrase "試験運用 (Phase a 完了... Phase b 完了 = GO 達成 ...)" and the in-body mentions
currently showing "GO 達成", "conditional GO (75.0%)", and "agreement ≥80%") to
the chosen wording and numeric threshold, and make sure any adjacent explanatory
text or summary sections reflect the same numeric threshold and wording so the
restart/gate decision is unambiguous.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cli-finding-classifier/evals/files/eval8-large-refactor.diff`:
- Around line 2-4: The fixture metadata is inaccurate: update the issue_pattern
and verification_purpose in the diff header to reflect the actual added lines
(mod.rs ~9 + session.rs ~23 + token.rs ~12 ≈ 44 lines). Replace "3 file / 80+ 行の
architectural addition" with something like "3 file / 40+ 行の architectural
addition (約44行)" and change verification_purpose from "大規模 context 内で
magic-number を取りこぼさず拾えるか" to a wording that matches the actual scale (e.g., "中規模
context (約44行) 内で magic-number を取りこぼさず拾えるか"); keep expected_screen_decision and
other fields unchanged. Ensure the textual counts in the header match the sum of
added lines so downstream threshold/compare logic uses correct metadata.

---

Duplicate comments:
In `@docs/local-llm-offload-analysis.md`:
- Line 5: The Phase b acceptance criteria are inconsistent across the
document—instances of "GO 達成", "conditional GO (75.0%)", and "agreement ≥80%"
must be unified to a single, explicit criterion; pick which rule to use (e.g.,
"conditional GO (75.0%)" or "agreement ≥80%"), then update every occurrence of
those phrases (including the header phrase "試験運用 (Phase a 完了... Phase b 完了 = GO
達成 ...)" and the in-body mentions currently showing "GO 達成", "conditional GO
(75.0%)", and "agreement ≥80%") to the chosen wording and numeric threshold, and
make sure any adjacent explanatory text or summary sections reflect the same
numeric threshold and wording so the restart/gate decision is unambiguous.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4a20cb27-e979-4172-9da9-e0eb4094028d

📥 Commits

Reviewing files that changed from the base of the PR and between d08faa3 and 485e636.

📒 Files selected for processing (11)
  • docs/adr/adr-038-local-llm-finding-classification.md
  • docs/local-llm-offload-analysis.md
  • src/cli-finding-classifier/evals/files/eval10-nesting-boundary.diff
  • src/cli-finding-classifier/evals/files/eval11-comment-only.diff
  • src/cli-finding-classifier/evals/files/eval12-test-cfg.diff
  • src/cli-finding-classifier/evals/files/eval7-style-only.diff
  • src/cli-finding-classifier/evals/files/eval8-large-refactor.diff
  • src/cli-finding-classifier/evals/files/eval9-multi-import-leak.diff
  • src/cli-finding-classifier/evals/lint-screen-evals.json
  • src/cli-finding-classifier/prompts/lint-screen.txt
  • src/cli-finding-classifier/tests/lint_screen_evals.rs
✅ Files skipped from review due to trivial changes (3)
  • src/cli-finding-classifier/evals/files/eval7-style-only.diff
  • docs/adr/adr-038-local-llm-finding-classification.md
  • src/cli-finding-classifier/evals/files/eval11-comment-only.diff
🚧 Files skipped from review as they are similar to previous changes (6)
  • src/cli-finding-classifier/evals/files/eval10-nesting-boundary.diff
  • src/cli-finding-classifier/evals/lint-screen-evals.json
  • src/cli-finding-classifier/evals/files/eval12-test-cfg.diff
  • src/cli-finding-classifier/evals/files/eval9-multi-import-leak.diff
  • src/cli-finding-classifier/prompts/lint-screen.txt
  • src/cli-finding-classifier/tests/lint_screen_evals.rs

Comment thread src/cli-finding-classifier/evals/files/eval8-large-refactor.diff Outdated

aloekun commented May 8, 2026

@coderabbitai review

coderabbitai Bot commented May 8, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

aloekun commented May 8, 2026

@coderabbitai review

coderabbitai Bot commented May 8, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@aloekun aloekun merged commit fb33818 into master May 8, 2026
1 check passed
@aloekun aloekun deleted the feature/local-llm-phase-b-prime branch May 8, 2026 13:52