feat(cli-finding-classifier): Phase a lint-screen evals infrastructure#130
feat(cli-finding-classifier): Phase a lint-screen evals infrastructure#130
Conversation
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughウォークスルーPR は lint-screen 分類器の評価駆動型アプローチを実装します:戦略・段階化ドキュメント、6 つの固定サンプル diff と JSON/プロンプト契約、LLM 主導の画面ロジック、モード別 CLI、および合意率測定を含む Phase a 検証テストを追加します。 変更内容Lint-Screen Evaluation Workflow
推定コード レビュー工数🎯 3 (Moderate) | ⏱️ ~25 分 関連の可能性のある PR
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (2)
src/cli-finding-classifier/prompts/lint-screen.txt (1)
47-53: ⚡ Quick win
lineフィールドが「新ファイル上の絶対行番号」を要求 — LLM の行番号精度に注意統一 diff の
@@ヘッダー (+new_start,count) を基点にして+行ごとにオフセットを累積しないと正確な新ファイル行番号は得られない。LLM はこの計算を誤り diff 内の相対行番号や@@の値をそのまま返すことがある。
tests/lint_screen_evals.rsの agreement 計算にlineフィールドが含まれる場合、行番号のズレが実際の検出精度とは無関係に agreement rate を引き下げる。対処案:
- 案 A: agreement 計算の照合キーを
(rule, file)に絞りlineは参考値扱いとする (最小変更)- 案 B: プロンプトを「
@@ヘッダーの+開始行からの 0-origin オフセット」に変更して LLM の計算負担を下げる🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cli-finding-classifier/prompts/lint-screen.txt` around lines 47 - 53, The prompt currently requires "line" as an absolute new-file line number which LLMs often miscompute; either (A) change the agreement logic in tests/lint_screen_evals.rs to ignore "line" for matching (use (rule,file) as the key and treat "line" as advisory) or (B) update src/cli-finding-classifier/prompts/lint-screen.txt to redefine "line" to be a 0-origin offset from the @@ header's + start (so LLMs return the simpler offset). Locate the "line" field in lint-screen.txt and the agreement/matching code in tests/lint_screen_evals.rs and implement one of these two fixes so agreement rates are not spuriously lowered by LLM line-number errors.src/cli-finding-classifier/src/lib.rs (1)
198-199: ⚡ Quick winバリデーション定数を
pubにして重複定義を排除してください
VALID_SCREEN_DECISIONSとVALID_LINT_SEVERITIESが非公開のため、tests/lint_screen_evals.rsで同名(かつVALID_SEVERITIESと命名が異なる)の定数が別途定義されています。この重複により、例えば新しい severity を lib に追加したが test 側を更新し忘れた場合、フィクスチャのスキーマ検証テストは通過しても、実行時に lib のバリデーションで拒否されるという無音の乖離が生じます。♻️ 修正案
-const VALID_SCREEN_DECISIONS: &[&str] = &["auto_fix", "human_review", "informational"]; -const VALID_LINT_SEVERITIES: &[&str] = &["minor", "major", "critical"]; +pub const VALID_SCREEN_DECISIONS: &[&str] = &["auto_fix", "human_review", "informational"]; +pub const VALID_LINT_SEVERITIES: &[&str] = &["minor", "major", "critical"];
tests/lint_screen_evals.rs側では重複定義を削除し、インポートで差し替えます:-use cli_finding_classifier::{LintFinding, LintScreenResult}; +use cli_finding_classifier::{LintFinding, LintScreenResult, VALID_SCREEN_DECISIONS, VALID_LINT_SEVERITIES}; -const VALID_SCREEN_DECISIONS: &[&str] = &["auto_fix", "human_review", "informational"]; -const VALID_SEVERITIES: &[&str] = &["minor", "major", "critical"];また、
lint_screen_evals.rsの参照箇所 (VALID_SEVERITIES) をVALID_LINT_SEVERITIESに統一してください。🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cli-finding-classifier/src/lib.rs` around lines 198 - 199, Make the validation arrays public and remove the duplicate test definitions: change the constants VALID_SCREEN_DECISIONS and VALID_LINT_SEVERITIES in lib.rs to be pub so tests can import them, then delete the local constants in tests/lint_screen_evals.rs, import the pub constants from the crate, and update any test references using VALID_SEVERITIES to use VALID_LINT_SEVERITIES to keep names consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/local-llm-offload-analysis.md`:
- Around line 667-683: The §11.1 text claims PR `#125`〜#129 (P-1〜P-5) and shows a
2/5 起動率, but the §A-2 計測ログ P-5 row is still blank; verify whether P-5 (PR `#129`)
has actually landed and then either (A) if landed, populate the §A-2 log P-5 row
with the correct PR number and findings values and ensure the aggregate table
values (classifier 起動率, 阻害要因観測数, agreement rate, 平均 latency) reflect the added
data, or (B) if not landed, update §11.1 to correct the PR range (e.g., PR
`#125`〜#128 or P-1〜P-4) and recompute the aggregate metrics so the narrative and
the §A-2 計測ログ stay consistent; look for the §A-2 計測ログ block and the §11.1
header/summary to make these edits (references: "§A-2 計測ログ", "§11.1", "P-5", "PR
`#125`〜#129", the summary table).
In `@src/cli-finding-classifier/evals/files/eval3-magic-number.diff`:
- Around line 17-20: The guard delay_ms > 30000 is unreachable because delay_ms
is computed as 100 * (1 << attempt) with attempt coming from the small range
0..5 (so max delay 1600); update the code to make the condition reachable by
increasing the attempt upper bound (e.g., use a larger range like 0..10 or
similar) or lower the 30000 threshold to a value reachable with the existing
range; modify the loop that defines attempt (and/or the threshold constant) so
that delay_ms can exceed 30000 — refer to the delay_ms calculation and the for
attempt in 0..5 loop to locate and fix the logic.
In `@src/cli-finding-classifier/tests/lint_screen_evals.rs`:
- Around line 386-409: The report_summary function currently indexes into
latencies_ms without checking for empty, causing panics when
latencies_ms.is_empty(); update report_summary to handle the empty case early
(e.g., if latencies_ms.is_empty() { compute agreement and print latency p50/p95
as "N/A" or 0 and return }) or guard the indexing by using safe access
(latencies_ms.get(...) with a default) and use saturating/subtraction for p95
index computation; modify uses of p50 and p95 accordingly so report_summary, and
variables latencies_ms, p50, p95, decision_matches, and set.agreement_threshold
no longer panic on empty input.
---
Nitpick comments:
In `@src/cli-finding-classifier/prompts/lint-screen.txt`:
- Around line 47-53: The prompt currently requires "line" as an absolute
new-file line number which LLMs often miscompute; either (A) change the
agreement logic in tests/lint_screen_evals.rs to ignore "line" for matching (use
(rule,file) as the key and treat "line" as advisory) or (B) update
src/cli-finding-classifier/prompts/lint-screen.txt to redefine "line" to be a
0-origin offset from the @@ header's + start (so LLMs return the simpler
offset). Locate the "line" field in lint-screen.txt and the agreement/matching
code in tests/lint_screen_evals.rs and implement one of these two fixes so
agreement rates are not spuriously lowered by LLM line-number errors.
In `@src/cli-finding-classifier/src/lib.rs`:
- Around line 198-199: Make the validation arrays public and remove the
duplicate test definitions: change the constants VALID_SCREEN_DECISIONS and
VALID_LINT_SEVERITIES in lib.rs to be pub so tests can import them, then delete
the local constants in tests/lint_screen_evals.rs, import the pub constants from
the crate, and update any test references using VALID_SEVERITIES to use
VALID_LINT_SEVERITIES to keep names consistent.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: c18ec927-fc2e-44f1-a0bb-a97c54f52a91
📒 Files selected for processing (12)
docs/local-llm-offload-analysis.mdsrc/cli-finding-classifier/evals/files/eval1-unused-import.diffsrc/cli-finding-classifier/evals/files/eval2-deep-nesting.diffsrc/cli-finding-classifier/evals/files/eval3-magic-number.diffsrc/cli-finding-classifier/evals/files/eval4-clean.diffsrc/cli-finding-classifier/evals/files/eval5-multi-issue.diffsrc/cli-finding-classifier/evals/files/eval6-existing-lint-overlap.diffsrc/cli-finding-classifier/evals/lint-screen-evals.jsonsrc/cli-finding-classifier/prompts/lint-screen.txtsrc/cli-finding-classifier/src/lib.rssrc/cli-finding-classifier/src/main.rssrc/cli-finding-classifier/tests/lint_screen_evals.rs
…nt-screen + 6 fixtures + Claude Code baseline + compare integration test)
1797a7f to
979359c
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/cli-finding-classifier/tests/lint_screen_evals.rs (1)
386-409:⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
latencies_msが空の場合にreport_summaryがパニックします(前回レビューから未解消)
- Line 388:
latencies_ms[latencies_ms.len() / 2]→ 空ならlen()==0でlatencies_ms[0]アクセスによりパニック。- Line 390:
latencies_ms.len() - 1→usizeアンダーフローでusize::MAXになり、その後の添字アクセスでパニック(.min()は救えない)。
#[ignore]テストで常に 6 件 eval が保証されているため現在は実害なし。ただし前回レビューで指摘された際に対処されていないため、再掲します。🛡️ 防御的ガードの追加案
fn report_summary(set: &EvalSet, decision_matches: u32, mut latencies_ms: Vec<u128>) { + if latencies_ms.is_empty() { + println!("no evals ran; skipping latency summary"); + return; + } latencies_ms.sort_unstable();🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cli-finding-classifier/tests/lint_screen_evals.rs` around lines 386 - 409, The report_summary function can panic when latencies_ms is empty; add a defensive guard at the top of report_summary to handle latencies_ms.is_empty() (or len()<1) before sorting/indexing: either return early after printing an agreement-only summary or compute p50/p95 as "N/A"/0 when empty so you never index latencies_ms; then only compute p50 and p95 (the current p50 = latencies_ms[len/2] and p95 using len-1) when latencies_ms.len() > 0, and use safe min/bounds checks when converting to usize for the p95 index to avoid underflow. Ensure you modify the report_summary function and its uses accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/local-llm-offload-history.md`:
- Around line 354-355: The jq filter contains an extra closing parenthesis after
the test(...) call causing a parse error; edit the filter string
'[.classified_findings[] | select(.normalized_issue) | .normalized_issue |
test("[a-zA-Z]{8,}"))]' to remove the stray ')' so the test("[a-zA-Z]{8,}") call
is properly closed (i.e., ensure the expression ends with a single closing
bracket ]), updating the line that builds the jq expression using test(...) /
.normalized_issue.
---
Duplicate comments:
In `@src/cli-finding-classifier/tests/lint_screen_evals.rs`:
- Around line 386-409: The report_summary function can panic when latencies_ms
is empty; add a defensive guard at the top of report_summary to handle
latencies_ms.is_empty() (or len()<1) before sorting/indexing: either return
early after printing an agreement-only summary or compute p50/p95 as "N/A"/0
when empty so you never index latencies_ms; then only compute p50 and p95 (the
current p50 = latencies_ms[len/2] and p95 using len-1) when latencies_ms.len() >
0, and use safe min/bounds checks when converting to usize for the p95 index to
avoid underflow. Ensure you modify the report_summary function and its uses
accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ee52a6ad-f6cf-4e03-b5ba-19e826880829
📒 Files selected for processing (14)
docs/local-llm-offload-analysis.mddocs/local-llm-offload-history.mdpr-monitor-config.tomlsrc/cli-finding-classifier/evals/files/eval1-unused-import.diffsrc/cli-finding-classifier/evals/files/eval2-deep-nesting.diffsrc/cli-finding-classifier/evals/files/eval3-magic-number.diffsrc/cli-finding-classifier/evals/files/eval4-clean.diffsrc/cli-finding-classifier/evals/files/eval5-multi-issue.diffsrc/cli-finding-classifier/evals/files/eval6-existing-lint-overlap.diffsrc/cli-finding-classifier/evals/lint-screen-evals.jsonsrc/cli-finding-classifier/prompts/lint-screen.txtsrc/cli-finding-classifier/src/lib.rssrc/cli-finding-classifier/src/main.rssrc/cli-finding-classifier/tests/lint_screen_evals.rs
✅ Files skipped from review due to trivial changes (4)
- pr-monitor-config.toml
- src/cli-finding-classifier/evals/lint-screen-evals.json
- src/cli-finding-classifier/evals/files/eval3-magic-number.diff
- src/cli-finding-classifier/evals/files/eval1-unused-import.diff
🚧 Files skipped from review as they are similar to previous changes (5)
- src/cli-finding-classifier/evals/files/eval4-clean.diff
- src/cli-finding-classifier/evals/files/eval6-existing-lint-overlap.diff
- src/cli-finding-classifier/evals/files/eval5-multi-issue.diff
- docs/local-llm-offload-analysis.md
- src/cli-finding-classifier/src/lib.rs
Summary
docs/local-llm-offload-analysis.md§11.6 で策定した Phase a (evals infrastructure 整備) を land。 §A-2 PR-based dogfood で発覚した classifier 妥当性検証の構造的限界 (3 種の阻害要因 — findings ゼロ / review body 抽出漏れ / rate-limit) を踏まえ、検証手段を 固定 diff fixture + Claude Code baseline + agreement 突合 の evals 形式に切り替えるための土台を整備する。docs(local-llm-offload): §11 retrospective + Phase a-d phasingfeat(cli-finding-classifier): Phase a 実装 (lint-screen mode + fixtures + baseline + integration test)Deliverables
src/cli-finding-classifier/evals/files/src/cli-finding-classifier/evals/lint-screen-evals.jsonsrc/cli-finding-classifier/prompts/lint-screen.txt{ lint_findings, screen_decision }src/cli-finding-classifier/src/{lib,main}.rs--mode lint-screen追加、screen_diff()公開 API、fallback は classify mode と同じhuman_review + fallback_reasonパターン継承src/cli-finding-classifier/tests/lint_screen_evals.rs#[ignore]付き Phase b 用 end-to-end runner 1 件Design notes
--mode lint-screen追加 (新 crate 不要、ADR-038 の crate を再利用)prompts/lint-screen.txt)、include_str! で同梱cargo testで実行可能、CI 統合容易)#[ignore]付き別テストで分離Test plan
cargo test -p cli-finding-classifier— lib 25 + bin 11 + integration 12 = 48 件 pass + 1 ignoredcargo clippy -p cli-finding-classifier --tests -- -D warnings— cleancargo fmt -p cli-finding-classifier --check— cleandocs/local-llm-offload-analysis.md— cleancargo test -p cli-finding-classifier --test lint_screen_evals -- --ignored --nocapture run_lint_screen_against_all_fixturesOut of scope
ollama-lint-screentakt facet 実装)Summary by CodeRabbit
リリースノート
新機能
テスト
ドキュメント