feat: update analyze CLI to use typed analyzer framework#2370
feat: update analyze CLI to use typed analyzer framework#2370ryan-arman wants to merge 40 commits intomainfrom
Conversation
Port the typed analyzer framework from the analyzer-refactor branch to update the `oumi analyze` CLI. This replaces the old DataFrame-centric approach with Pydantic-based typed result models and aligns the config naming convention with the API's pattern (type/display_name). Key changes: - Rewrite base.py: replace BaseAnalyzer with _AnalyzerMetaMixin + separate base classes (MessageAnalyzer, ConversationAnalyzer, DatasetAnalyzer, PreferenceAnalyzer) using generic type introspection - Add new analyzers: DeduplicationAnalyzer, LLMAnalyzer (with criteria presets: Safety, Usefulness, Factuality, Coherence, InstructionFollowing) - Add custom_metrics.py for user-defined Python metric functions - Add cli.py with ANALYZER_REGISTRY, create_analyzer_from_config factory, and run_typed_analysis pipeline runner - Rewrite cli/analyze.py to use TypedAnalyzeConfig YAML configs - Extend testing engine with TestConfig, PERCENTAGE and RANGE test types - Align config field names with API: AnalyzerConfig uses type/display_name (backward compat: id/instance_id still accepted), TestConfigYAML uses display_name (backward compat: title still accepted) - Preserve API backward compatibility: re-export create_analyzer_from_config, TestParams, BatchTestEngine with original import paths Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ngine Rewrite tests for the refactored typed analyzer framework covering DataQualityAnalyzer, TurnStatsAnalyzer, and TestEngine with their Pydantic-based result models. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rewrite user guide and config reference to document the new typed analyzer system (TypedAnalyzeConfig, type/display_name fields, tests, metric paths). Update example config to include all three built-in analyzers and sample test configurations. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove analyzers that aren't needed at this stage: DeduplicationAnalyzer, LLMAnalyzer (and all convenience subclasses), and llm_criteria. Clean up all imports, registry entries, config enum, and docs references. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The analyzer-refactor branch had an earlier/simpler version of LengthAnalyzer that dropped features used by the API (rendered_tokens, tool_total_tokens, LengthAnalyzerConfig). Restore the main branch version and update tests to match its behavior. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Same principle as the length.py revert — keep the analyzer that's already working for the API and only change what's needed for CLI alignment. Rewrote test_quality_analyzer.py (27 tests) for main's 5-field DataQualityMetrics and updated docs accordingly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Keep the analyzer that's already working for the API. Rewrote test_turn_stats_analyzer.py (18 tests) for main's 7-field TurnStatsMetrics and updated docs accordingly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Previous revert missed two fields (has_no_user_message, has_system_message_not_at_start) that exist on main. Now all three analyzers are identical to main with zero diff. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Quality and turn_stats analyzers on main have no configurable params. Removed commented-out references to check_truncation, check_refusals, include_system_in_counts, etc. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
LengthAnalyzerConfig and Tokenizer were incorrectly dropped from exports even though length.py still defines them. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Restore --dataset_name, --dataset_path, --sample_count CLI overrides - Restore --list for config aliases and --log-level flag - Restore config alias resolution and autocompletion - Add old AnalyzeConfig (v1) format detection with helpful migration message instead of a cryptic crash - Organize help into rich_help_panel groups (Options, Data, Output) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Removed references to truncation and policy refusal detection which don't exist in the quality analyzer. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Code reviewFound 2 issues:
Lines 118 to 120 in b71dcb7 Lines 168 to 170 in b71dcb7 Lines 468 to 470 in b71dcb7 Security gate defined at: oumi/src/oumi/analyze/config.py Lines 397 to 406 in b71dcb7
oumi/src/oumi/analyze/pipeline.py Lines 468 to 483 in b71dcb7 oumi/src/oumi/analyze/pipeline.py Lines 361 to 363 in b71dcb7 🤖 Generated with Claude Code - If this code review was useful, please react with 👍. Otherwise, react with 👎. |
Custom metrics relied on exec() for arbitrary code execution from YAML configs, which posed security risks. Removes CustomMetricConfig, custom_metrics.py, and all references across config, CLI, tests, and docs. Also fixes docs to reflect actual quality/turn_stats analyzer params (neither has configurable constructor params on main). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Same-class analyzers without explicit analyzer_id silently overwrote each other in results. Now raises ValueError at construction time, matching the config-level validation in TypedAnalyzeConfig.from_dict(). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
_extract_metric_values skips None values, which shifted all indices in the returned list. Tests using enumerate() then reported wrong sample_indices (pointing to positions in the filtered list, not original conversation positions). Now returns (original_index, value) tuples so all three test types report correct affected conversation indices. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ge test The percentage test only built failure_reasons for non-matching samples. When max_percentage failed (too many matches), the affected indices were matching ones but failure_reasons had no entries for them, leaving test details empty. Now tracks matching_reasons and non_matching_reasons separately, mirroring the threshold test pattern. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
_get_result_type only checked cls.__orig_bases__, which breaks for subclasses of concrete analyzers (e.g. class MyAnalyzer(LengthAnalyzer)) since __orig_bases__ wouldn't contain a generic. Now walks cls.__mro__ to find the generic type argument from any ancestor. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
BatchTestEngine was exported on main but dropped in this branch. Also restore the core TestType (from oumi.core.configs.params) as the default TestType export for backward compat with the batch engine and API layer. The new engine TestType is available as TypedTestType. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Field existed on main but was dropped in this branch. batch_engine.py still passes it when constructing TestResult; Pydantic v2 silently ignores the unknown kwarg, causing AttributeError downstream. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
AnalyzerConfig and TestConfigYAML are dataclasses, so typos in YAML
fields raise TypeError ("unexpected keyword argument"). The CLI catches
ValueError with a friendly message but TypeError fell through to a
generic handler. Now wraps both construction sites to convert TypeError
into ValueError listing valid fields.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Dataset-level cached results are raw dicts (from JSON), not BaseModel instances. isinstance(result, BaseModel) silently skipped them, dropping dataset-level metrics from the DataFrame. Now also accepts dict for both to_analysis_dataframe and results_to_dict. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
discovery.py imported ANALYZER_REGISTRY from oumi.analyze.cli, creating an inverted dependency (utility module depending on CLI module). Now uses oumi.core.registry.REGISTRY directly, matching the main branch pattern. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Use Mapping/Sequence in dataframe util signatures for type covariance - Add debug logging to empty except blocks in discovery and engine - Count non-comparable range test values as affected (consistent with threshold/percentage tests) - Add isinstance narrowing in test_pipeline for pyright satisfaction Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Both findings have been addressed: 1. Custom metrics via CLI are completely broken 2. Same-class analyzers without explicit |
Move all orchestration logic (run_typed_analysis, save_results, print_summary, data loading, analyzer creation) from the separate analyze/cli.py into the existing cli/analyze.py entry point. - Delete src/oumi/analyze/cli.py and its duplicate ANALYZER_REGISTRY - Use core registry (REGISTRY.get_sample_analyzer) instead - Update analyze/__init__.py to re-export from oumi.cli.analyze - Update tests to use core registry for registration checks Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Drop stdlib `import logging` and merge two logger instances into the project-standard `oumi.utils.logging.logger`. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The example analyze.yaml uses TypedAnalyzeConfig (v2) format with fields like 'type' and 'display_name' that don't exist in the old AnalyzeConfig schema. Exclude it from the config parse test that tries all old config classes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
These test files had pre-existing branch changes to fixtures and test bodies that were outside the scope of this PR. Reverted to main to keep the test suite unchanged. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…fix bugs - Remove PERCENTAGE and RANGE test types and their implementations - Restore _create_error_result helper to reduce inline repetition - Restore _get_nested_value with dict traversal and CustomMetricResult support - Fix threshold field using `is not None` instead of `or` (0.0 was falsy) - Populate all_affected_indices in TestResult construction - Add __test__ = False to prevent pytest collection warnings - Remove corresponding percentage/range tests Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update analyze config, example YAML, and documentation to reflect that only threshold tests are supported. Convert percentage/range examples to threshold equivalents. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
TestConfig now accepts both strings and enums for type/severity, removing the need for a separate YAML parsing class. Strings are converted to enums in __init__. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tency Remove the local TestConfig dataclass and TestType enum from engine.py in favor of TestParams from oumi.core.configs.params.test_params, which is the canonical test config used throughout the API codebase. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Align with BatchTestEngine behavior — raise instead of silently returning None when encountering a non-BaseModel, non-dict type during metric path traversal. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove verbose docstrings, inline narration comments, section dividers, and bloated module docstrings across all files changed in this branch. Fix analyze.md programmatic access example, YAML indentation, and add custom analyzer documentation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Port the oumi analyze CLI from the legacy DatasetAnalyzer path onto the
typed analyzer framework (TypedAnalyzeConfig, AnalysisPipeline,
TestEngine) that already lives in src/oumi/analyze/ on main.
- Rewrite src/oumi/cli/analyze.py to load TypedAnalyzeConfig from YAML,
construct analyzers via the core registry, run AnalysisPipeline, and
optionally execute TestEngine
- Restore --list, --list-metrics, --log-level, --dataset_name,
--dataset_path, --sample_count, --output, --format flags
- Emit a friendly migration error when a v1 AnalyzeConfig YAML is
detected (checks for dataset_source, processor_name, etc.)
- Wire analyze as a nested Typer app in cli/main.py so help panels
render correctly
- Update configs/examples/analyze/analyze.yaml to the v2 schema
- Rewrite docs/user_guides/analyze/{analyze,analyze_config}.md for the
v2 schema
- Exclude configs/examples/analyze/analyze.yaml from the legacy
test_parse_configs sweep (uses TypedAnalyzeConfig, not AnalyzeConfig)
First PR in a 4-PR split of #2370.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Remove _item_to_conversation's hardcoded prompt/instruction/response/ context key lists and simplify load_conversations_from_dataset to a strict Oumi-format loader (Conversation.from_dict per row, warn-and-skip on failure). Matches the pre-v2 CLI contract of requiring Oumi-shaped data and stays consistent with the api backend, which uses typed schemas or explicit column names rather than field guessing. Update the example YAML and the analyze docs to use placeholder dataset names and note that HF rows must already be in Oumi format; instruction- style datasets should be pre-converted to Oumi JSONL. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
analyzer-refactorbranch, replacing the old DataFrame-centric approach with Pydantic-based typed result modelsAnalyzerConfigusestype/display_name(backward compat forid/instance_id),TestConfigYAMLusesdisplay_name(backward compat fortitle)create_analyzer_from_config,TestParams,BatchTestEngineall maintain original import paths and behaviorAPI Compatibility
The API worker uses oumi as a submodule. All existing import patterns are preserved:
from oumi.analyze import AnalysisPipeline, create_analyzer_from_configfrom oumi.analyze.testing import TestParams as OumiTestParamsfrom oumi.analyze.testing.batch_engine import BatchTestEngineanalyzer.analyzer_id = display_namepattern continues to workConfig Changes
YAML configs now use
type/display_name(matching API'sAnalyzerConfigInput):Old format with
id/instance_idis still accepted for backward compatibility.Test plan
oumi analyze -c config.yaml -o /tmp/test -f json)id/instance_id/titlestill parse correctlyanalyze_activity.pyusage patterns🤖 Generated with Claude Code