feat: update analyze CLI to use typed analyzer framework by ryan-arman · Pull Request #2370 · oumi-ai/oumi

ryan-arman · 2026-04-14T20:11:18Z

Summary

Port the typed analyzer framework from the analyzer-refactor branch, replacing the old DataFrame-centric approach with Pydantic-based typed result models
Align config naming convention with the API's pattern: AnalyzerConfig uses type/display_name (backward compat for id/instance_id), TestConfigYAML uses display_name (backward compat for title)
Add new analyzers (Deduplication, LLM-as-judge with preset criteria), custom metric support, and extended test engine with PERCENTAGE/RANGE test types
Preserve full API backward compatibility: create_analyzer_from_config, TestParams, BatchTestEngine all maintain original import paths and behavior

API Compatibility

The API worker uses oumi as a submodule. All existing import patterns are preserved:

from oumi.analyze import AnalysisPipeline, create_analyzer_from_config
from oumi.analyze.testing import TestParams as OumiTestParams
from oumi.analyze.testing.batch_engine import BatchTestEngine
analyzer.analyzer_id = display_name pattern continues to work

Config Changes

YAML configs now use type/display_name (matching API's AnalyzerConfigInput):

analyzers:
  - type: length
    display_name: Length
    params:
      tokenizer_name: cl100k_base

Old format with id/instance_id is still accepted for backward compatibility.

Test plan

All 120 analyze unit tests pass
All 7 CLI analyze tests pass
API import compatibility verified
End-to-end CLI test (oumi analyze -c config.yaml -o /tmp/test -f json)
Backward compat: old-style YAML configs with id/instance_id/title still parse correctly
Review against API's analyze_activity.py usage patterns

🤖 Generated with Claude Code

Port the typed analyzer framework from the analyzer-refactor branch to update the `oumi analyze` CLI. This replaces the old DataFrame-centric approach with Pydantic-based typed result models and aligns the config naming convention with the API's pattern (type/display_name). Key changes: - Rewrite base.py: replace BaseAnalyzer with _AnalyzerMetaMixin + separate base classes (MessageAnalyzer, ConversationAnalyzer, DatasetAnalyzer, PreferenceAnalyzer) using generic type introspection - Add new analyzers: DeduplicationAnalyzer, LLMAnalyzer (with criteria presets: Safety, Usefulness, Factuality, Coherence, InstructionFollowing) - Add custom_metrics.py for user-defined Python metric functions - Add cli.py with ANALYZER_REGISTRY, create_analyzer_from_config factory, and run_typed_analysis pipeline runner - Rewrite cli/analyze.py to use TypedAnalyzeConfig YAML configs - Extend testing engine with TestConfig, PERCENTAGE and RANGE test types - Align config field names with API: AnalyzerConfig uses type/display_name (backward compat: id/instance_id still accepted), TestConfigYAML uses display_name (backward compat: title still accepted) - Preserve API backward compatibility: re-export create_analyzer_from_config, TestParams, BatchTestEngine with original import paths Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

gitar-bot · 2026-04-14T20:11:23Z

Gitar is working

_Gitar

…ngine Rewrite tests for the refactored typed analyzer framework covering DataQualityAnalyzer, TurnStatsAnalyzer, and TestEngine with their Pydantic-based result models. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Rewrite user guide and config reference to document the new typed analyzer system (TypedAnalyzeConfig, type/display_name fields, tests, metric paths). Update example config to include all three built-in analyzers and sample test configurations. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Remove analyzers that aren't needed at this stage: DeduplicationAnalyzer, LLMAnalyzer (and all convenience subclasses), and llm_criteria. Clean up all imports, registry entries, config enum, and docs references. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The analyzer-refactor branch had an earlier/simpler version of LengthAnalyzer that dropped features used by the API (rendered_tokens, tool_total_tokens, LengthAnalyzerConfig). Restore the main branch version and update tests to match its behavior. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Same principle as the length.py revert — keep the analyzer that's already working for the API and only change what's needed for CLI alignment. Rewrote test_quality_analyzer.py (27 tests) for main's 5-field DataQualityMetrics and updated docs accordingly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Keep the analyzer that's already working for the API. Rewrote test_turn_stats_analyzer.py (18 tests) for main's 7-field TurnStatsMetrics and updated docs accordingly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Previous revert missed two fields (has_no_user_message, has_system_message_not_at_start) that exist on main. Now all three analyzers are identical to main with zero diff. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Quality and turn_stats analyzers on main have no configurable params. Removed commented-out references to check_truncation, check_refusals, include_system_in_counts, etc. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

LengthAnalyzerConfig and Tokenizer were incorrectly dropped from exports even though length.py still defines them. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Restore --dataset_name, --dataset_path, --sample_count CLI overrides - Restore --list for config aliases and --log-level flag - Restore config alias resolution and autocompletion - Add old AnalyzeConfig (v1) format detection with helpful migration message instead of a cryptic crash - Organize help into rich_help_panel groups (Options, Data, Output) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Removed references to truncation and policy refusal detection which don't exist in the quality analyzer. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

ryan-arman · 2026-04-15T17:30:38Z

Code review

Found 2 issues:

Custom metrics via CLI are completely broken — all three CLI call sites invoke TypedAnalyzeConfig.from_yaml() without passing allow_custom_code=True. The default (False) raises ValueError when any custom metric has a function field, making the feature non-functional. In the --list-metrics path, the error is silently swallowed by a bare except Exception: pass. Either the CLI should expose a --allow-custom-code flag, or pass True for trusted local invocations.

oumi/src/oumi/cli/analyze.py

Lines 118 to 120 in b71dcb7

    
           try: 
        
               typed_config = TypedAnalyzeConfig.from_yaml(config) 
        
               if typed_config.custom_metrics:

oumi/src/oumi/cli/analyze.py

Lines 168 to 170 in b71dcb7

    
           ): 
        
               typed_config = TypedAnalyzeConfig.from_yaml(config)

oumi/src/oumi/analyze/cli.py

Lines 468 to 470 in b71dcb7

    
           # Load config 
        
           config = TypedAnalyzeConfig.from_yaml(config_path)

Security gate defined at:

oumi/src/oumi/analyze/config.py

Lines 397 to 406 in b71dcb7

    
           # Security check: reject custom code unless explicitly allowed 
        
           if not allow_custom_code: 
        
               metrics_with_code = [m.id for m in custom_metrics if m.function.strip()] 
        
               if metrics_with_code: 
        
                   raise ValueError( 
        
                       f"Configuration contains custom metrics with executable code: " 
        
                       f"{metrics_with_code}. This is a security risk if loading from " 
        
                       f"untrusted sources. Set allow_custom_code=True to explicitly " 
        
                       f"allow code execution, or remove the 'function' fields." 
        
                   )

Same-class analyzers without explicit analyzer_id silently overwrite each other — _get_analyzer_name falls back to analyzer.__class__.__name__ when no analyzer_id is set. Two instances of the same analyzer class (e.g., LengthAnalyzer with different params) get the same key, and the second overwrites the first in self._results. No warning is emitted. Config-level validation for duplicate display_name exists in from_dict(), but the pipeline itself has no guard when constructed directly with a list of analyzers.

oumi/src/oumi/analyze/pipeline.py

Lines 468 to 483 in b71dcb7

    
               def _get_analyzer_name(self, analyzer: AnyAnalyzer) -> str: 
        
                   """Get the name for an analyzer. 
        
                   Uses the class name by default, but can be overridden by 
        
                   setting an 'analyzer_id' attribute on the analyzer. 
        
                   Args: 
        
                       analyzer: The analyzer instance. 
        
                   Returns: 
        
                       Name string for the analyzer. 
        
                   """ 
        
                   analyzer_id = getattr(analyzer, "analyzer_id", None) 
        
                   if analyzer_id is not None: 
        
                       return str(analyzer_id) 
        
                   return analyzer.__class__.__name__

oumi/src/oumi/analyze/pipeline.py

Lines 361 to 363 in b71dcb7

    
           result = run_func(analyzer) 
        
           self._results[name] = result 
        
           if is_batch and isinstance(result, list):

🤖 Generated with Claude Code

_{- If this code review was useful, please react with 👍. Otherwise, react with 👎.}

Custom metrics relied on exec() for arbitrary code execution from YAML configs, which posed security risks. Removes CustomMetricConfig, custom_metrics.py, and all references across config, CLI, tests, and docs. Also fixes docs to reflect actual quality/turn_stats analyzer params (neither has configurable constructor params on main). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Same-class analyzers without explicit analyzer_id silently overwrote each other in results. Now raises ValueError at construction time, matching the config-level validation in TypedAnalyzeConfig.from_dict(). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

_extract_metric_values skips None values, which shifted all indices in the returned list. Tests using enumerate() then reported wrong sample_indices (pointing to positions in the filtered list, not original conversation positions). Now returns (original_index, value) tuples so all three test types report correct affected conversation indices. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ge test The percentage test only built failure_reasons for non-matching samples. When max_percentage failed (too many matches), the affected indices were matching ones but failure_reasons had no entries for them, leaving test details empty. Now tracks matching_reasons and non_matching_reasons separately, mirroring the threshold test pattern. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

_get_result_type only checked cls.__orig_bases__, which breaks for subclasses of concrete analyzers (e.g. class MyAnalyzer(LengthAnalyzer)) since __orig_bases__ wouldn't contain a generic. Now walks cls.__mro__ to find the generic type argument from any ancestor. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

BatchTestEngine was exported on main but dropped in this branch. Also restore the core TestType (from oumi.core.configs.params) as the default TestType export for backward compat with the batch engine and API layer. The new engine TestType is available as TypedTestType. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Field existed on main but was dropped in this branch. batch_engine.py still passes it when constructing TestResult; Pydantic v2 silently ignores the unknown kwarg, causing AttributeError downstream. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

AnalyzerConfig and TestConfigYAML are dataclasses, so typos in YAML fields raise TypeError ("unexpected keyword argument"). The CLI catches ValueError with a friendly message but TypeError fell through to a generic handler. Now wraps both construction sites to convert TypeError into ValueError listing valid fields. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Dataset-level cached results are raw dicts (from JSON), not BaseModel instances. isinstance(result, BaseModel) silently skipped them, dropping dataset-level metrics from the DataFrame. Now also accepts dict for both to_analysis_dataframe and results_to_dict. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

discovery.py imported ANALYZER_REGISTRY from oumi.analyze.cli, creating an inverted dependency (utility module depending on CLI module). Now uses oumi.core.registry.REGISTRY directly, matching the main branch pattern. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Use Mapping/Sequence in dataframe util signatures for type covariance - Add debug logging to empty except blocks in discovery and engine - Count non-comparable range test values as affected (consistent with threshold/percentage tests) - Add isinstance narrowing in test_pipeline for pyright satisfaction Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

ryan-arman · 2026-04-15T18:24:25Z

Both findings have been addressed:

1. Custom metrics via CLI are completely broken
Not applicable — custom_metrics.py was deleted in this PR along with all CustomMetricConfig references, the allow_custom_code gate, and the custom metrics YAML section. The entire feature was removed.

2. Same-class analyzers without explicit analyzer_id silently overwrite each other
Fixed — AnalysisPipeline.__init__() now validates for duplicate analyzer names at construction time and raises ValueError with a clear message. Test: test_duplicate_analyzer_names_raises_error in test_pipeline.py. The config-level from_dict() validation for duplicate display_name values was already in place; this adds the equivalent guard for direct pipeline construction.

Move all orchestration logic (run_typed_analysis, save_results, print_summary, data loading, analyzer creation) from the separate analyze/cli.py into the existing cli/analyze.py entry point. - Delete src/oumi/analyze/cli.py and its duplicate ANALYZER_REGISTRY - Use core registry (REGISTRY.get_sample_analyzer) instead - Update analyze/__init__.py to re-export from oumi.cli.analyze - Update tests to use core registry for registration checks Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Drop stdlib `import logging` and merge two logger instances into the project-standard `oumi.utils.logging.logger`. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The example analyze.yaml uses TypedAnalyzeConfig (v2) format with fields like 'type' and 'display_name' that don't exist in the old AnalyzeConfig schema. Exclude it from the config parse test that tries all old config classes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

These test files had pre-existing branch changes to fixtures and test bodies that were outside the scope of this PR. Reverted to main to keep the test suite unchanged. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…fix bugs - Remove PERCENTAGE and RANGE test types and their implementations - Restore _create_error_result helper to reduce inline repetition - Restore _get_nested_value with dict traversal and CustomMetricResult support - Fix threshold field using `is not None` instead of `or` (0.0 was falsy) - Populate all_affected_indices in TestResult construction - Add __test__ = False to prevent pytest collection warnings - Remove corresponding percentage/range tests Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Update analyze config, example YAML, and documentation to reflect that only threshold tests are supported. Convert percentage/range examples to threshold equivalents. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

TestConfig now accepts both strings and enums for type/severity, removing the need for a separate YAML parsing class. Strings are converted to enums in __init__. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…tency Remove the local TestConfig dataclass and TestType enum from engine.py in favor of TestParams from oumi.core.configs.params.test_params, which is the canonical test config used throughout the API codebase. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Align with BatchTestEngine behavior — raise instead of silently returning None when encountering a non-BaseModel, non-dict type during metric path traversal. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Remove verbose docstrings, inline narration comments, section dividers, and bloated module docstrings across all files changed in this branch. Fix analyze.md programmatic access example, YAML indentation, and add custom analyzer documentation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Port the oumi analyze CLI from the legacy DatasetAnalyzer path onto the typed analyzer framework (TypedAnalyzeConfig, AnalysisPipeline, TestEngine) that already lives in src/oumi/analyze/ on main. - Rewrite src/oumi/cli/analyze.py to load TypedAnalyzeConfig from YAML, construct analyzers via the core registry, run AnalysisPipeline, and optionally execute TestEngine - Restore --list, --list-metrics, --log-level, --dataset_name, --dataset_path, --sample_count, --output, --format flags - Emit a friendly migration error when a v1 AnalyzeConfig YAML is detected (checks for dataset_source, processor_name, etc.) - Wire analyze as a nested Typer app in cli/main.py so help panels render correctly - Update configs/examples/analyze/analyze.yaml to the v2 schema - Rewrite docs/user_guides/analyze/{analyze,analyze_config}.md for the v2 schema - Exclude configs/examples/analyze/analyze.yaml from the legacy test_parse_configs sweep (uses TypedAnalyzeConfig, not AnalyzeConfig) First PR in a 4-PR split of #2370. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Remove _item_to_conversation's hardcoded prompt/instruction/response/ context key lists and simplify load_conversations_from_dataset to a strict Oumi-format loader (Conversation.from_dict per row, warn-and-skip on failure). Matches the pre-v2 CLI contract of requiring Oumi-shaped data and stays consistent with the api backend, which uses typed schemas or explicit column names rather than field guessing. Update the example YAML and the analyze docs to use placeholder dataset names and note that HF rows must already be in Oumi format; instruction- style datasets should be pre-converted to Oumi JSONL. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

github-code-quality bot found potential problems Apr 14, 2026

View reviewed changes

ryan-arman and others added 12 commits April 14, 2026 13:19

revert: restore turn_stats.py from main, rewrite tests to match

6f11088

Keep the analyzer that's already working for the API. Rewrote test_turn_stats_analyzer.py (18 tests) for main's 7-field TurnStatsMetrics and updated docs accordingly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: remove stale params from example analyze config

bae5037

Quality and turn_stats analyzers on main have no configurable params. Removed commented-out references to check_truncation, check_refusals, include_system_in_counts, etc. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: restore analyzers __init__.py exports to match main

6db1c4a

LengthAnalyzerConfig and Tokenizer were incorrectly dropped from exports even though length.py still defines them. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: update docs key capabilities to match actual quality checks

a254dce

Removed references to truncation and policy refusal detection which don't exist in the quality analyzer. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

docs: use --config in CLI examples, mention -c shorthand

b71dcb7

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-code-quality bot found potential problems Apr 15, 2026

View reviewed changes

Comment thread src/oumi/analyze/testing/engine.py Fixed

ryan-arman and others added 5 commits April 15, 2026 10:59

github-code-quality bot found potential problems Apr 15, 2026

View reviewed changes

Comment thread src/oumi/analyze/testing/engine.py Fixed

ryan-arman and others added 6 commits April 15, 2026 11:09

Merge branch 'main' into ryan-arman/analyzer-cli-update

998d9fb

github-code-quality bot found potential problems Apr 15, 2026

View reviewed changes

Comment thread src/oumi/cli/analyze.py Fixed

ryan-arman and others added 13 commits April 15, 2026 14:58

fix: unify duplicate logging imports in cli/analyze.py

31d8e5d

Drop stdlib `import logging` and merge two logger instances into the project-standard `oumi.utils.logging.logger`. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

revert: restore analyzer test files to match main

08bdf8a

These test files had pre-existing branch changes to fixtures and test bodies that were outside the scope of this PR. Reverted to main to keep the test suite unchanged. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

style: use --config instead of -c in analyze.yaml usage comment

e4e55ea

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

update

401c816

docs: clarify --output CLI flag vs output_path config field

1a052d2

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

style: remove unnecessary comments in testing __init__.py

a3a8c42

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

refactor: merge TestConfigYAML into TestConfig

076019a

TestConfig now accepts both strings and enums for type/severity, removing the need for a separate YAML parsing class. Strings are converted to enums in __init__. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

This was referenced Apr 17, 2026

feat: migrate analyze CLI to typed analyzer framework #2375

Draft

refactor(analyze): split AnalyzerConfig into type/id/display_name #2376

Draft

fix(analyze): preserve sample indices and fix threshold truthiness #2377

Draft

ryan-arman mentioned this pull request Apr 17, 2026

fix(analyze): handle cached dicts in dataframe conversion #2378

Draft

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: update analyze CLI to use typed analyzer framework#2370

feat: update analyze CLI to use typed analyzer framework#2370
ryan-arman wants to merge 40 commits intomainfrom
ryan-arman/analyzer-cli-update

ryan-arman commented Apr 14, 2026

Uh oh!

gitar-bot bot commented Apr 14, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ryan-arman commented Apr 15, 2026

Uh oh!

Uh oh!

Uh oh!

ryan-arman commented Apr 15, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ryan-arman commented Apr 14, 2026

Summary

API Compatibility

Config Changes

Test plan

Uh oh!

gitar-bot bot commented Apr 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ryan-arman commented Apr 15, 2026

Code review

Uh oh!

Uh oh!

Uh oh!

ryan-arman commented Apr 15, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

gitar-bot bot commented Apr 14, 2026 •

edited

Loading