Skip to content

feat: update analyze CLI to use typed analyzer framework#2370

Draft
ryan-arman wants to merge 40 commits intomainfrom
ryan-arman/analyzer-cli-update
Draft

feat: update analyze CLI to use typed analyzer framework#2370
ryan-arman wants to merge 40 commits intomainfrom
ryan-arman/analyzer-cli-update

Conversation

@ryan-arman
Copy link
Copy Markdown
Contributor

Summary

  • Port the typed analyzer framework from the analyzer-refactor branch, replacing the old DataFrame-centric approach with Pydantic-based typed result models
  • Align config naming convention with the API's pattern: AnalyzerConfig uses type/display_name (backward compat for id/instance_id), TestConfigYAML uses display_name (backward compat for title)
  • Add new analyzers (Deduplication, LLM-as-judge with preset criteria), custom metric support, and extended test engine with PERCENTAGE/RANGE test types
  • Preserve full API backward compatibility: create_analyzer_from_config, TestParams, BatchTestEngine all maintain original import paths and behavior

API Compatibility

The API worker uses oumi as a submodule. All existing import patterns are preserved:

  • from oumi.analyze import AnalysisPipeline, create_analyzer_from_config
  • from oumi.analyze.testing import TestParams as OumiTestParams
  • from oumi.analyze.testing.batch_engine import BatchTestEngine
  • analyzer.analyzer_id = display_name pattern continues to work

Config Changes

YAML configs now use type/display_name (matching API's AnalyzerConfigInput):

analyzers:
  - type: length
    display_name: Length
    params:
      tokenizer_name: cl100k_base

Old format with id/instance_id is still accepted for backward compatibility.

Test plan

  • All 120 analyze unit tests pass
  • All 7 CLI analyze tests pass
  • API import compatibility verified
  • End-to-end CLI test (oumi analyze -c config.yaml -o /tmp/test -f json)
  • Backward compat: old-style YAML configs with id/instance_id/title still parse correctly
  • Review against API's analyze_activity.py usage patterns

🤖 Generated with Claude Code

Port the typed analyzer framework from the analyzer-refactor branch to
update the `oumi analyze` CLI. This replaces the old DataFrame-centric
approach with Pydantic-based typed result models and aligns the config
naming convention with the API's pattern (type/display_name).

Key changes:
- Rewrite base.py: replace BaseAnalyzer with _AnalyzerMetaMixin + separate
  base classes (MessageAnalyzer, ConversationAnalyzer, DatasetAnalyzer,
  PreferenceAnalyzer) using generic type introspection
- Add new analyzers: DeduplicationAnalyzer, LLMAnalyzer (with criteria
  presets: Safety, Usefulness, Factuality, Coherence, InstructionFollowing)
- Add custom_metrics.py for user-defined Python metric functions
- Add cli.py with ANALYZER_REGISTRY, create_analyzer_from_config factory,
  and run_typed_analysis pipeline runner
- Rewrite cli/analyze.py to use TypedAnalyzeConfig YAML configs
- Extend testing engine with TestConfig, PERCENTAGE and RANGE test types
- Align config field names with API: AnalyzerConfig uses type/display_name
  (backward compat: id/instance_id still accepted), TestConfigYAML uses
  display_name (backward compat: title still accepted)
- Preserve API backward compatibility: re-export create_analyzer_from_config,
  TestParams, BatchTestEngine with original import paths

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@gitar-bot
Copy link
Copy Markdown

gitar-bot bot commented Apr 14, 2026

Gitar is working

Gitar

Comment thread src/oumi/analyze/custom_metrics.py Fixed
Comment thread src/oumi/analyze/custom_metrics.py Fixed
Comment thread src/oumi/analyze/custom_metrics.py Fixed
Comment thread src/oumi/analyze/custom_metrics.py Fixed
Comment thread src/oumi/analyze/discovery.py Fixed
Comment thread src/oumi/analyze/discovery.py Fixed
Comment thread src/oumi/analyze/testing/engine.py Fixed
Comment thread src/oumi/analyze/analyzers/llm_analyzer.py Fixed
Comment thread src/oumi/analyze/analyzers/llm_analyzer.py Fixed
Comment thread src/oumi/analyze/analyzers/llm_analyzer.py Fixed
ryan-arman and others added 12 commits April 14, 2026 13:19
…ngine

Rewrite tests for the refactored typed analyzer framework covering
DataQualityAnalyzer, TurnStatsAnalyzer, and TestEngine with their
Pydantic-based result models.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rewrite user guide and config reference to document the new typed
analyzer system (TypedAnalyzeConfig, type/display_name fields, tests,
metric paths). Update example config to include all three built-in
analyzers and sample test configurations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove analyzers that aren't needed at this stage: DeduplicationAnalyzer,
LLMAnalyzer (and all convenience subclasses), and llm_criteria. Clean up
all imports, registry entries, config enum, and docs references.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The analyzer-refactor branch had an earlier/simpler version of
LengthAnalyzer that dropped features used by the API (rendered_tokens,
tool_total_tokens, LengthAnalyzerConfig). Restore the main branch
version and update tests to match its behavior.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Same principle as the length.py revert — keep the analyzer that's
already working for the API and only change what's needed for CLI
alignment. Rewrote test_quality_analyzer.py (27 tests) for main's
5-field DataQualityMetrics and updated docs accordingly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Keep the analyzer that's already working for the API. Rewrote
test_turn_stats_analyzer.py (18 tests) for main's 7-field
TurnStatsMetrics and updated docs accordingly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Previous revert missed two fields (has_no_user_message,
has_system_message_not_at_start) that exist on main. Now all three
analyzers are identical to main with zero diff.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Quality and turn_stats analyzers on main have no configurable params.
Removed commented-out references to check_truncation, check_refusals,
include_system_in_counts, etc.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
LengthAnalyzerConfig and Tokenizer were incorrectly dropped from
exports even though length.py still defines them.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Restore --dataset_name, --dataset_path, --sample_count CLI overrides
- Restore --list for config aliases and --log-level flag
- Restore config alias resolution and autocompletion
- Add old AnalyzeConfig (v1) format detection with helpful migration
  message instead of a cryptic crash
- Organize help into rich_help_panel groups (Options, Data, Output)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Removed references to truncation and policy refusal detection which
don't exist in the quality analyzer.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ryan-arman
Copy link
Copy Markdown
Contributor Author

Code review

Found 2 issues:

  1. Custom metrics via CLI are completely broken — all three CLI call sites invoke TypedAnalyzeConfig.from_yaml() without passing allow_custom_code=True. The default (False) raises ValueError when any custom metric has a function field, making the feature non-functional. In the --list-metrics path, the error is silently swallowed by a bare except Exception: pass. Either the CLI should expose a --allow-custom-code flag, or pass True for trusted local invocations.

try:
typed_config = TypedAnalyzeConfig.from_yaml(config)
if typed_config.custom_metrics:

):
typed_config = TypedAnalyzeConfig.from_yaml(config)

# Load config
config = TypedAnalyzeConfig.from_yaml(config_path)

Security gate defined at:

# Security check: reject custom code unless explicitly allowed
if not allow_custom_code:
metrics_with_code = [m.id for m in custom_metrics if m.function.strip()]
if metrics_with_code:
raise ValueError(
f"Configuration contains custom metrics with executable code: "
f"{metrics_with_code}. This is a security risk if loading from "
f"untrusted sources. Set allow_custom_code=True to explicitly "
f"allow code execution, or remove the 'function' fields."
)

  1. Same-class analyzers without explicit analyzer_id silently overwrite each other_get_analyzer_name falls back to analyzer.__class__.__name__ when no analyzer_id is set. Two instances of the same analyzer class (e.g., LengthAnalyzer with different params) get the same key, and the second overwrites the first in self._results. No warning is emitted. Config-level validation for duplicate display_name exists in from_dict(), but the pipeline itself has no guard when constructed directly with a list of analyzers.

def _get_analyzer_name(self, analyzer: AnyAnalyzer) -> str:
"""Get the name for an analyzer.
Uses the class name by default, but can be overridden by
setting an 'analyzer_id' attribute on the analyzer.
Args:
analyzer: The analyzer instance.
Returns:
Name string for the analyzer.
"""
analyzer_id = getattr(analyzer, "analyzer_id", None)
if analyzer_id is not None:
return str(analyzer_id)
return analyzer.__class__.__name__

result = run_func(analyzer)
self._results[name] = result
if is_batch and isinstance(result, list):

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

Custom metrics relied on exec() for arbitrary code execution from YAML
configs, which posed security risks. Removes CustomMetricConfig,
custom_metrics.py, and all references across config, CLI, tests, and docs.
Also fixes docs to reflect actual quality/turn_stats analyzer params
(neither has configurable constructor params on main).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment thread src/oumi/analyze/testing/engine.py Fixed
ryan-arman and others added 5 commits April 15, 2026 10:59
Same-class analyzers without explicit analyzer_id silently overwrote
each other in results. Now raises ValueError at construction time,
matching the config-level validation in TypedAnalyzeConfig.from_dict().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
_extract_metric_values skips None values, which shifted all indices in
the returned list. Tests using enumerate() then reported wrong
sample_indices (pointing to positions in the filtered list, not
original conversation positions). Now returns (original_index, value)
tuples so all three test types report correct affected conversation
indices.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ge test

The percentage test only built failure_reasons for non-matching samples.
When max_percentage failed (too many matches), the affected indices were
matching ones but failure_reasons had no entries for them, leaving test
details empty. Now tracks matching_reasons and non_matching_reasons
separately, mirroring the threshold test pattern.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
_get_result_type only checked cls.__orig_bases__, which breaks for
subclasses of concrete analyzers (e.g. class MyAnalyzer(LengthAnalyzer))
since __orig_bases__ wouldn't contain a generic. Now walks cls.__mro__
to find the generic type argument from any ancestor.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
BatchTestEngine was exported on main but dropped in this branch.
Also restore the core TestType (from oumi.core.configs.params) as
the default TestType export for backward compat with the batch engine
and API layer. The new engine TestType is available as TypedTestType.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment thread src/oumi/analyze/testing/engine.py Fixed
ryan-arman and others added 6 commits April 15, 2026 11:09
Field existed on main but was dropped in this branch. batch_engine.py
still passes it when constructing TestResult; Pydantic v2 silently
ignores the unknown kwarg, causing AttributeError downstream.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
AnalyzerConfig and TestConfigYAML are dataclasses, so typos in YAML
fields raise TypeError ("unexpected keyword argument"). The CLI catches
ValueError with a friendly message but TypeError fell through to a
generic handler. Now wraps both construction sites to convert TypeError
into ValueError listing valid fields.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Dataset-level cached results are raw dicts (from JSON), not BaseModel
instances. isinstance(result, BaseModel) silently skipped them, dropping
dataset-level metrics from the DataFrame. Now also accepts dict for
both to_analysis_dataframe and results_to_dict.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
discovery.py imported ANALYZER_REGISTRY from oumi.analyze.cli, creating
an inverted dependency (utility module depending on CLI module). Now uses
oumi.core.registry.REGISTRY directly, matching the main branch pattern.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Use Mapping/Sequence in dataframe util signatures for type covariance
- Add debug logging to empty except blocks in discovery and engine
- Count non-comparable range test values as affected (consistent with
  threshold/percentage tests)
- Add isinstance narrowing in test_pipeline for pyright satisfaction

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ryan-arman
Copy link
Copy Markdown
Contributor Author

Both findings have been addressed:

1. Custom metrics via CLI are completely broken
Not applicable — custom_metrics.py was deleted in this PR along with all CustomMetricConfig references, the allow_custom_code gate, and the custom metrics YAML section. The entire feature was removed.

2. Same-class analyzers without explicit analyzer_id silently overwrite each other
Fixed — AnalysisPipeline.__init__() now validates for duplicate analyzer names at construction time and raises ValueError with a clear message. Test: test_duplicate_analyzer_names_raises_error in test_pipeline.py. The config-level from_dict() validation for duplicate display_name values was already in place; this adds the equivalent guard for direct pipeline construction.

Move all orchestration logic (run_typed_analysis, save_results,
print_summary, data loading, analyzer creation) from the separate
analyze/cli.py into the existing cli/analyze.py entry point.

- Delete src/oumi/analyze/cli.py and its duplicate ANALYZER_REGISTRY
- Use core registry (REGISTRY.get_sample_analyzer) instead
- Update analyze/__init__.py to re-export from oumi.cli.analyze
- Update tests to use core registry for registration checks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment thread src/oumi/cli/analyze.py Fixed
ryan-arman and others added 13 commits April 15, 2026 14:58
Drop stdlib `import logging` and merge two logger instances into
the project-standard `oumi.utils.logging.logger`.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The example analyze.yaml uses TypedAnalyzeConfig (v2) format with
fields like 'type' and 'display_name' that don't exist in the old
AnalyzeConfig schema. Exclude it from the config parse test that
tries all old config classes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
These test files had pre-existing branch changes to fixtures and test
bodies that were outside the scope of this PR. Reverted to main to
keep the test suite unchanged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…fix bugs

- Remove PERCENTAGE and RANGE test types and their implementations
- Restore _create_error_result helper to reduce inline repetition
- Restore _get_nested_value with dict traversal and CustomMetricResult support
- Fix threshold field using `is not None` instead of `or` (0.0 was falsy)
- Populate all_affected_indices in TestResult construction
- Add __test__ = False to prevent pytest collection warnings
- Remove corresponding percentage/range tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update analyze config, example YAML, and documentation to reflect
that only threshold tests are supported. Convert percentage/range
examples to threshold equivalents.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
TestConfig now accepts both strings and enums for type/severity,
removing the need for a separate YAML parsing class. Strings are
converted to enums in __init__.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tency

Remove the local TestConfig dataclass and TestType enum from engine.py
in favor of TestParams from oumi.core.configs.params.test_params, which
is the canonical test config used throughout the API codebase.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Align with BatchTestEngine behavior — raise instead of silently
returning None when encountering a non-BaseModel, non-dict type
during metric path traversal.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove verbose docstrings, inline narration comments, section dividers,
and bloated module docstrings across all files changed in this branch.
Fix analyze.md programmatic access example, YAML indentation, and add
custom analyzer documentation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ryan-arman added a commit that referenced this pull request Apr 17, 2026
Port the oumi analyze CLI from the legacy DatasetAnalyzer path onto the
typed analyzer framework (TypedAnalyzeConfig, AnalysisPipeline,
TestEngine) that already lives in src/oumi/analyze/ on main.

- Rewrite src/oumi/cli/analyze.py to load TypedAnalyzeConfig from YAML,
  construct analyzers via the core registry, run AnalysisPipeline, and
  optionally execute TestEngine
- Restore --list, --list-metrics, --log-level, --dataset_name,
  --dataset_path, --sample_count, --output, --format flags
- Emit a friendly migration error when a v1 AnalyzeConfig YAML is
  detected (checks for dataset_source, processor_name, etc.)
- Wire analyze as a nested Typer app in cli/main.py so help panels
  render correctly
- Update configs/examples/analyze/analyze.yaml to the v2 schema
- Rewrite docs/user_guides/analyze/{analyze,analyze_config}.md for the
  v2 schema
- Exclude configs/examples/analyze/analyze.yaml from the legacy
  test_parse_configs sweep (uses TypedAnalyzeConfig, not AnalyzeConfig)

First PR in a 4-PR split of #2370.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Remove _item_to_conversation's hardcoded prompt/instruction/response/
context key lists and simplify load_conversations_from_dataset to a
strict Oumi-format loader (Conversation.from_dict per row, warn-and-skip
on failure). Matches the pre-v2 CLI contract of requiring Oumi-shaped
data and stays consistent with the api backend, which uses typed schemas
or explicit column names rather than field guessing.

Update the example YAML and the analyze docs to use placeholder dataset
names and note that HF rows must already be in Oumi format; instruction-
style datasets should be pre-converted to Oumi JSONL.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant