feat: migrate analyze CLI to typed analyzer framework #2375

Draft
ryan-arman wants to merge 2 commits into main from ryan-arman/analyze-cli-v2

@ryan-arman
Contributor

Description

Migrates the oumi analyze CLI from the legacy DatasetAnalyzer path onto the typed analyzer framework (TypedAnalyzeConfig + AnalysisPipeline + TestEngine) that already lives in src/oumi/analyze/ on main. This is the first PR of a 4-PR split of #2370; it keeps the change focused on the CLI so that the framework refactors, v2 naming alignment, and bug fixes can ship and be reviewed independently.

What changed

  • Rewrite src/oumi/cli/analyze.py to load TypedAnalyzeConfig from YAML, build analyzers via REGISTRY.get_sample_analyzer, run AnalysisPipeline, and optionally execute TestEngine.
  • Restore --list, --list-metrics, --log-level, --dataset_name, --dataset_path, --sample_count, --output, --format flags with rich help panels.
  • Detect legacy AnalyzeConfig (v1) YAMLs (presence of dataset_source, processor_name, is_multimodal, etc.) and emit a friendly migration error instead of a cryptic crash.
  • Wrap analyze as a nested Typer app in src/oumi/cli/main.py so subcommands and help panels render.
  • Update configs/examples/analyze/analyze.yaml to the v2 schema.
  • Rewrite docs/user_guides/analyze/{analyze,analyze_config}.md to document the v2 schema.
  • Exclude the v2 example YAML from the legacy-config sweep in tests/unit/core/configs/test_parse_configs.py.
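
The legacy-config detection described above can be illustrated with a minimal sketch. This is not the code in src/oumi/cli/analyze.py: the function names, the exception class, and the error wording are hypothetical, and only the three keys named in the description are checked (the real list may be longer, per the "etc.").

```python
from typing import Any

# Keys the PR description names as markers of a legacy (v1) AnalyzeConfig.
# Assumption: the real check may cover more keys than these three.
LEGACY_V1_KEYS = frozenset({"dataset_source", "processor_name", "is_multimodal"})


class LegacyConfigError(ValueError):
    """Hypothetical error raised when a YAML mapping looks like a v1 config."""


def find_legacy_keys(raw_config: dict[str, Any]) -> list[str]:
    """Return the v1-only top-level keys present in the mapping, sorted."""
    return sorted(LEGACY_V1_KEYS.intersection(raw_config))


def ensure_not_legacy(raw_config: dict[str, Any]) -> None:
    """Raise a readable migration error instead of failing later in parsing."""
    found = find_legacy_keys(raw_config)
    if found:
        raise LegacyConfigError(
            "This looks like a legacy (v1) analyze config; found v1-only keys: "
            + ", ".join(found)
            + ". Please migrate it to the v2 schema."
        )
```

Running a check like this on the raw YAML mapping before constructing the typed config is what lets the CLI report a migration hint rather than an unrelated schema error.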

Out of scope (deferred to follow-up PRs)

  • type/display_name rename in AnalyzerConfig → PR 2.
  • testing/engine.py refactor (local TestConfig → core TestParams) + bug fixes (None-index shift, MRO walk, all_affected_indices restore) → PR 3.
  • analyze/base.py, analyze/pipeline.py dup-name validation, analyze/discovery.py helpers, analyze/utils/dataframe.py raw-dict handling → PR 4.

Verification

  • pytest tests/unit/cli/test_cli_analyze.py — 7/7 pass.
  • pytest tests/unit/core/configs/test_parse_configs.py — 984/984 pass.
  • oumi analyze --config configs/examples/analyze/analyze.yaml --output /tmp/smoke --format json runs end-to-end and produces analysis.json, test_results.json, and summary.json.

Related issues

Linear Issue: OPE-1868
Fixes OPE-1868

Before submitting

  • This PR only changes documentation. (You can ignore the following checks in that case)
  • Did you read the contributor guideline Pull Request guidelines?
  • Did you link the issue(s) related to this PR in the section above?
  • Did you add / update tests where needed?

Reviewers

At least one review from a member of `oumi-ai/oumi-staff` is required.

Port the oumi analyze CLI from the legacy DatasetAnalyzer path onto the
typed analyzer framework (TypedAnalyzeConfig, AnalysisPipeline,
TestEngine) that already lives in src/oumi/analyze/ on main.

- Rewrite src/oumi/cli/analyze.py to load TypedAnalyzeConfig from YAML,
  construct analyzers via the core registry, run AnalysisPipeline, and
  optionally execute TestEngine
- Restore --list, --list-metrics, --log-level, --dataset_name,
  --dataset_path, --sample_count, --output, --format flags
- Emit a friendly migration error when a v1 AnalyzeConfig YAML is
  detected (checks for dataset_source, processor_name, etc.)
- Wire analyze as a nested Typer app in cli/main.py so help panels
  render correctly
- Update configs/examples/analyze/analyze.yaml to the v2 schema
- Rewrite docs/user_guides/analyze/{analyze,analyze_config}.md for the
  v2 schema
- Exclude configs/examples/analyze/analyze.yaml from the legacy
  test_parse_configs sweep (uses TypedAnalyzeConfig, not AnalyzeConfig)

First PR in a 4-PR split of #2370.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@gitar-bot

gitar-bot bot commented Apr 17, 2026

Gitar is working

Remove _item_to_conversation's hardcoded prompt/instruction/response/
context key lists and simplify load_conversations_from_dataset to a
strict Oumi-format loader (Conversation.from_dict per row, warn-and-skip
on failure). Matches the pre-v2 CLI contract of requiring Oumi-shaped
data and stays consistent with the api backend, which uses typed schemas
or explicit column names rather than field guessing.

Update the example YAML and the analyze docs to use placeholder dataset
names and note that HF rows must already be in Oumi format; instruction-
style datasets should be pre-converted to Oumi JSONL.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
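
The warn-and-skip behaviour described in the commit message above can be sketched roughly as follows. This is an illustration, not the actual load_conversations_from_dataset: `load_strict` and `parse_row` are hypothetical names, with `parse_row` standing in for a strict per-row parser such as Conversation.from_dict, whose exact signature is not shown in this PR.

```python
import logging
from typing import Any, Callable, Iterable, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def load_strict(
    rows: Iterable[dict[str, Any]],
    parse_row: Callable[[dict[str, Any]], T],
) -> list[T]:
    """Parse each row with a strict parser; log a warning and skip failures.

    No field guessing is attempted: a row either parses as-is or is dropped.
    """
    parsed: list[T] = []
    for index, row in enumerate(rows):
        try:
            parsed.append(parse_row(row))
        except Exception as exc:  # broad on purpose: any parse failure skips the row
            logger.warning("Skipping row %d: %s", index, exc)
    return parsed
```

With this shape, an instruction-style row (for example one holding only prompt and response fields) is skipped with a warning rather than being silently reinterpreted, which is why such datasets need to be converted to Oumi format beforehand.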