diff --git a/docs/index.md b/docs/index.md index e2d64a7b41..321e937948 100644 --- a/docs/index.md +++ b/docs/index.md @@ -60,6 +60,7 @@ get_started/tutorials user_guides/train/train user_guides/infer/infer +user_guides/deploy user_guides/evaluate/evaluate user_guides/analyze/analyze user_guides/judge/judge @@ -68,6 +69,7 @@ user_guides/synth user_guides/tune user_guides/quantization user_guides/customization +user_guides/mcp ``` ```{toctree} diff --git a/docs/user_guides/analyze/analyze.md b/docs/user_guides/analyze/analyze.md index 43061c9b8a..a8319b8fd4 100644 --- a/docs/user_guides/analyze/analyze.md +++ b/docs/user_guides/analyze/analyze.md @@ -71,6 +71,153 @@ The built-in `length` analyzer computes text length metrics: Enable token counting by adding `tokenizer_config` to your configuration. See {doc}`analyze_config` for setup details. ::: +### Data Quality Analyzer + +The built-in `quality` analyzer ({py:class}`~oumi.analyze.analyzers.quality.DataQualityAnalyzer`) catches five common data issues without running any model inference. It's meant as a cheap, first-pass sanity check before training or fine-tuning. 
+ +| Field | What it flags | +|------------------------------------|---------------------------------------------------------------------------| +| `has_non_alternating_turns` | Consecutive same-role messages (`user`, `user`, …) in non-system turns | +| `has_no_user_message` | Conversation has no `user` message at all (including empty conversations) | +| `has_system_message_not_at_start` | A `system` message appears anywhere other than position 0 | +| `has_empty_turns` / `empty_turn_count` | Any message whose content is empty or whitespace-only | +| `has_invalid_values` / `invalid_value_patterns` | Strings like `NaN`, `null`, `None`, `undefined` leaked into content | + +```yaml +analyzers: + - id: quality +``` + +Because the output is typed ({py:class}`~oumi.analyze.analyzers.quality.DataQualityMetrics`), quality fields can be referenced by later **tests** using dotted metric paths (see [Testing Framework](#testing-framework)), e.g. `quality.has_no_user_message`. + +### Turn Stats Analyzer + +The built-in `turn_stats` analyzer ({py:class}`~oumi.analyze.analyzers.turn_stats.TurnStatsAnalyzer`) reports conversation shape: `num_turns`, `num_user_turns`, `num_assistant_turns`, `has_system_message`, `first_turn_role`, `last_turn_role`. Useful for finding malformed or single-sided conversations. + +```yaml +analyzers: + - id: turn_stats +``` + +## Typed Analyzer Framework + +All built-in analyzers above (`length`, `quality`, `turn_stats`) are implemented in the **typed analyzer framework** ({py:class}`~oumi.analyze.base.BaseAnalyzer`). Each analyzer declares a pydantic result model, which gives you: + +- **Auto-generated JSON schemas** for result documentation and validation. +- **Typed access** to analyzer output in Python (fields are proper attributes, not dict keys). +- **Metric paths** for the testing framework — `{analyzer_id}.{field_name}`, or `{instance_id}.{field_name}` when you run multiple instances of the same analyzer. 
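Resolving such a metric path is just an attribute lookup on the typed result model. A minimal plain-Python sketch of that resolution (illustrative only; the stand-in dataclass below is not Oumi's `DataQualityMetrics`):

```python
from dataclasses import dataclass


# Stand-in for a typed analyzer result (the real models are pydantic classes).
@dataclass
class QualityMetrics:
    has_no_user_message: bool
    empty_turn_count: int


def resolve_metric_path(results_by_analyzer, path):
    """Resolve an '{analyzer_id}.{field_name}' path to a typed attribute."""
    analyzer_id, field_name = path.split(".", 1)
    return getattr(results_by_analyzer[analyzer_id], field_name)


results = {"quality": QualityMetrics(has_no_user_message=True, empty_turn_count=2)}
print(resolve_metric_path(results, "quality.has_no_user_message"))  # True
```

Because fields are real attributes rather than dict keys, a typo in a metric path fails loudly with `AttributeError` instead of silently returning nothing.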
+ +### Defining a Typed Analyzer + +```python +from pydantic import BaseModel, Field +from oumi.analyze.base import ConversationAnalyzer +from oumi.core.registry import register_sample_analyzer +from oumi.core.types.conversation import Conversation + + +class QuestionMetrics(BaseModel): + num_questions: int = Field(description="Count of '?' characters") + density: float = Field(description="Questions per message") + + +@register_sample_analyzer("questions") +class QuestionAnalyzer(ConversationAnalyzer[QuestionMetrics]): + _result_model = QuestionMetrics + + @classmethod + def get_config_schema(cls) -> dict: + return {"properties": {}} + + def analyze(self, conversation: Conversation) -> QuestionMetrics: + total = sum(m.content.count("?") for m in conversation.messages) + return QuestionMetrics( + num_questions=total, + density=total / max(len(conversation.messages), 1), + ) +``` + +Point the config at your typed analyzer the same way as built-ins: + +```yaml +analyzers: + - id: questions + instance_id: questions # required for typed analyzers +``` + +When you need two configurations of the same analyzer (e.g. two `length` analyzers with different tokenizers), give each one a unique `instance_id`. + +### Custom Metrics (No Code Registration Required) + +For quick one-offs you don't want to package as an analyzer, declare a `custom_metrics` block directly in YAML: + +```yaml +custom_metrics: + - id: word_to_char_ratio + scope: conversation # message | conversation | dataset + description: "Ratio of words to characters" + output_schema: + - name: ratio + type: float + description: "Words divided by characters" + function: | + def compute(conversation): + chars = sum(len(m.content) for m in conversation.messages) + words = sum(len(m.content.split()) for m in conversation.messages) + return {"ratio": words / chars if chars > 0 else 0.0} +``` + +```{warning} +Custom metric `function` strings are compiled and run as arbitrary Python. 
Only load configs from sources you trust. +``` + +## Testing Framework + +The typed framework also ships a **testing** layer that evaluates analyzer output against thresholds and produces a pass/fail summary — useful for CI, regression detection, and "fail the run if more than 5% of conversations are missing a user message". + +### Defining Tests + +```yaml +tests: + - id: max_words + type: threshold + metric: length.total_words # {analyzer_id}.{field_name} + operator: ">" + value: 10000 + max_percentage: 5.0 # fail if >5% of conversations match + + - id: no_missing_user_msg + type: threshold + metric: quality.has_no_user_message + operator: "==" + value: true + max_percentage: 0.0 # fail if any conversation is missing a user +``` + +Each test compares a metric to a `value` using `operator`, then checks whether the flagged fraction exceeds `max_percentage` (or falls below `min_percentage`). + +### Running Tests Incrementally with BatchTestEngine + +For large datasets where full analyzer output won't fit in memory, use {py:class}`~oumi.analyze.testing.batch_engine.BatchTestEngine`. It accumulates only lightweight counters and per-test affected conversation IDs as batches stream through, then returns a `TestSummary` at the end: + +```python +from oumi.analyze.testing.batch_engine import BatchTestEngine + +engine = BatchTestEngine(config.tests) + +for batch_results, batch_conversation_ids in stream_batches(): + engine.process_batch(batch_results, batch_conversation_ids) + +summary = engine.finalize() +print(f"{summary.passed_tests}/{summary.total_tests} passed " + f"({summary.pass_rate}%)") + +# IDs of conversations that caused test failures, per test: +affected = engine.get_affected_conversation_ids() +``` + +Use the standard `TestEngine` (same module) when the full dataset fits in memory; use `BatchTestEngine` when it doesn't.
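The threshold semantics either engine applies are simple to sketch in plain Python (an illustration of the pass/fail logic, not Oumi's implementation):

```python
import operator

# Comparison operators supported by threshold tests (a subset, for illustration).
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
       "<=": operator.le, "==": operator.eq}


def run_threshold_test(metric_values, op, value, max_percentage):
    """Flag every conversation whose metric satisfies `op value`; the test
    fails when the flagged share of the dataset exceeds max_percentage."""
    flagged = [v for v in metric_values if OPS[op](v, value)]
    pct = 100.0 * len(flagged) / len(metric_values) if metric_values else 0.0
    return pct <= max_percentage, pct


# One of three conversations exceeds 10000 words (33.3% > 5%), so the test fails.
passed, pct = run_threshold_test([120, 80, 15000], ">", 10000, max_percentage=5.0)
print(passed)  # False
```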
+ ## Working with Results ### Analysis Summary diff --git a/docs/user_guides/deploy.md b/docs/user_guides/deploy.md new file mode 100644 index 0000000000..74f9a0de12 --- /dev/null +++ b/docs/user_guides/deploy.md @@ -0,0 +1,121 @@ +# Deploying Models + +Oumi provides a top-level `oumi deploy` command for taking a trained or downloaded model and standing it up as a managed inference endpoint on a third-party provider. Today it supports **Fireworks AI** and **Parasail.io**. + +```{admonition} Related +:class: note +- To *launch training* on remote clusters, see {doc}`/user_guides/launch/launch`. +- To *call* a deployed endpoint, see {doc}`/user_guides/infer/inference_engines`. +``` + +## Overview + +The deploy workflow has three stages, each exposed as a sub-command: + +1. **Upload** — push the model (full weights or a LoRA adapter) to the provider. +2. **Create endpoint** — provision hardware and start serving the uploaded model. +3. **Test / use** — smoke-test the endpoint and then call it with any inference engine. + +For the common case, `oumi deploy up` runs all three stages end-to-end from a single YAML config. + +## Prerequisites + +- A provider account and API key exported in your shell: + - Fireworks: `FIREWORKS_API_KEY` + - Parasail: `PARASAIL_API_KEY` +- For Fireworks, the model must exist on your local disk (HuggingFace download or an Oumi training output). 
+ +## Quick Start: End-to-End Deploy + +```bash +oumi deploy up --config configs/examples/deploy/fireworks_deploy.yaml +``` + +The `--config` YAML matches the {py:class}`~oumi.deploy.deploy_config.DeploymentConfig` schema: + +```yaml +# configs/examples/deploy/fireworks_deploy.yaml +model_source: /path/to/my-finetuned-model/ # local directory +provider: fireworks # fireworks | parasail +model_name: my-finetuned-model-v1 # display name on the provider +model_type: full # full | adapter +# base_model: accounts/fireworks/models/llama-v3p1-8b-instruct # required if adapter + +hardware: + accelerator: nvidia_h100_80gb # see `oumi deploy list-hardware` + count: 2 + +autoscaling: + min_replicas: 1 + max_replicas: 4 + +test_prompts: + - "Hello, how are you?" +``` + +Any of `model_source`, `provider`, and `hardware` can be overridden on the CLI, e.g.: + +```bash +oumi deploy up \ + --config fireworks_deploy.yaml \ + --model-path /tmp/llama3-8b \ + --hardware nvidia_a100_80gb +``` + +`oumi deploy up` will upload the model, wait for it to be ready, create an endpoint, optionally run any `test_prompts`, and print the endpoint URL. 
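The CLI-override behaviour follows standard precedence: a flag that was actually passed wins over the corresponding YAML field, while unset flags leave the YAML value alone. A plain-Python sketch of that merge (illustrative only; simplified field names):

```python
def merge_config(yaml_config, cli_overrides):
    """CLI values that were actually passed (not None) win over YAML fields."""
    merged = dict(yaml_config)
    merged.update({k: v for k, v in cli_overrides.items() if v is not None})
    return merged


yaml_config = {"model_source": "/path/to/my-finetuned-model/",
               "provider": "fireworks",
               "hardware": "nvidia_h100_80gb"}
cli_overrides = {"model_source": "/tmp/llama3-8b",
                 "hardware": "nvidia_a100_80gb",
                 "provider": None}  # flag not passed on the CLI

merged = merge_config(yaml_config, cli_overrides)
print(merged["provider"], merged["hardware"])  # fireworks nvidia_a100_80gb
```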
+ +## Sub-Commands + +| Command | What it does | +|---------------------------------|----------------------------------------------------------------------| +| `oumi deploy up` | Full pipeline: upload → create endpoint → test | +| `oumi deploy upload` | Upload a model only | +| `oumi deploy create-endpoint` | Create an endpoint for a previously uploaded model | +| `oumi deploy list` | List all deployments on the provider | +| `oumi deploy list-models` | List uploaded models | +| `oumi deploy list-hardware` | List hardware options available for a provider | +| `oumi deploy status` | Show endpoint state, replica counts, URL | +| `oumi deploy start` / `stop` | Start or stop an existing endpoint (pause to save cost) | +| `oumi deploy delete` | Delete an endpoint | +| `oumi deploy delete-model` | Delete an uploaded model | +| `oumi deploy test` | Send a sample request to an endpoint | + +Add `--help` to any sub-command for the exact flags it accepts, or see {doc}`/cli/commands`. + +## Using a Deployed Endpoint + +Once `oumi deploy up` reports `RUNNING`, point any Oumi inference engine at the returned URL. For Fireworks: + +```python +from oumi.inference import FireworksInferenceEngine +from oumi.core.configs import ModelParams + +engine = FireworksInferenceEngine( + model_params=ModelParams(model_name="my-finetuned-model-v1") +) +``` + +For Parasail: + +```python +from oumi.inference import ParasailInferenceEngine +from oumi.core.configs import ModelParams + +engine = ParasailInferenceEngine( + model_params=ModelParams(model_name="my-finetuned-model-v1") +) +``` + +Both engines are documented in {doc}`/user_guides/infer/inference_engines`. + +## Tips + +- **Cost control.** Use `oumi deploy stop ` to pause an endpoint without deleting it; `start` brings it back online. Set `autoscaling.min_replicas: 0` if the provider supports scale-to-zero. +- **LoRA adapters.** Set `model_type: adapter` and a matching `base_model` to deploy a LoRA adapter on top of a hosted base model. 
This is usually cheaper than a full model. +- **Smoke tests.** `test_prompts` at the bottom of the YAML run automatically after `oumi deploy up` finishes — quick sanity check before sending real traffic. + +## See Also + +- {doc}`/user_guides/infer/inference_engines` — calling the deployed endpoint +- {doc}`/user_guides/launch/launch` — launching training jobs on remote clusters +- {doc}`/cli/commands` — CLI reference diff --git a/docs/user_guides/infer/inference_engines.md b/docs/user_guides/infer/inference_engines.md index 643265e48d..51ddd6ade3 100644 --- a/docs/user_guides/infer/inference_engines.md +++ b/docs/user_guides/infer/inference_engines.md @@ -412,6 +412,17 @@ The Anthropic models available via this API as of late Jan'2025 are listed below | Claude 3.0 Sonnet | claude-3-sonnet-20240229 | | Claude 3.0 Haiku | claude-3-haiku-20240307 | +**Prompt Caching** + +The Anthropic engine enables [prompt caching](https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching) automatically by setting `cache_control: ephemeral` on every request. Anthropic caches content up to the last cacheable block, so repeated prefixes (system prompts, long context, multi-turn conversations) re-use the cache and reduce latency and cost. + +When caching is active, the per-response usage metadata includes two extra fields: + +- `cached_tokens` — tokens served from the prompt cache (populated from Anthropic's `cache_read_input_tokens`). +- `cache_creation_tokens` — tokens written to the cache on this request (populated from `cache_creation_input_tokens`). + +These are exposed on each `Message`'s `usage` metadata alongside `prompt_tokens`, `completion_tokens`, and `total_tokens`. See [Token Usage Tracking](#token-usage-tracking) for accumulating usage across a run. 
+ **Resources** - [Anthropic API Documentation](https://docs.anthropic.com/en/api/getting-started) @@ -555,6 +566,38 @@ engine = TogetherInferenceEngine( The models available via this API can be found at [together.ai](https://www.together.ai/). +**Cache Token Reporting** + +The Together engine extracts `cached_tokens` from Together's usage response (a flat field in the `usage` object) and surfaces it on each `Message`'s `usage` metadata. This is useful when comparing cache hit rates across providers — see [Token Usage Tracking](#token-usage-tracking). + +### Cerebras + +[Cerebras](https://cerebras.ai) offers an extreme-low-latency inference platform backed by wafer-scale hardware. It exposes an OpenAI-compatible chat completions API. + +Set the `CEREBRAS_API_KEY` environment variable (get a key from [cloud.cerebras.ai](https://cloud.cerebras.ai/)). + +**Basic Usage** + +```{testcode} +from oumi.inference import CerebrasInferenceEngine +from oumi.core.configs import ModelParams, RemoteParams + +engine = CerebrasInferenceEngine( + model_params=ModelParams( + model_name="llama3.1-8b" + ) +) +``` + +**Supported Models** + +The models available via this API can be found at [inference-docs.cerebras.ai/models](https://inference-docs.cerebras.ai/models). + +**Resources** + +- [Cerebras Inference Documentation](https://inference-docs.cerebras.ai/) +- [Available Models](https://inference-docs.cerebras.ai/models) + ### DeepSeek [DeepSeek](https://deepseek.com) allows to access the DeepSeek models (Chat, Code, and Reasoning) through the DeepSeek AI Platform. @@ -772,26 +815,93 @@ if status.status.value == "completed": results = engine.get_batch_results(batch_id, conversations) ``` +### Cancelling a Batch Job + +Submitted batches can be cancelled before they complete. This is useful if you submitted by mistake, want to change parameters, or realise partial results are enough. 
+ +```python +batch_info = engine.cancel_batch(batch_id) +print(f"Status: {batch_info.status}") # typically "cancelling" or "cancelled" +``` + +`cancel_batch` returns a `BatchInfo` with the updated status. Cancelling does not delete requests that have already been processed — you can still retrieve partial results (see below). + +### Partial Batch Results + +When a batch fails, is cancelled, or finishes with some errored rows, you can still recover the successfully completed conversations using `get_batch_results_partial`: + +```python +partial = engine.get_batch_results_partial(batch_id, conversations) +print(f"Completed: {len(partial.conversations)}") +print(f"Failed: {len(partial.failures)}") + +for failure in partial.failures: + print(f"Row {failure.index}: {failure.error}") +``` + +This returns a `BatchResult` containing successful conversations and a structured list of failures (with per-row error details), rather than raising on the first error. Combined with the partial-retry support in `RemoteInferenceEngine`, this lets you re-submit only the failed rows instead of re-running the whole batch. + +### Listing Available Models + +Every `InferenceEngine` exposes a `list_models()` method that returns the model IDs usable with that engine: + +```python +from oumi.inference import OpenAIInferenceEngine +from oumi.core.configs import ModelParams + +engine = OpenAIInferenceEngine( + model_params=ModelParams(model_name="gpt-4o-mini") +) + +# Chat-capable models only (default) +print(engine.list_models()) + +# Include non-chat models (e.g. embeddings, moderation) +print(engine.list_models(chat_only=False)) +``` + +For remote engines that expose a `/models` endpoint (OpenAI, Vertex AI, Bedrock, Parasail, Fireworks, and any OpenAI-compatible engine), this queries the provider's API for the live model list. For engines that do not expose a listing API, `list_models()` falls back to returning just the model the engine was initialised with. 
+ +### Token Usage Tracking + +All remote inference engines attach per-message usage metadata to the returned `Conversation`s when the provider reports it. Each final assistant `Message` carries a `usage` dict with: + +- `prompt_tokens`, `completion_tokens`, `total_tokens` — standard OpenAI-style accounting. +- `cached_tokens` — prompt tokens served from a provider-side cache (populated for OpenAI, Anthropic, and Together today). +- `cache_creation_tokens` — tokens *written* to the cache on this request (Anthropic only). + +```python +results = engine.infer(conversations) +for conv in results: + usage = conv.messages[-1].usage or {} + print(usage.get("prompt_tokens"), usage.get("completion_tokens"), + usage.get("cached_tokens")) +``` + +Higher-level components that call inference internally — `BaseJudge`, `AttributeSynthesizer`, and the batch inference flows — accumulate the per-response usage into a run-level total so you can report aggregate cost without re-walking every conversation. See the relevant component docs for their aggregate accessors. 
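If you are not using those components, accumulating the per-message dicts yourself takes only a few lines (a plain-Python sketch, not an Oumi API):

```python
from collections import Counter


def accumulate_usage(per_message_usage):
    """Sum token counters across responses; missing fields count as zero."""
    totals = Counter()
    for usage in per_message_usage:
        for key in ("prompt_tokens", "completion_tokens", "total_tokens",
                    "cached_tokens", "cache_creation_tokens"):
            totals[key] += usage.get(key) or 0
    return dict(totals)


run_totals = accumulate_usage([
    {"prompt_tokens": 900, "completion_tokens": 100, "total_tokens": 1000,
     "cached_tokens": 800},
    {"prompt_tokens": 950, "completion_tokens": 50, "total_tokens": 1000},
])
print(run_totals["prompt_tokens"], run_totals["cached_tokens"])  # 1850 800
```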
+ ### Supported Engines The following table shows which engines support batch inference: -| Engine | Batch Support | Notes | -|--------|---------------|-------| -| OpenAI | ✅ Supported | OpenAI Batch API | -| Parasail | ✅ Supported | OpenAI-compatible Batch API | -| Anthropic | 🔜 Coming soon | Message Batches API | -| Together | 🔜 Coming soon | Together Batch API | -| Fireworks | 🔜 Coming soon | Fireworks Batch API | -| DeepSeek | ❌ Not supported | | -| Gemini | ❌ Not supported | | -| Vertex AI | ❌ Not supported | | -| Bedrock | ❌ Not supported | | -| Lambda | ❌ Not supported | | -| SambaNova | ❌ Not supported | | -| OpenRouter | ❌ Not supported | | -| Remote vLLM | ❌ Not supported | | -| SGLang | ❌ Not supported | | +| Engine | Batch Support | Notes | +|------------|-------------------|------------------------------------| +| OpenAI | ✅ Supported | OpenAI Batch API | +| Parasail | ✅ Supported | OpenAI-compatible Batch API | +| Anthropic | ✅ Supported | Message Batches API | +| Together | ✅ Supported | Together Batch API | +| Fireworks | ✅ Supported | Fireworks Batch API | +| Cerebras | ❌ Not supported | | +| DeepSeek | ❌ Not supported | | +| Gemini | ❌ Not supported | | +| Vertex AI | ❌ Not supported | | +| Bedrock | ❌ Not supported | | +| SambaNova | ❌ Not supported | | +| OpenRouter | ❌ Not supported | | +| Remote vLLM| ❌ Not supported | | +| SGLang | ❌ Not supported | | + +Engines that support batch also support `cancel_batch` and `get_batch_results_partial`. ## See Also diff --git a/docs/user_guides/judge/judge.md b/docs/user_guides/judge/judge.md index 4f6cfa150c..9e00cd1ca8 100644 --- a/docs/user_guides/judge/judge.md +++ b/docs/user_guides/judge/judge.md @@ -126,6 +126,121 @@ for output in outputs: explanation = output.field_values["explanation"] # The correct answer is Paris. ``` +## Rule-Based Judges + +```{admonition} Experimental +:class: warning +Rule-based judges are experimental and subject to change. 
+``` + +Some evaluations don't need an LLM: "does the response contain a phone number?", "does the output avoid the words `error` or `traceback`?", "is the answer an exact match for the expected string?". For these cases Oumi provides {py:class}`~oumi.judges.rule_based_judge.RuleBasedJudge`, which applies a deterministic rule to each input — no inference, no token cost, no LLM variance. + +### Quick Start + +```python +from oumi.judges.rule_based_judge import RuleBasedJudge + +judge = RuleBasedJudge(judge_config="oumi://configs/projects/judges/rule_based/regex_match_phone.yaml") + +outputs = judge.judge([ + {"response": "Call me at 555-1234."}, + {"response": "Send an email."}, +]) + +for out in outputs: + print(out.field_values["judgment"], out.field_scores["judgment"]) +# True 1.0 +# False 0.0 +``` + +### Config Schema + +Rule-based judges reuse {py:class}`~oumi.core.configs.judge_config.JudgeConfig` but drive evaluation from a new `rule_judge_params` block ({py:class}`~oumi.core.configs.params.rule_judge_params.RuleJudgeParams`) instead of calling an LLM. `inference_config` is not required. 
+ +```yaml +judge_params: + prompt_template: "{response}" # still required; placeholders are validated + +rule_judge_params: + rule_type: "regex" # rule registered in the RULE registry + input_fields: ["response"] # fields expected on each input dict + + rule_config: # rule-specific options + pattern: "\\d{3}-\\d{4}" + input_field: "response" + match_mode: "search" # "search" | "match" | "fullmatch" + inverse: false # pass when pattern does NOT match + flags: 0 # optional re.* flag bitmask + + response_format: XML # XML | JSON | RAW + judgment_type: BOOL # BOOL | INT | FLOAT | TEXT | ENUM +``` + +### Built-in Rules + +| Rule | Description | Key `rule_config` options | +|-----------|-----------------------------------------------------------|---------------------------| +| `regex` | Python `re` match against a named input field | `pattern`, `input_field`, `match_mode`, `inverse`, `flags` | + +New rules register themselves via `@register("my_rule", RegistryType.RULE)` on a class that implements {py:class}`~oumi.judges.rules.base_rule.BaseRule` and returns `(judgment: bool, score: float)` from `apply()`. + +### Ready-Made Configs + +| Config | What it checks | +|------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------| +| {gh}`regex_match_phone.yaml ` | Response contains an `XXX-XXXX` phone number | +| {gh}`regex_no_error_keywords.yaml ` | Response does NOT contain `error`, `fail`, `exception`, etc. | + +### CLI Usage + +```bash +oumi judge dataset \ + -c oumi://configs/projects/judges/rule_based/regex_match_phone.yaml \ + --input data/dataset_examples/judge_input.jsonl +``` + +Rule-based judges are run through the same `oumi judge dataset` command as LLM judges — the CLI dispatches to `RuleBasedJudge` automatically when `rule_judge_params` is present in the config. 
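The rule contract itself is small: `apply()` returns `(judgment, score)`. A standalone plain-Python sketch of the regex rule's semantics (illustrative; the real implementation is a `BaseRule` subclass resolved through the RULE registry):

```python
import re


def apply_regex_rule(text, pattern, match_mode="search", inverse=False, flags=0):
    """Return (judgment, score) in the rule-based judge style: True/1.0 on pass."""
    matcher = getattr(re, match_mode)  # re.search / re.match / re.fullmatch
    matched = matcher(pattern, text, flags) is not None
    judgment = matched != inverse  # inverse=True passes when there is NO match
    return judgment, 1.0 if judgment else 0.0


print(apply_regex_rule("Call me at 555-1234.", r"\d{3}-\d{4}"))        # (True, 1.0)
print(apply_regex_rule("Send an email.", r"\d{3}-\d{4}"))              # (False, 0.0)
print(apply_regex_rule("all good", r"error|traceback", inverse=True))  # (True, 1.0)
```

Because the rule is deterministic, the same input always yields the same judgment, which makes rule-based judges safe to use as hard gates in CI.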
+ +## Batch Judging + +For providers that support batch inference (OpenAI, Anthropic, Together, Fireworks, Parasail — see {doc}`inference_engines <../infer/inference_engines>`), `BaseJudge` can submit, poll, and collect judgments asynchronously at reduced cost. + +```python +from oumi.judges.simple_judge import SimpleJudge + +judge = SimpleJudge("oumi://configs/projects/judges/generic/truthfulness.yaml") + +inputs = [{"request": "...", "response": "..."}, ...] + +# Submit as a single batch +batch_id, conversations = judge.judge_batch_submit(inputs) + +# ... later, possibly in a different process ... +# Poll the engine directly if you need a status update +status = judge.inference_engine.get_batch_status(batch_id) + +# Collect when done +outputs = judge.judge_batch_result(batch_id, conversations) # raises on any failure + +# Or tolerate per-row failures: +result = judge.judge_batch_result_partial(batch_id, conversations) +print(f"Succeeded: {len(result.successful)}, failed: {len(result.failed_indices)}") +``` + +`judge_batch_submit` returns the provider batch ID and the `Conversation`s used to build it — you must pass both back to `judge_batch_result(_partial)` so that inputs and outputs can be re-aligned. Rule-based judges don't call inference, so batch judging does not apply to them. + +## Token Usage Tracking + +Both `SimpleJudge` and `RuleBasedJudge` inherit from `BaseJudge`, which accumulates per-request token usage across every call to `judge()` / `judge_batch_result()`. After a run you can read: + +```python +print(judge.total_input_tokens) # sum of prompt_tokens across requests +print(judge.total_output_tokens) # sum of completion_tokens +print(judge.total_cached_tokens) # prompt tokens served from provider cache +``` + +Usage is recorded whether the request went through `infer()` (online) or `infer_batch()` (batch), so the totals are directly comparable across modes. Rule-based judges make no LLM calls and leave these counters at zero. 
+ ## Next Steps - Explore our {doc}`Built-In Judges ` for out-of-the-box evaluation criteria diff --git a/docs/user_guides/launch/launch.md b/docs/user_guides/launch/launch.md index 2104dc8718..051e14bffe 100644 --- a/docs/user_guides/launch/launch.md +++ b/docs/user_guides/launch/launch.md @@ -253,6 +253,39 @@ run: | ```` ::: +:::{tab-item} Nebius +````{dropdown} sample-nebius-job.yaml +```yaml +name: sample-nebius-job + +resources: + cloud: nebius + accelerators: "A100" + # Nebius currently supports on-demand GPU VMs. Spot is not supported. + use_spot: false + disk_size: 500 # Disk size in GBs + +num_nodes: 1 # Set to a larger number for multi-node training. + +working_dir: . + +envs: + OUMI_RUN_NAME: sample.nebius.job + +setup: | + set -e + pip install uv && uv pip install --system 'oumi[gpu]' + +# NOTE: Update this section with your training command. +run: | + set -e # Exit if any command failed. + oumi train -c ./path/to/your/config +``` +```` + +Nebius support is provided through SkyPilot. Enable it the same way as other providers: run `sky check` and follow the instructions for the "Nebius" row (see the [SkyPilot Nebius docs](https://docs.skypilot.co/en/latest/reference/cloud-setup/cloud-permissions/nebius.html)). +::: + :::{tab-item} Lambda ````{dropdown} sample-lambda-job.yaml ```yaml diff --git a/docs/user_guides/mcp.md b/docs/user_guides/mcp.md new file mode 100644 index 0000000000..93e97f3a5d --- /dev/null +++ b/docs/user_guides/mcp.md @@ -0,0 +1,90 @@ +# MCP Server + +```{admonition} Experimental +:class: warning +The Oumi MCP server is under active development (Phase 1). Tools, resources, and prompts may change as we iterate on the integration with MCP-capable assistants. +``` + +Oumi ships an [MCP (Model Context Protocol)](https://modelcontextprotocol.io/) server that lets AI assistants — e.g. Claude Desktop, Claude Code, Cursor — discover Oumi configs, run training/eval/inference jobs, and read workflow guidance without leaving the chat interface. 
It's installed as a separate extra and launched as a standalone process. + +## Installation + +```bash +pip install "oumi[mcp]" +``` + +This pulls in `fastmcp`, the `mcp` package, and `httpx`. Once installed, a new console script is available: + +```bash +oumi-mcp # starts the MCP server on stdio +python -m oumi.mcp # equivalent +``` + +## Connecting from an MCP Client + +Most MCP-capable clients expect a JSON entry describing how to launch the server. Point them at the `oumi-mcp` script: + +```json +{ + "mcpServers": { + "oumi": { + "command": "oumi-mcp" + } + } +} +``` + +Exact placement depends on the client — Claude Desktop reads `~/Library/Application Support/Claude/claude_desktop_config.json` on macOS; other clients use similar conventions. Refer to your client's MCP docs for the exact path. + +## What's Exposed + +The server surfaces three kinds of MCP primitives: + +**Tools** (assistant-callable functions) + +| Tool | Purpose | +|----------------------|--------------------------------------------------------------| +| `search_configs` | Fuzzy-search the ~500 Oumi YAML configs by path + content | +| `get_config` | Fetch full details (path, model, dataset, raw YAML) for one config | +| `launch_job` | Launch an Oumi job locally or on a cloud provider | +| `poll_status` | Poll a running job's status | +| `stop_cluster` / `down_cluster` | Manage SkyPilot clusters | +| `cancel_job` | Cancel a running job | +| `fetch_logs` | Retrieve logs for a job | +| `list_running_jobs` / `list_completed_jobs` | Inventory of tracked jobs | + +**Resources** (workflow guidance strings the assistant can read) + +- `guidance://mle-workflow` — overall ML engineering workflow +- `guidance://mle-train`, `mle-synth`, `mle-analyze`, `mle-eval`, `mle-infer` — per-command guidance +- `guidance://cloud-launch` — cloud job anatomy and setup patterns +- `guidance://post-training` — post-training steps (download weights, eval, teardown) + +**Prompts** (pre-built prompt templates for common 
tasks). See the source under `oumi.mcp.prompts` for the current list. + +## Path Rules (Important) + +Because the server executes real commands against the user's machine or cloud account, it's strict about paths: + +- Every path-sensitive tool requires `client_cwd` — the user's project root. +- Config paths may be absolute or relative to `client_cwd`. +- **Local jobs**: the subprocess runs from `client_cwd`; paths inside the YAML resolve against that directory. +- **Cloud jobs**: `client_cwd` becomes `working_dir` on the remote VM. Use repo-relative paths in the YAML — never local-machine absolute paths. + +Assistants should always call the `get_started()` tool first (returned by the server) to retrieve the up-to-date tool catalog and workflow before doing anything else. + +## Debugging + +`oumi-mcp` speaks stdio MCP by default and logs to stderr. To inspect traffic, run it from a terminal and set `OUMI_LOG_LEVEL=DEBUG`: + +```bash +OUMI_LOG_LEVEL=DEBUG oumi-mcp +``` + +Then configure the client to spawn it with the same environment. + +## See Also + +- [Model Context Protocol spec](https://modelcontextprotocol.io/) +- {doc}`/cli/commands` — what the underlying `oumi` CLI can do +- {doc}`/user_guides/launch/launch` — details on cloud job launching diff --git a/docs/user_guides/synth.md b/docs/user_guides/synth.md index 97437fe20f..52461bb493 100644 --- a/docs/user_guides/synth.md +++ b/docs/user_guides/synth.md @@ -121,6 +121,53 @@ input_documents: - path: "textbook.pdf" ``` +**Supported dataset formats** (`input_data`): JSONL, JSON, CSV, TSV, Parquet, and **XLSX**. For XLSX files, every sheet is concatenated into a single dataset, so you can keep related tabs in one workbook. Globs are supported: + +```yaml +input_data: + - path: "data/**/*.xlsx" +``` + +**Supported document formats** (`input_documents`): `.pdf`, `.txt`, `.md`, `.html`, and **`.docx`**. DOCX files are parsed paragraph-by-paragraph. 
+ +```{note} +XLSX / DOCX parsing requires the synthesis extras: `pip install "oumi[synthesis]"`. +``` + +### Few-Shot Sampling From Sources + +When you want each synthesised sample to see *multiple* randomly-drawn items from a source (examples, datasets, or documents), use `num_shots`. This turns the source into a dynamic few-shot pool instead of round-robin enumeration. + +```yaml +input_examples: + - id: few_shot_examples + num_shots: 3 # draw 3 examples per synthesis sample + examples: + - task_type: "summarization" + example_input: "..." + - task_type: "translation" + example_input: "..." + # ... + +generated_attributes: + - id: instruction + instruction_messages: + - role: USER + content: | + Example 1: {few_shot_examples[0].example_input} + Example 2: {few_shot_examples[1].example_input} + Example 3: {few_shot_examples[2].example_input} + Now produce a new, different example. +``` + +Rules: + +- `num_shots: None` or `1` → the source behaves as before (round-robin), reference fields as `{id.field}`. +- `num_shots > 1` → bracket notation `{id[i].field}` is required, and `id` must be set. +- Works uniformly across `input_examples`, `input_data`, and `input_documents`. + +A runnable example lives at {gh}`configs/examples/synthesis/dynamic_few_shot_synth.yaml`.
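The `num_shots` semantics can also be illustrated in plain Python (a sketch only; the naive string replacement below is an assumption for illustration, not Oumi's placeholder engine):

```python
import random


def draw_shots(pool, source_id, num_shots, seed=None):
    """Draw num_shots items and expose their fields as '{id[i].field}' keys."""
    rng = random.Random(seed)
    placeholder_values = {}
    for i, item in enumerate(rng.sample(pool, num_shots)):
        for field, value in item.items():
            placeholder_values[f"{source_id}[{i}].{field}"] = value
    return placeholder_values


pool = [{"example_input": "Summarize this article."},
        {"example_input": "Translate this to French."},
        {"example_input": "Write a haiku about rain."}]

values = draw_shots(pool, "few_shot_examples", num_shots=3, seed=0)
prompt = "Example 1: {few_shot_examples[0].example_input}"
for placeholder, value in values.items():
    prompt = prompt.replace("{" + placeholder + "}", value)
print(prompt)
```

Each synthesis sample triggers a fresh draw, so the few-shot examples vary across the generated dataset instead of repeating in a fixed order.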
+ ### Creating Conversations Build multi-turn dialogues with fixed structure using transformed attributes: @@ -543,6 +590,38 @@ data: dataset_path: "synthetic_qa_dataset.jsonl" ``` +## Batch Inference + +When your synthesis provider supports batch inference (OpenAI, Anthropic, Together, Fireworks, Parasail — see {doc}`/user_guides/infer/inference_engines`), you can submit all prompts for a single attribute as a batch job rather than calling the API online: + +```python +from oumi.core.synthesis.attribute_synthesizer import AttributeSynthesizer + +synth = AttributeSynthesizer(config) + +# Submit a batch job for one generated attribute +batch_id = synth.synthesize_batch(samples, generated_attribute) + +# Later, retrieve results +results = synth.get_batch_results(batch_id, samples, generated_attribute) +# Or tolerate per-row failures: +partial = synth.get_batch_results_partial(batch_id, samples, generated_attribute) +``` + +Batches are typically 50% cheaper than online inference at the cost of a 24-hour completion window. Attributes are batched one at a time (not across attributes), so chained `generated_attributes` still run sequentially. + +## Token Usage Tracking + +`AttributeSynthesizer` accumulates token usage across every online and batch call: + +```python +print(synth.total_input_tokens) # prompt_tokens across all calls +print(synth.total_output_tokens) # completion_tokens +print(synth.total_cached_tokens) # prompt tokens served from provider cache +``` + +Use these counters for cost reporting across an entire synthesis run. See also [Token Usage Tracking](/user_guides/infer/inference_engines.md#token-usage-tracking) on the inference engine side. + ## Best Practices 1. 
**Start Small**: Begin with a small `num_samples` to test your configuration diff --git a/docs/user_guides/train/training_methods.md b/docs/user_guides/train/training_methods.md index 4c76a969b5..5d85ce90fb 100644 --- a/docs/user_guides/train/training_methods.md +++ b/docs/user_guides/train/training_methods.md @@ -350,6 +350,34 @@ verl requires paths to Parquet files for the training and validation data. Oumi Instead of training a separate reward model which estimates the reward value of a completion, it is common to use reward functions instead. Both the trl and verl frameworks have specific interfaces required for the reward functions used. These are documented in the [trl documentation](https://huggingface.co/docs/trl/main/en/grpo_trainer#using-a-custom-reward-function) and [verl documentation](https://verl.readthedocs.io/en/latest/preparation/reward_function.html) respectively. +#### Per-Function kwargs + +Reward functions that take configuration (e.g. a rubric judge panel path, a strictness flag) can be wired via `training.reward_function_kwargs`. The kwargs are a dict keyed by **reward function name**, with each value being that function's kwargs dict: + +```yaml +training: + trainer_type: "TRL_GRPO" # or VERL_GRPO + + reward_functions: + - rubric_reward + - gsm8k + + reward_function_kwargs: + rubric_reward: + judge_panel_path: "configs/projects/judges/rubric_judge_panel.yaml" + gsm8k: + strict: true +``` + +Rules: + +- Keys in `reward_function_kwargs` must also appear in `reward_functions`; extra keys raise a validation error. +- Omit the key (or use `{}`) for functions with no configuration. +- Configured kwargs take precedence over per-sample kwargs passed by the trainer at call time. +- Only `TRL_GRPO` and `VERL_GRPO` support `reward_function_kwargs` today — setting it with any other trainer raises a validation error. 
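To make the wiring concrete, here is a sketch of a trl-style reward function that could sit behind the `gsm8k` entry above. The `strict` flag and the scoring heuristic are illustrative assumptions, not Oumi's implementation; see the trl documentation for the exact reward-function contract:

```python
import re

def gsm8k_reward(completions, strict=False, **kwargs):
    """Reward 1.0 when a completion contains a numeric answer, else 0.0.

    `strict` (hypothetical, supplied via `reward_function_kwargs`) additionally
    requires a GSM8K-style `#### <number>` final-answer marker at the end.
    Per-sample data passed by the trainer at call time lands in **kwargs.
    """
    pattern = r"####\s*-?\d+(\.\d+)?\s*$" if strict else r"-?\d+(\.\d+)?"
    return [1.0 if re.search(pattern, c) else 0.0 for c in completions]

rewards = gsm8k_reward(["Thus the total is #### 42", "no numeric answer"], strict=True)
```

Because configured kwargs take precedence over per-sample kwargs, a `strict` value set in the YAML wins over anything the trainer passes through at call time.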
+ +This is particularly useful with rubric-based reward functions (see {gh}`RaR datasets `) where a single function is reused across configs with different judge panels or scoring rules. + ### Configuration #### TRL_GRPO