diff --git a/docs/index.md b/docs/index.md index e2d64a7b41..321e937948 100644 --- a/docs/index.md +++ b/docs/index.md @@ -60,6 +60,7 @@ get_started/tutorials user_guides/train/train user_guides/infer/infer +user_guides/deploy user_guides/evaluate/evaluate user_guides/analyze/analyze user_guides/judge/judge @@ -68,6 +69,7 @@ user_guides/synth user_guides/tune user_guides/quantization user_guides/customization +user_guides/mcp ``` ```{toctree} diff --git a/docs/user_guides/analyze/analyze.md b/docs/user_guides/analyze/analyze.md index 43061c9b8a..a8319b8fd4 100644 --- a/docs/user_guides/analyze/analyze.md +++ b/docs/user_guides/analyze/analyze.md @@ -71,6 +71,153 @@ The built-in `length` analyzer computes text length metrics: Enable token counting by adding `tokenizer_config` to your configuration. See {doc}`analyze_config` for setup details. ::: +### Data Quality Analyzer + +The built-in `quality` analyzer ({py:class}`~oumi.analyze.analyzers.quality.DataQualityAnalyzer`) catches five common data issues without running any model inference. It's meant as a cheap, first-pass sanity check before training or fine-tuning. 
+ +| Field | What it flags | +|------------------------------------|---------------------------------------------------------------------------| +| `has_non_alternating_turns` | Consecutive same-role messages (`user`, `user`, …) in non-system turns | +| `has_no_user_message` | Conversation has no `user` message at all (including empty conversations) | +| `has_system_message_not_at_start` | A `system` message appears anywhere other than position 0 | +| `has_empty_turns` / `empty_turn_count` | Any message whose content is empty or whitespace-only | +| `has_invalid_values` / `invalid_value_patterns` | Strings like `NaN`, `null`, `None`, `undefined` leaked into content | + +```yaml +analyzers: + - id: quality +``` + +Because the output is typed ({py:class}`~oumi.analyze.analyzers.quality.DataQualityMetrics`), quality fields can be referenced by later **tests** using dotted metric paths (see [Testing Framework](#testing-framework)), e.g. `quality.has_no_user_message`. + +### Turn Stats Analyzer + +The built-in `turn_stats` analyzer ({py:class}`~oumi.analyze.analyzers.turn_stats.TurnStatsAnalyzer`) reports conversation shape: `num_turns`, `num_user_turns`, `num_assistant_turns`, `has_system_message`, `first_turn_role`, `last_turn_role`. Useful for finding malformed or single-sided conversations. + +```yaml +analyzers: + - id: turn_stats +``` + +## Typed Analyzer Framework + +All built-in analyzers above (`length`, `quality`, `turn_stats`) are implemented in the **typed analyzer framework** ({py:class}`~oumi.analyze.base.BaseAnalyzer`). Each analyzer declares a pydantic result model, which gives you: + +- **Auto-generated JSON schemas** for result documentation and validation. +- **Typed access** to analyzer output in Python (fields are proper attributes, not dict keys). +- **Metric paths** for the testing framework — `{analyzer_id}.{field_name}`, or `{instance_id}.{field_name}` when you run multiple instances of the same analyzer. 
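Resolving such a metric path is just an attribute lookup on the typed result model. A minimal plain-Python sketch of that resolution (illustrative only; the stand-in dataclass below is not Oumi's `DataQualityMetrics`):

```python
from dataclasses import dataclass


# Stand-in for a typed analyzer result (the real models are pydantic classes).
@dataclass
class QualityMetrics:
    has_no_user_message: bool
    empty_turn_count: int


def resolve_metric_path(results_by_analyzer, path):
    """Resolve an '{analyzer_id}.{field_name}' path to a typed attribute."""
    analyzer_id, field_name = path.split(".", 1)
    return getattr(results_by_analyzer[analyzer_id], field_name)


results = {"quality": QualityMetrics(has_no_user_message=True, empty_turn_count=2)}
print(resolve_metric_path(results, "quality.has_no_user_message"))  # True
```

Because fields are real attributes rather than dict keys, a typo in a metric path fails loudly with `AttributeError` instead of silently returning nothing.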
+ +### Defining a Typed Analyzer + +```python +from pydantic import BaseModel, Field +from oumi.analyze.base import ConversationAnalyzer +from oumi.core.registry import register_sample_analyzer +from oumi.core.types.conversation import Conversation + + +class QuestionMetrics(BaseModel): + num_questions: int = Field(description="Count of '?' characters") + density: float = Field(description="Questions per message") + + +@register_sample_analyzer("questions") +class QuestionAnalyzer(ConversationAnalyzer[QuestionMetrics]): + _result_model = QuestionMetrics + + @classmethod + def get_config_schema(cls) -> dict: + return {"properties": {}} + + def analyze(self, conversation: Conversation) -> QuestionMetrics: + total = sum(m.content.count("?") for m in conversation.messages) + return QuestionMetrics( + num_questions=total, + density=total / max(len(conversation.messages), 1), + ) +``` + +Point the config at your typed analyzer the same way as built-ins: + +```yaml +analyzers: + - id: questions + instance_id: questions # required for typed analyzers +``` + +When you need two configurations of the same analyzer (e.g. two `length` analyzers with different tokenizers), give each one a unique `instance_id`. + +### Custom Metrics (No Code Registration Required) + +For quick one-offs you don't want to package as an analyzer, declare a `custom_metrics` block directly in YAML: + +```yaml +custom_metrics: + - id: word_to_char_ratio + scope: conversation # message | conversation | dataset + description: "Ratio of words to characters" + output_schema: + - name: ratio + type: float + description: "Words divided by characters" + function: | + def compute(conversation): + chars = sum(len(m.content) for m in conversation.messages) + words = sum(len(m.content.split()) for m in conversation.messages) + return {"ratio": words / chars if chars > 0 else 0.0} +``` + +```{warning} +Custom metric `function` strings are compiled and run as arbitrary Python. 
Only load configs from sources you trust. +``` + +## Testing Framework + +The typed framework also ships a **testing** layer that evaluates analyzer output against thresholds and produces a pass/fail summary — useful for CI, regression detection, and "fail the run if more than 5% of conversations are missing a user message". + +### Defining Tests + +```yaml +tests: + - id: max_words + type: threshold + metric: length.total_words # {analyzer_id}.{field_name} + operator: ">" + value: 10000 + max_percentage: 5.0 # fail if >5% of conversations match + + - id: no_missing_user_msg + type: threshold + metric: quality.has_no_user_message + operator: "==" + value: true + max_percentage: 0.0 # fail if any conversation is missing a user +``` + +Each test compares a metric to a `value` using `operator`, then checks whether the flagged fraction exceeds `max_percentage` (or falls below `min_percentage`). + +### Running Tests Incrementally with BatchTestEngine + +For large datasets where full analyzer output won't fit in memory, use {py:class}`~oumi.analyze.testing.batch_engine.BatchTestEngine`. It accumulates only lightweight counters and per-test affected conversation IDs as batches stream through, then returns a `TestSummary` at the end: + +```python +from oumi.analyze.testing.batch_engine import BatchTestEngine + +engine = BatchTestEngine(config.tests) + +for batch_results, batch_conversation_ids in stream_batches(): + engine.process_batch(batch_results, batch_conversation_ids) + +summary = engine.finalize() +print(f"{summary.passed_tests}/{summary.total_tests} passed " + f"({summary.pass_rate}%)") + +# IDs of conversations that caused test failures, per test: +affected = engine.get_affected_conversation_ids() +``` + +Use the standard `TestEngine` (same module) when the full dataset fits in memory; use `BatchTestEngine` when it doesn't.
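The threshold semantics either engine applies are simple to sketch in plain Python (an illustration of the pass/fail logic, not Oumi's implementation):

```python
import operator

# Comparison operators supported by threshold tests (a subset, for illustration).
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
       "<=": operator.le, "==": operator.eq}


def run_threshold_test(metric_values, op, value, max_percentage):
    """Flag every conversation whose metric satisfies `op value`; the test
    fails when the flagged share of the dataset exceeds max_percentage."""
    flagged = [v for v in metric_values if OPS[op](v, value)]
    pct = 100.0 * len(flagged) / len(metric_values) if metric_values else 0.0
    return pct <= max_percentage, pct


# One of three conversations exceeds 10000 words (33.3% > 5%), so the test fails.
passed, pct = run_threshold_test([120, 80, 15000], ">", 10000, max_percentage=5.0)
print(passed)  # False
```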
+ ## Working with Results ### Analysis Summary diff --git a/docs/user_guides/deploy.md b/docs/user_guides/deploy.md new file mode 100644 index 0000000000..74f9a0de12 --- /dev/null +++ b/docs/user_guides/deploy.md @@ -0,0 +1,121 @@ +# Deploying Models + +Oumi provides a top-level `oumi deploy` command for taking a trained or downloaded model and standing it up as a managed inference endpoint on a third-party provider. Today it supports **Fireworks AI** and **Parasail.io**. + +```{admonition} Related +:class: note +- To *launch training* on remote clusters, see {doc}`/user_guides/launch/launch`. +- To *call* a deployed endpoint, see {doc}`/user_guides/infer/inference_engines`. +``` + +## Overview + +The deploy workflow has three stages, each exposed as a sub-command: + +1. **Upload** — push the model (full weights or a LoRA adapter) to the provider. +2. **Create endpoint** — provision hardware and start serving the uploaded model. +3. **Test / use** — smoke-test the endpoint and then call it with any inference engine. + +For the common case, `oumi deploy up` runs all three stages end-to-end from a single YAML config. + +## Prerequisites + +- A provider account and API key exported in your shell: + - Fireworks: `FIREWORKS_API_KEY` + - Parasail: `PARASAIL_API_KEY` +- For Fireworks, the model must exist on your local disk (HuggingFace download or an Oumi training output). 
+ +## Quick Start: End-to-End Deploy + +```bash +oumi deploy up --config configs/examples/deploy/fireworks_deploy.yaml +``` + +The `--config` YAML matches the {py:class}`~oumi.deploy.deploy_config.DeploymentConfig` schema: + +```yaml +# configs/examples/deploy/fireworks_deploy.yaml +model_source: /path/to/my-finetuned-model/ # local directory +provider: fireworks # fireworks | parasail +model_name: my-finetuned-model-v1 # display name on the provider +model_type: full # full | adapter +# base_model: accounts/fireworks/models/llama-v3p1-8b-instruct # required if adapter + +hardware: + accelerator: nvidia_h100_80gb # see `oumi deploy list-hardware` + count: 2 + +autoscaling: + min_replicas: 1 + max_replicas: 4 + +test_prompts: + - "Hello, how are you?" +``` + +Any of `model_source`, `provider`, and `hardware` can be overridden on the CLI, e.g.: + +```bash +oumi deploy up \ + --config fireworks_deploy.yaml \ + --model-path /tmp/llama3-8b \ + --hardware nvidia_a100_80gb +``` + +`oumi deploy up` will upload the model, wait for it to be ready, create an endpoint, optionally run any `test_prompts`, and print the endpoint URL. 
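The CLI-override behaviour follows standard precedence: a flag that was actually passed wins over the corresponding YAML field, while unset flags leave the YAML value alone. A plain-Python sketch of that merge (illustrative only; simplified field names):

```python
def merge_config(yaml_config, cli_overrides):
    """CLI values that were actually passed (not None) win over YAML fields."""
    merged = dict(yaml_config)
    merged.update({k: v for k, v in cli_overrides.items() if v is not None})
    return merged


yaml_config = {"model_source": "/path/to/my-finetuned-model/",
               "provider": "fireworks",
               "hardware": "nvidia_h100_80gb"}
cli_overrides = {"model_source": "/tmp/llama3-8b",
                 "hardware": "nvidia_a100_80gb",
                 "provider": None}  # flag not passed on the CLI

merged = merge_config(yaml_config, cli_overrides)
print(merged["provider"], merged["hardware"])  # fireworks nvidia_a100_80gb
```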
+ +## Sub-Commands + +| Command | What it does | +|---------------------------------|----------------------------------------------------------------------| +| `oumi deploy up` | Full pipeline: upload → create endpoint → test | +| `oumi deploy upload` | Upload a model only | +| `oumi deploy create-endpoint` | Create an endpoint for a previously uploaded model | +| `oumi deploy list` | List all deployments on the provider | +| `oumi deploy list-models` | List uploaded models | +| `oumi deploy list-hardware` | List hardware options available for a provider | +| `oumi deploy status` | Show endpoint state, replica counts, URL | +| `oumi deploy start` / `stop` | Start or stop an existing endpoint (pause to save cost) | +| `oumi deploy delete` | Delete an endpoint | +| `oumi deploy delete-model` | Delete an uploaded model | +| `oumi deploy test` | Send a sample request to an endpoint | + +Add `--help` to any sub-command for the exact flags it accepts, or see {doc}`/cli/commands`. + +## Using a Deployed Endpoint + +Once `oumi deploy up` reports `RUNNING`, point any Oumi inference engine at the returned URL. For Fireworks: + +```python +from oumi.inference import FireworksInferenceEngine +from oumi.core.configs import ModelParams + +engine = FireworksInferenceEngine( + model_params=ModelParams(model_name="my-finetuned-model-v1") +) +``` + +For Parasail: + +```python +from oumi.inference import ParasailInferenceEngine +from oumi.core.configs import ModelParams + +engine = ParasailInferenceEngine( + model_params=ModelParams(model_name="my-finetuned-model-v1") +) +``` + +Both engines are documented in {doc}`/user_guides/infer/inference_engines`. + +## Tips + +- **Cost control.** Use `oumi deploy stop ` to pause an endpoint without deleting it; `start` brings it back online. Set `autoscaling.min_replicas: 0` if the provider supports scale-to-zero. +- **LoRA adapters.** Set `model_type: adapter` and a matching `base_model` to deploy a LoRA adapter on top of a hosted base model. 
This is usually cheaper than a full model. +- **Smoke tests.** `test_prompts` at the bottom of the YAML run automatically after `oumi deploy up` finishes — quick sanity check before sending real traffic. + +## See Also + +- {doc}`/user_guides/infer/inference_engines` — calling the deployed endpoint +- {doc}`/user_guides/launch/launch` — launching training jobs on remote clusters +- {doc}`/cli/commands` — CLI reference diff --git a/docs/user_guides/infer/inference_engines.md b/docs/user_guides/infer/inference_engines.md index 643265e48d..51ddd6ade3 100644 --- a/docs/user_guides/infer/inference_engines.md +++ b/docs/user_guides/infer/inference_engines.md @@ -412,6 +412,17 @@ The Anthropic models available via this API as of late Jan'2025 are listed below | Claude 3.0 Sonnet | claude-3-sonnet-20240229 | | Claude 3.0 Haiku | claude-3-haiku-20240307 | +**Prompt Caching** + +The Anthropic engine enables [prompt caching](https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching) automatically by setting `cache_control: ephemeral` on every request. Anthropic caches content up to the last cacheable block, so repeated prefixes (system prompts, long context, multi-turn conversations) re-use the cache and reduce latency and cost. + +When caching is active, the per-response usage metadata includes two extra fields: + +- `cached_tokens` — tokens served from the prompt cache (populated from Anthropic's `cache_read_input_tokens`). +- `cache_creation_tokens` — tokens written to the cache on this request (populated from `cache_creation_input_tokens`). + +These are exposed on each `Message`'s `usage` metadata alongside `prompt_tokens`, `completion_tokens`, and `total_tokens`. See [Token Usage Tracking](#token-usage-tracking) for accumulating usage across a run. 
+ **Resources** - [Anthropic API Documentation](https://docs.anthropic.com/en/api/getting-started) @@ -555,6 +566,38 @@ engine = TogetherInferenceEngine( The models available via this API can be found at [together.ai](https://www.together.ai/). +**Cache Token Reporting** + +The Together engine extracts `cached_tokens` from Together's usage response (a flat field in the `usage` object) and surfaces it on each `Message`'s `usage` metadata. This is useful when comparing cache hit rates across providers — see [Token Usage Tracking](#token-usage-tracking). + +### Cerebras + +[Cerebras](https://cerebras.ai) offers an extreme-low-latency inference platform backed by wafer-scale hardware. It exposes an OpenAI-compatible chat completions API. + +Set the `CEREBRAS_API_KEY` environment variable (get a key from [cloud.cerebras.ai](https://cloud.cerebras.ai/)). + +**Basic Usage** + +```{testcode} +from oumi.inference import CerebrasInferenceEngine +from oumi.core.configs import ModelParams, RemoteParams + +engine = CerebrasInferenceEngine( + model_params=ModelParams( + model_name="llama3.1-8b" + ) +) +``` + +**Supported Models** + +The models available via this API can be found at [inference-docs.cerebras.ai/models](https://inference-docs.cerebras.ai/models). + +**Resources** + +- [Cerebras Inference Documentation](https://inference-docs.cerebras.ai/) +- [Available Models](https://inference-docs.cerebras.ai/models) + ### DeepSeek [DeepSeek](https://deepseek.com) allows to access the DeepSeek models (Chat, Code, and Reasoning) through the DeepSeek AI Platform. @@ -772,26 +815,93 @@ if status.status.value == "completed": results = engine.get_batch_results(batch_id, conversations) ``` +### Cancelling a Batch Job + +Submitted batches can be cancelled before they complete. This is useful if you submitted by mistake, want to change parameters, or realise partial results are enough. 
+ +```python +batch_info = engine.cancel_batch(batch_id) +print(f"Status: {batch_info.status}") # typically "cancelling" or "cancelled" +``` + +`cancel_batch` returns a `BatchInfo` with the updated status. Cancelling does not delete requests that have already been processed — you can still retrieve partial results (see below). + +### Partial Batch Results + +When a batch fails, is cancelled, or finishes with some errored rows, you can still recover the successfully completed conversations using `get_batch_results_partial`: + +```python +partial = engine.get_batch_results_partial(batch_id, conversations) +print(f"Completed: {len(partial.conversations)}") +print(f"Failed: {len(partial.failures)}") + +for failure in partial.failures: + print(f"Row {failure.index}: {failure.error}") +``` + +This returns a `BatchResult` containing successful conversations and a structured list of failures (with per-row error details), rather than raising on the first error. Combined with the partial-retry support in `RemoteInferenceEngine`, this lets you re-submit only the failed rows instead of re-running the whole batch. + +### Listing Available Models + +Every `InferenceEngine` exposes a `list_models()` method that returns the model IDs usable with that engine: + +```python +from oumi.inference import OpenAIInferenceEngine +from oumi.core.configs import ModelParams + +engine = OpenAIInferenceEngine( + model_params=ModelParams(model_name="gpt-4o-mini") +) + +# Chat-capable models only (default) +print(engine.list_models()) + +# Include non-chat models (e.g. embeddings, moderation) +print(engine.list_models(chat_only=False)) +``` + +For remote engines that expose a `/models` endpoint (OpenAI, Vertex AI, Bedrock, Parasail, Fireworks, and any OpenAI-compatible engine), this queries the provider's API for the live model list. For engines that do not expose a listing API, `list_models()` falls back to returning just the model the engine was initialised with. 
+ +### Token Usage Tracking + +All remote inference engines attach per-message usage metadata to the returned `Conversation`s when the provider reports it. Each final assistant `Message` carries a `usage` dict with: + +- `prompt_tokens`, `completion_tokens`, `total_tokens` — standard OpenAI-style accounting. +- `cached_tokens` — prompt tokens served from a provider-side cache (populated for OpenAI, Anthropic, and Together today). +- `cache_creation_tokens` — tokens *written* to the cache on this request (Anthropic only). + +```python +results = engine.infer(conversations) +for conv in results: + usage = conv.messages[-1].usage or {} + print(usage.get("prompt_tokens"), usage.get("completion_tokens"), + usage.get("cached_tokens")) +``` + +Higher-level components that call inference internally — `BaseJudge`, `AttributeSynthesizer`, and the batch inference flows — accumulate the per-response usage into a run-level total so you can report aggregate cost without re-walking every conversation. See the relevant component docs for their aggregate accessors. 
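If you are not using those components, accumulating the per-message dicts yourself takes only a few lines (a plain-Python sketch, not an Oumi API):

```python
from collections import Counter


def accumulate_usage(per_message_usage):
    """Sum token counters across responses; missing fields count as zero."""
    totals = Counter()
    for usage in per_message_usage:
        for key in ("prompt_tokens", "completion_tokens", "total_tokens",
                    "cached_tokens", "cache_creation_tokens"):
            totals[key] += usage.get(key) or 0
    return dict(totals)


run_totals = accumulate_usage([
    {"prompt_tokens": 900, "completion_tokens": 100, "total_tokens": 1000,
     "cached_tokens": 800},
    {"prompt_tokens": 950, "completion_tokens": 50, "total_tokens": 1000},
])
print(run_totals["prompt_tokens"], run_totals["cached_tokens"])  # 1850 800
```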
+ ### Supported Engines The following table shows which engines support batch inference: -| Engine | Batch Support | Notes | -|--------|---------------|-------| -| OpenAI | ✅ Supported | OpenAI Batch API | -| Parasail | ✅ Supported | OpenAI-compatible Batch API | -| Anthropic | 🔜 Coming soon | Message Batches API | -| Together | 🔜 Coming soon | Together Batch API | -| Fireworks | 🔜 Coming soon | Fireworks Batch API | -| DeepSeek | ❌ Not supported | | -| Gemini | ❌ Not supported | | -| Vertex AI | ❌ Not supported | | -| Bedrock | ❌ Not supported | | -| Lambda | ❌ Not supported | | -| SambaNova | ❌ Not supported | | -| OpenRouter | ❌ Not supported | | -| Remote vLLM | ❌ Not supported | | -| SGLang | ❌ Not supported | | +| Engine | Batch Support | Notes | +|------------|-------------------|------------------------------------| +| OpenAI | ✅ Supported | OpenAI Batch API | +| Parasail | ✅ Supported | OpenAI-compatible Batch API | +| Anthropic | ✅ Supported | Message Batches API | +| Together | ✅ Supported | Together Batch API | +| Fireworks | ✅ Supported | Fireworks Batch API | +| Cerebras | ❌ Not supported | | +| DeepSeek | ❌ Not supported | | +| Gemini | ❌ Not supported | | +| Vertex AI | ❌ Not supported | | +| Bedrock | ❌ Not supported | | +| SambaNova | ❌ Not supported | | +| OpenRouter | ❌ Not supported | | +| Remote vLLM| ❌ Not supported | | +| SGLang | ❌ Not supported | | + +Engines that support batch also support `cancel_batch` and `get_batch_results_partial`. ## See Also diff --git a/docs/user_guides/judge/judge.md b/docs/user_guides/judge/judge.md index 4f6cfa150c..9e00cd1ca8 100644 --- a/docs/user_guides/judge/judge.md +++ b/docs/user_guides/judge/judge.md @@ -126,6 +126,121 @@ for output in outputs: explanation = output.field_values["explanation"] # The correct answer is Paris. ``` +## Rule-Based Judges + +```{admonition} Experimental +:class: warning +Rule-based judges are experimental and subject to change. 
+``` + +Some evaluations don't need an LLM: "does the response contain a phone number?", "does the output avoid the words `error` or `traceback`?", "is the answer an exact match for the expected string?". For these cases Oumi provides {py:class}`~oumi.judges.rule_based_judge.RuleBasedJudge`, which applies a deterministic rule to each input — no inference, no token cost, no LLM variance. + +### Quick Start + +```python +from oumi.judges.rule_based_judge import RuleBasedJudge + +judge = RuleBasedJudge(judge_config="oumi://configs/projects/judges/rule_based/regex_match_phone.yaml") + +outputs = judge.judge([ + {"response": "Call me at 555-1234."}, + {"response": "Send an email."}, +]) + +for out in outputs: + print(out.field_values["judgment"], out.field_scores["judgment"]) +# True 1.0 +# False 0.0 +``` + +### Config Schema + +Rule-based judges reuse {py:class}`~oumi.core.configs.judge_config.JudgeConfig` but drive evaluation from a new `rule_judge_params` block ({py:class}`~oumi.core.configs.params.rule_judge_params.RuleJudgeParams`) instead of calling an LLM. `inference_config` is not required. 
+ +```yaml +judge_params: + prompt_template: "{response}" # still required; placeholders are validated + +rule_judge_params: + rule_type: "regex" # rule registered in the RULE registry + input_fields: ["response"] # fields expected on each input dict + + rule_config: # rule-specific options + pattern: "\\d{3}-\\d{4}" + input_field: "response" + match_mode: "search" # "search" | "match" | "fullmatch" + inverse: false # pass when pattern does NOT match + flags: 0 # optional re.* flag bitmask + + response_format: XML # XML | JSON | RAW + judgment_type: BOOL # BOOL | INT | FLOAT | TEXT | ENUM +``` + +### Built-in Rules + +| Rule | Description | Key `rule_config` options | +|-----------|-----------------------------------------------------------|---------------------------| +| `regex` | Python `re` match against a named input field | `pattern`, `input_field`, `match_mode`, `inverse`, `flags` | + +New rules register themselves via `@register("my_rule", RegistryType.RULE)` on a class that implements {py:class}`~oumi.judges.rules.base_rule.BaseRule` and returns `(judgment: bool, score: float)` from `apply()`. + +### Ready-Made Configs + +| Config | What it checks | +|------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------| +| {gh}`regex_match_phone.yaml ` | Response contains an `XXX-XXXX` phone number | +| {gh}`regex_no_error_keywords.yaml ` | Response does NOT contain `error`, `fail`, `exception`, etc. | + +### CLI Usage + +```bash +oumi judge dataset \ + -c oumi://configs/projects/judges/rule_based/regex_match_phone.yaml \ + --input data/dataset_examples/judge_input.jsonl +``` + +Rule-based judges are run through the same `oumi judge dataset` command as LLM judges — the CLI dispatches to `RuleBasedJudge` automatically when `rule_judge_params` is present in the config. 
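The rule contract itself is small: `apply()` returns `(judgment, score)`. A standalone plain-Python sketch of the regex rule's semantics (illustrative; the real implementation is a `BaseRule` subclass resolved through the RULE registry):

```python
import re


def apply_regex_rule(text, pattern, match_mode="search", inverse=False, flags=0):
    """Return (judgment, score) in the rule-based judge style: True/1.0 on pass."""
    matcher = getattr(re, match_mode)  # re.search / re.match / re.fullmatch
    matched = matcher(pattern, text, flags) is not None
    judgment = matched != inverse  # inverse=True passes when there is NO match
    return judgment, 1.0 if judgment else 0.0


print(apply_regex_rule("Call me at 555-1234.", r"\d{3}-\d{4}"))        # (True, 1.0)
print(apply_regex_rule("Send an email.", r"\d{3}-\d{4}"))              # (False, 0.0)
print(apply_regex_rule("all good", r"error|traceback", inverse=True))  # (True, 1.0)
```

Because the rule is deterministic, the same input always yields the same judgment, which makes rule-based judges safe to use as hard gates in CI.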
+ +## Batch Judging + +For providers that support batch inference (OpenAI, Anthropic, Together, Fireworks, Parasail — see {doc}`inference_engines <../infer/inference_engines>`), `BaseJudge` can submit, poll, and collect judgments asynchronously at reduced cost. + +```python +from oumi.judges.simple_judge import SimpleJudge + +judge = SimpleJudge("oumi://configs/projects/judges/generic/truthfulness.yaml") + +inputs = [{"request": "...", "response": "..."}, ...] + +# Submit as a single batch +batch_id, conversations = judge.judge_batch_submit(inputs) + +# ... later, possibly in a different process ... +# Poll the engine directly if you need a status update +status = judge.inference_engine.get_batch_status(batch_id) + +# Collect when done +outputs = judge.judge_batch_result(batch_id, conversations) # raises on any failure + +# Or tolerate per-row failures: +result = judge.judge_batch_result_partial(batch_id, conversations) +print(f"Succeeded: {len(result.successful)}, failed: {len(result.failed_indices)}") +``` + +`judge_batch_submit` returns the provider batch ID and the `Conversation`s used to build it — you must pass both back to `judge_batch_result(_partial)` so that inputs and outputs can be re-aligned. Rule-based judges don't call inference, so batch judging does not apply to them. + +## Token Usage Tracking + +Both `SimpleJudge` and `RuleBasedJudge` inherit from `BaseJudge`, which accumulates per-request token usage across every call to `judge()` / `judge_batch_result()`. After a run you can read: + +```python +print(judge.total_input_tokens) # sum of prompt_tokens across requests +print(judge.total_output_tokens) # sum of completion_tokens +print(judge.total_cached_tokens) # prompt tokens served from provider cache +``` + +Usage is recorded whether the request went through `infer()` (online) or `infer_batch()` (batch), so the totals are directly comparable across modes. Rule-based judges make no LLM calls and leave these counters at zero. 
+ ## Next Steps - Explore our {doc}`Built-In Judges ` for out-of-the-box evaluation criteria diff --git a/docs/user_guides/launch/launch.md b/docs/user_guides/launch/launch.md index 2104dc8718..051e14bffe 100644 --- a/docs/user_guides/launch/launch.md +++ b/docs/user_guides/launch/launch.md @@ -253,6 +253,39 @@ run: | ```` ::: +:::{tab-item} Nebius +````{dropdown} sample-nebius-job.yaml +```yaml +name: sample-nebius-job + +resources: + cloud: nebius + accelerators: "A100" + # Nebius currently supports on-demand GPU VMs. Spot is not supported. + use_spot: false + disk_size: 500 # Disk size in GBs + +num_nodes: 1 # Set to a larger number for multi-node training. + +working_dir: . + +envs: + OUMI_RUN_NAME: sample.nebius.job + +setup: | + set -e + pip install uv && uv pip install --system 'oumi[gpu]' + +# NOTE: Update this section with your training command. +run: | + set -e # Exit if any command failed. + oumi train -c ./path/to/your/config +``` +```` + +Nebius support is provided through SkyPilot. Enable it the same way as other providers: run `sky check` and follow the instructions for the "Nebius" row (see the [SkyPilot Nebius docs](https://docs.skypilot.co/en/latest/reference/cloud-setup/cloud-permissions/nebius.html)). +::: + :::{tab-item} Lambda ````{dropdown} sample-lambda-job.yaml ```yaml diff --git a/docs/user_guides/mcp.md b/docs/user_guides/mcp.md new file mode 100644 index 0000000000..93e97f3a5d --- /dev/null +++ b/docs/user_guides/mcp.md @@ -0,0 +1,90 @@ +# MCP Server + +```{admonition} Experimental +:class: warning +The Oumi MCP server is under active development (Phase 1). Tools, resources, and prompts may change as we iterate on the integration with MCP-capable assistants. +``` + +Oumi ships an [MCP (Model Context Protocol)](https://modelcontextprotocol.io/) server that lets AI assistants — e.g. Claude Desktop, Claude Code, Cursor — discover Oumi configs, run training/eval/inference jobs, and read workflow guidance without leaving the chat interface. 
It's installed as a separate extra and launched as a standalone process. + +## Installation + +```bash +pip install "oumi[mcp]" +``` + +This pulls in `fastmcp`, the `mcp` package, and `httpx`. Once installed, a new console script is available: + +```bash +oumi-mcp # starts the MCP server on stdio +python -m oumi.mcp # equivalent +``` + +## Connecting from an MCP Client + +Most MCP-capable clients expect a JSON entry describing how to launch the server. Point them at the `oumi-mcp` script: + +```json +{ + "mcpServers": { + "oumi": { + "command": "oumi-mcp" + } + } +} +``` + +Exact placement depends on the client — Claude Desktop reads `~/Library/Application Support/Claude/claude_desktop_config.json` on macOS; other clients use similar conventions. Refer to your client's MCP docs for the exact path. + +## What's Exposed + +The server surfaces three kinds of MCP primitives: + +**Tools** (assistant-callable functions) + +| Tool | Purpose | +|----------------------|--------------------------------------------------------------| +| `search_configs` | Fuzzy-search the ~500 Oumi YAML configs by path + content | +| `get_config` | Fetch full details (path, model, dataset, raw YAML) for one config | +| `launch_job` | Launch an Oumi job locally or on a cloud provider | +| `poll_status` | Poll a running job's status | +| `stop_cluster` / `down_cluster` | Manage SkyPilot clusters | +| `cancel_job` | Cancel a running job | +| `fetch_logs` | Retrieve logs for a job | +| `list_running_jobs` / `list_completed_jobs` | Inventory of tracked jobs | + +**Resources** (workflow guidance strings the assistant can read) + +- `guidance://mle-workflow` — overall ML engineering workflow +- `guidance://mle-train`, `mle-synth`, `mle-analyze`, `mle-eval`, `mle-infer` — per-command guidance +- `guidance://cloud-launch` — cloud job anatomy and setup patterns +- `guidance://post-training` — post-training steps (download weights, eval, teardown) + +**Prompts** (pre-built prompt templates for common 
tasks). See the source under `oumi.mcp.prompts` for the current list. + +## Path Rules (Important) + +Because the server executes real commands against the user's machine or cloud account, it's strict about paths: + +- Every path-sensitive tool requires `client_cwd` — the user's project root. +- Config paths may be absolute or relative to `client_cwd`. +- **Local jobs**: the subprocess runs from `client_cwd`; paths inside the YAML resolve against that directory. +- **Cloud jobs**: `client_cwd` becomes `working_dir` on the remote VM. Use repo-relative paths in the YAML — never local-machine absolute paths. + +Assistants should always call the `get_started()` tool first (returned by the server) to retrieve the up-to-date tool catalog and workflow before doing anything else. + +## Debugging + +`oumi-mcp` speaks stdio MCP by default and logs to stderr. To inspect traffic, run it from a terminal and set `OUMI_LOG_LEVEL=DEBUG`: + +```bash +OUMI_LOG_LEVEL=DEBUG oumi-mcp +``` + +Then configure the client to spawn it with the same environment. + +## See Also + +- [Model Context Protocol spec](https://modelcontextprotocol.io/) +- {doc}`/cli/commands` — what the underlying `oumi` CLI can do +- {doc}`/user_guides/launch/launch` — details on cloud job launching diff --git a/docs/user_guides/synth.md b/docs/user_guides/synth.md index 97437fe20f..52461bb493 100644 --- a/docs/user_guides/synth.md +++ b/docs/user_guides/synth.md @@ -121,6 +121,53 @@ input_documents: - path: "textbook.pdf" ``` +**Supported dataset formats** (`input_data`): JSONL, JSON, CSV, TSV, Parquet, and **XLSX**. For XLSX files, every sheet is concatenated into a single dataset, so you can keep related tabs in one workbook. Globs are supported: + +```yaml +input_data: + - path: "data/**/*.xlsx" +``` + +**Supported document formats** (`input_documents`): `.pdf`, `.txt`, `.md`, `.html`, and **`.docx`**. DOCX files are parsed paragraph-by-paragraph. 
+ +```{note} +XLSX / DOCX parsing requires the synthesis extras: `pip install "oumi[synthesis]"`. +``` + +### Few-Shot Sampling From Sources + +When you want each synthesised sample to see *multiple* randomly-drawn items from a source (examples, datasets, or documents), use `num_shots`. This turns the source into a dynamic few-shot pool instead of round-robin enumeration. + +```yaml +input_examples: + - id: few_shot_examples + num_shots: 3 # draw 3 examples per synthesis sample + examples: + - task_type: "summarization" + example_input: "..." + - task_type: "translation" + example_input: "..." + # ... + +generated_attributes: + - id: instruction + instruction_messages: + - role: USER + content: | + Example 1: {few_shot_examples[0].example_input} + Example 2: {few_shot_examples[1].example_input} + Example 3: {few_shot_examples[2].example_input} + Now produce a new, different example. +``` + +Rules: + +- `num_shots: None` or `1` → the source behaves as before (round-robin), reference fields as `{id.field}`. +- `num_shots > 1` → bracket notation `{id[i].field}` is required, and `id` must be set. +- Works uniformly across `input_examples`, `input_data`, and `input_documents`. + +A runnable example lives at {gh}`configs/examples/synthesis/dynamic_few_shot_synth.yaml`.
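The `num_shots` semantics can also be illustrated in plain Python (a sketch only; the naive string replacement below is an assumption for illustration, not Oumi's placeholder engine):

```python
import random


def draw_shots(pool, source_id, num_shots, seed=None):
    """Draw num_shots items and expose their fields as '{id[i].field}' keys."""
    rng = random.Random(seed)
    placeholder_values = {}
    for i, item in enumerate(rng.sample(pool, num_shots)):
        for field, value in item.items():
            placeholder_values[f"{source_id}[{i}].{field}"] = value
    return placeholder_values


pool = [{"example_input": "Summarize this article."},
        {"example_input": "Translate this to French."},
        {"example_input": "Write a haiku about rain."}]

values = draw_shots(pool, "few_shot_examples", num_shots=3, seed=0)
prompt = "Example 1: {few_shot_examples[0].example_input}"
for placeholder, value in values.items():
    prompt = prompt.replace("{" + placeholder + "}", value)
print(prompt)
```

Each synthesis sample triggers a fresh draw, so the few-shot examples vary across the generated dataset instead of repeating in a fixed order.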
+ ### Creating Conversations Build multi-turn dialogues with fixed structure using transformed attributes: @@ -543,6 +590,38 @@ data: dataset_path: "synthetic_qa_dataset.jsonl" ``` +## Batch Inference + +When your synthesis provider supports batch inference (OpenAI, Anthropic, Together, Fireworks, Parasail — see {doc}`/user_guides/infer/inference_engines`), you can submit all prompts for a single attribute as a batch job rather than calling the API online: + +```python +from oumi.core.synthesis.attribute_synthesizer import AttributeSynthesizer + +synth = AttributeSynthesizer(config) + +# Submit a batch job for one generated attribute +batch_id = synth.synthesize_batch(samples, generated_attribute) + +# Later, retrieve results +results = synth.get_batch_results(batch_id, samples, generated_attribute) +# Or tolerate per-row failures: +partial = synth.get_batch_results_partial(batch_id, samples, generated_attribute) +``` + +Batches are typically 50% cheaper than online inference at the cost of a 24-hour completion window. Attributes are batched one at a time (not across attributes), so chained `generated_attributes` still run sequentially. + +## Token Usage Tracking + +`AttributeSynthesizer` accumulates token usage across every online and batch call: + +```python +print(synth.total_input_tokens) # prompt_tokens across all calls +print(synth.total_output_tokens) # completion_tokens +print(synth.total_cached_tokens) # prompt tokens served from provider cache +``` + +Use these counters for cost reporting across an entire synthesis run. See also [Token Usage Tracking](/user_guides/infer/inference_engines.md#token-usage-tracking) on the inference engine side. + ## Best Practices 1. 
**Start Small**: Begin with a small `num_samples` to test your configuration diff --git a/docs/user_guides/train/training_methods.md b/docs/user_guides/train/training_methods.md index 4c76a969b5..5d85ce90fb 100644 --- a/docs/user_guides/train/training_methods.md +++ b/docs/user_guides/train/training_methods.md @@ -350,6 +350,34 @@ verl requires paths to Parquet files for the training and validation data. Oumi Instead of training a separate reward model which estimates the reward value of a completion, it is common to use reward functions instead. Both the trl and verl frameworks have specific interfaces required for the reward functions used. These are documented in the [trl documentation](https://huggingface.co/docs/trl/main/en/grpo_trainer#using-a-custom-reward-function) and [verl documentation](https://verl.readthedocs.io/en/latest/preparation/reward_function.html) respectively. +#### Per-Function kwargs + +Reward functions that take configuration (e.g. a rubric judge panel path, a strictness flag) can be wired via `training.reward_function_kwargs`. The kwargs are a dict keyed by **reward function name**, with each value being that function's kwargs dict: + +```yaml +training: + trainer_type: "TRL_GRPO" # or VERL_GRPO + + reward_functions: + - rubric_reward + - gsm8k + + reward_function_kwargs: + rubric_reward: + judge_panel_path: "configs/projects/judges/rubric_judge_panel.yaml" + gsm8k: + strict: true +``` + +Rules: + +- Keys in `reward_function_kwargs` must also appear in `reward_functions`; extra keys raise a validation error. +- Omit the key (or use `{}`) for functions with no configuration. +- Configured kwargs take precedence over per-sample kwargs passed by the trainer at call time. +- Only `TRL_GRPO` and `VERL_GRPO` support `reward_function_kwargs` today — setting it with any other trainer raises a validation error. 
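To make the wiring concrete, here is a sketch of a trl-style reward function that could sit behind the `gsm8k` entry above. The `strict` flag and the scoring heuristic are illustrative assumptions, not Oumi's implementation; see the trl documentation for the exact reward-function contract:

```python
import re

def gsm8k_reward(completions, strict=False, **kwargs):
    """Reward 1.0 when a completion contains a numeric answer, else 0.0.

    `strict` (hypothetical, supplied via `reward_function_kwargs`) additionally
    requires a GSM8K-style `#### <number>` final-answer marker at the end.
    Per-sample data passed by the trainer at call time lands in **kwargs.
    """
    pattern = r"####\s*-?\d+(\.\d+)?\s*$" if strict else r"-?\d+(\.\d+)?"
    return [1.0 if re.search(pattern, c) else 0.0 for c in completions]

rewards = gsm8k_reward(["Thus the total is #### 42", "no numeric answer"], strict=True)
```

Because configured kwargs take precedence over per-sample kwargs, a `strict` value set in the YAML wins over anything the trainer passes through at call time.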
+ +This is particularly useful with rubric-based reward functions (see {gh}`RaR datasets `) where a single function is reused across configs with different judge panels or scoring rules. + ### Configuration #### TRL_GRPO