A minimal harness for operators who want to own the SWE agent loop: start with a small bash-first agent, swap MCP toolsets at invocation time, produce inspectable trajectories, and measure each change before adding more machinery.
- Rust toolchain 1.85 or newer, matching the crate `rust-version`.
- Git on `PATH`.
- Optional: Docker, only when running isolated environments with a binary built with the `docker` feature.
- Optional for live-model runs only: the provider credential expected by LiteLLM-style routing, such as `ANTHROPIC_API_KEY` for `claude*` models or `OPENAI_API_KEY` for OpenAI-routed models.
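The toolchain checks above can be scripted. This is a minimal preflight sketch, assuming only that `cargo` and `git` must be on `PATH` and that `rustc --version` prints a line shaped like `rustc 1.85.0 (...)`:

```python
import shutil

def rustc_version_ok(version_line: str, minimum=(1, 85)) -> bool:
    """Parse 'rustc 1.85.0 (...)' style output and compare to the minimum."""
    parts = version_line.split()
    if len(parts) < 2:
        return False
    nums = parts[1].split(".")
    try:
        return (int(nums[0]), int(nums[1])) >= minimum
    except ValueError:
        return False

# Preflight: both tools must be on PATH before the quickstart commands run.
for tool in ("cargo", "git"):
    print(tool, "found" if shutil.which(tool) else "MISSING")

print(rustc_version_ok("rustc 1.85.0 (abc1234 2025-01-01)"))  # → True
print(rustc_version_ok("rustc 1.80.1 (abc1234 2024-06-01)"))  # → False
```

Pair this with `rustc --version` output from your own shell; the tuple comparison mirrors the "1.85 or newer" requirement without pulling in a semver library.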
This path costs $0 and makes no network model calls. The `hello-world` command uses a scripted deterministic model, runs one local shell command, then writes a canonical trajectory and a final-output artifact.
PowerShell:

    cargo run --quiet -- --log error hello-world --output runs/quickstart

macOS/Linux:

    cargo run --quiet -- --log error hello-world --output runs/quickstart

Expected stdout:

    hello-world smoke complete
    trajectory: runs/quickstart/hello-world.traj.json
    output: runs/quickstart/hello-world.output.txt
The trajectory at `runs/quickstart/hello-world.traj.json` should parse as `mini-swe-agent-1.1`, have `outcome: "submitted"`, and record `total_cost_usd: 0.0`.
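Those three properties can be checked mechanically before moving on. A minimal sketch: the `outcome` and `total_cost_usd` keys come from the description above, while `format` is an assumed name for the key that carries the `mini-swe-agent-1.1` schema tag and may differ in the real file.

```python
import json
from pathlib import Path

def check_smoke_trajectory(path: str) -> list:
    """Return a list of problems found in a hello-world smoke trajectory."""
    problems = []
    data = json.loads(Path(path).read_text())
    # "format" is an assumed key name for the mini-swe-agent-1.1 schema tag;
    # the real trajectory may carry it under a different key.
    if data.get("format") != "mini-swe-agent-1.1":
        problems.append("unexpected or missing schema tag")
    if data.get("outcome") != "submitted":
        problems.append("outcome is not 'submitted'")
    # The no-key smoke path must not spend money.
    if data.get("total_cost_usd") != 0.0:
        problems.append("smoke run reported a nonzero cost")
    return problems

# In-memory stand-in for runs/quickstart/hello-world.traj.json:
sample = {"format": "mini-swe-agent-1.1", "outcome": "submitted", "total_cost_usd": 0.0}
Path("/tmp/hello-world.traj.json").write_text(json.dumps(sample))
print(check_smoke_trajectory("/tmp/hello-world.traj.json"))  # → []
```

An empty problem list means the smoke run is in the expected shape; anything else is worth inspecting before spending on a live model.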
PowerShell:

    cargo run --quiet -- --log error bench inspect --sweep runs/quickstart --instance hello-world

macOS/Linux:

    cargo run --quiet -- --log error bench inspect --sweep runs/quickstart --instance hello-world

This is the core operator loop before any sweep: run one task, inspect the trajectory, then decide whether the model, prompt, budget, and environment are ready for a broader run.
Use `bench doctor` on a local SWE-bench JSONL dataset before launching a sweep. It checks the dataset and environment setup; `--skip-model-probe` keeps this preflight from touching a model provider.
PowerShell:

    cargo run --quiet -- --log error bench doctor --dataset-path .\data\swebench.jsonl --output runs\doctor --limit 1 --skip-model-probe

macOS/Linux:

    cargo run --quiet -- --log error bench doctor --dataset-path ./data/swebench.jsonl --output runs/doctor --limit 1 --skip-model-probe

After credentials are set and you are ready to spend a small calibration budget, run `bench forecast` before a full sweep:

    cargo run --quiet -- --log info bench forecast --dataset-path ./data/swebench.jsonl --output runs/forecast --limit 5 --calibration-n 2 --sweep-cost-limit-usd 1.00 --format json > runs/forecast.json

For paid sweeps, treat the operator loop as `doctor` -> `forecast` -> `swebench` -> `calibrate`. The forecast keeps the first spend bounded; the calibration report tells you whether that forecast was trustworthy after the real sweep completes.
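Because the forecast is written as JSON, the go/no-go decision can be scripted rather than eyeballed. The sketch below assumes a hypothetical `projected_cost_usd` field; the real schema emitted by `bench forecast --format json` may name its estimate differently.

```python
def clears_budget(forecast: dict, cap_usd: float) -> bool:
    """Return True when the forecast's projected spend fits under the cap.

    `projected_cost_usd` is a hypothetical field name; adjust it to the
    actual key in runs/forecast.json before relying on this gate.
    """
    projected = forecast.get("projected_cost_usd")
    if projected is None:
        # Fail closed: an unreadable forecast should block the sweep.
        return False
    return projected <= cap_usd

forecast = {"projected_cost_usd": 0.80}
print(clears_budget(forecast, 1.00))  # → True
print(clears_budget(forecast, 0.50))  # → False
```

Wiring this between the `forecast` and `swebench` steps turns the cost cap into a hard gate instead of a convention.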
    cargo run --quiet -- --log error bench doctor --dataset-path ./data/swebench.jsonl --output runs/doctor --limit 5 --skip-model-probe
    cargo run --quiet -- --log info bench forecast --dataset-path ./data/swebench.jsonl --output runs/forecast --limit 5 --calibration-n 2 --sweep-cost-limit-usd 1.00 --format json > runs/forecast.json
    cargo run --quiet -- --log info bench swebench --dataset-path ./data/swebench.jsonl --output runs/sweep --limit 5 --sweep-cost-limit-usd 1.00
    cargo run --quiet -- --log error bench calibrate --forecast runs/forecast.json --results runs/sweep/results.json --output runs/sweep/calibration.json --fail-on-optimistic

`bench calibrate` prints a compact summary, writes a versioned `calibration_report`, classifies each metric as `within_interval`, `over_upper`, or `under_lower`, and exits with `calibration_optimistic` when `--fail-on-optimistic` is set and actuals overshot the forecast. Every budget forecast gets a receipt.
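The three classifications reduce to a plain interval check. A minimal sketch of that logic, assuming the forecast supplies lower and upper bounds per metric (the bound field names here are hypothetical):

```python
def classify(actual: float, lower: float, upper: float) -> str:
    """Classify a metric against its forecast interval, mirroring the
    within_interval / over_upper / under_lower labels described above."""
    if actual > upper:
        return "over_upper"
    if actual < lower:
        return "under_lower"
    return "within_interval"

# A cost that overshoots its forecast interval is "over_upper": the
# optimistic-forecast case that --fail-on-optimistic turns into a
# calibration_optimistic exit.
print(classify(1.40, 0.50, 1.00))  # → over_upper
print(classify(0.75, 0.50, 1.00))  # → within_interval
print(classify(0.30, 0.50, 1.00))  # → under_lower
```

Reading `calibration.json` with this mental model makes the stdout summary quick to interpret after a sweep.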
Keep the paid path separate from the no-key smoke path. Set the credential for
the model family you choose, keep the task local and tiny, and set both a step
limit and a per-task budget. Local execution runs model-generated shell commands
from this checkout, so treat it as trusted-code execution. Use Docker isolation
when available for untrusted repositories or tasks; see
docs/spec-interactive-mode.md for the local
execution safety rationale.
PowerShell:

    $env:ANTHROPIC_API_KEY = "<your Anthropic key>"
    cargo run --quiet -- --log info mini --task "Create runs/live-task/hello.txt containing hello from rust_swe_agent." --model claude-opus-4-7 --env local --output runs/live-quickstart --trajectory-name live-hello --step-limit 8 --task-timeout-secs 300 --per-task-budget-usd 0.25
    cargo run --quiet -- --log error bench inspect --sweep runs/live-quickstart --instance live-hello

macOS/Linux:

    export ANTHROPIC_API_KEY="<your Anthropic key>"
    cargo run --quiet -- --log info mini --task "Create runs/live-task/hello.txt containing hello from rust_swe_agent." --model claude-opus-4-7 --env local --output runs/live-quickstart --trajectory-name live-hello --step-limit 8 --task-timeout-secs 300 --per-task-budget-usd 0.25
    cargo run --quiet -- --log error bench inspect --sweep runs/live-quickstart --instance live-hello

For a real SWE-bench sweep, run `bench doctor` first, then `bench forecast` with a cost cap, then `bench swebench` only after the forecast clears your budget, and finally `bench calibrate` against the completed `results.json`. This avoids starting with a multi-instance spendfest and leaves a durable calibration record.
| Symptom | Likely Cause | Fix |
|---|---|---|
| `cargo` is not recognized or `rustc` is too old | Missing Rust or a toolchain older than 1.85 | Install/update Rust with rustup, then run `rustc --version` |
| `git` is not recognized | Git is missing from PATH | Install Git and open a new shell |
| Live run fails with missing credentials | Provider API key is not set | Set `ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, or the provider-specific key before `mini`, `forecast`, or `swebench` |
| Docker run fails before the agent starts | Docker is unavailable or the binary lacks the `docker` feature | Start Docker, or use `--env local`; build with the Docker feature before selecting `--env docker` |
| Smoke run cannot write artifacts | Output directory is unwritable | Choose a writable `--output` path, for example `runs/quickstart` inside the repo |
| `bench doctor` reports dataset read/parse errors | The `--dataset-path` value is missing, points at a directory, or is not JSONL | Pass a readable SWE-bench JSONL file and rerun `bench doctor --skip-model-probe` |
Start with the first-run path above, then use these deeper specs once you have a valid trajectory in hand:
- configuration reference: every config field, default value, valid values, precedence rules, copy-pasteable TOML examples, and secret handling guidance. Start here before tuning a sweep.
- `bench tail`: live aggregate progress, cost burn, ETA, and failure mix for running SWE-bench sweeps.
- `bench evaluate`: evaluator output, rerun metrics, pass@k, and compare regression gates.
- `bench triage`: deterministic unresolved-failure clustering, ranked stdout tables, and the `triage.json` schema.
- agent scriptability: invocation-time MCP servers plus `PreToolUse` and `PostToolUse` hooks for A/B testing agent toolsets without rebuilding Rust.
- streaming: SSE and webhook event surfaces for observing runs while they execute.
- secret redaction: redaction guarantees for trajectories, inspect output, streams, and patch artifacts.
- `bench reproduce`: replay a saved sweep from its manifest, detect environment drift, and write a `reproducibility.json` comparison artifact.
`.github/workflows/swe-bench-nightly.yml` runs a single SWE-bench Lite instance through the full `bench swebench` pipeline against an OpenRouter free-tier model (`openrouter/deepseek/deepseek-chat-v3.1:free`). It exists to catch harness regressions, not to track solve rate: the run passes whenever the sweep reports `errored == 0` in `results.json`. Trajectories and the input dataset are uploaded as artifacts on every run; scheduled failures auto-open a `nightly-smoke` issue. Requires the `OPENROUTER_API_KEY` repository secret.
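That pass condition is easy to reproduce locally against any sweep's `results.json`. A minimal sketch, assuming the file carries a top-level `errored` count as the workflow's gate implies:

```python
import json
from pathlib import Path

def nightly_passes(results_path: str) -> bool:
    """Mirror the nightly smoke gate: pass iff the sweep reports errored == 0."""
    results = json.loads(Path(results_path).read_text())
    # A missing count fails closed rather than silently passing.
    return results.get("errored", 1) == 0

# Stand-in for a completed sweep's results.json:
Path("/tmp/results.json").write_text(json.dumps({"resolved": 0, "errored": 0}))
print(nightly_passes("/tmp/results.json"))  # → True
```

Running the same check on a local sweep before pushing keeps "works nightly" and "works on my machine" aligned.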