Comparison benchmark: ClawBench (live-site browser agent eval with submission-interception)

Hi AgentBench team,

Thanks for AgentBench (ICLR'24) — it has been a foundational reference for the multi-environment agent evaluation community.

I'd like to flag a potentially complementary benchmark:

## ClawBench — live-production browser agent evaluation

Where AgentBench covers 8 environments (OS, DB, KG, card game, etc.) with static or simulated tasks, ClawBench focuses specifically on the **browser** environment and runs on **live production websites**.

- 153 everyday web tasks on 144 live production websites across 15 life categories such as food delivery, job search, e-commerce, and travel (for example Grubhub, Indeed, Amazon, and United)
- 7 frontier models evaluated; the best-performing model passes 33.3% of tasks
- Unique mechanism: a Chrome-extension plus CDP (Chrome DevTools Protocol) submission-interception layer blocks only the final write request, so agents can execute real flows (checkout, job applications, forms) end-to-end on the live site without triggering real-world side effects: no actual orders are placed and no real applications are submitted
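
To make the interception idea concrete, here is a minimal sketch of the decision logic, not ClawBench's actual implementation. It assumes the layer listens for CDP `Fetch.requestPaused` events and answers each one with either `Fetch.failRequest` or `Fetch.continueRequest`. The `FinalWriteRule` structure, the helper names, and the example host and path are hypothetical and used only for illustration.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class FinalWriteRule:
    """Hypothetical per-task rule naming the one request that must not reach the server."""
    method: str       # HTTP method of the final write, e.g. "POST"
    host: str         # hostname of the target site
    path_prefix: str  # path prefix of the submission endpoint


def is_final_write(method: str, url: str, rule: FinalWriteRule) -> bool:
    """Return True only for the request that matches the task's final-write rule."""
    parsed = urlparse(url)
    return (
        method.upper() == rule.method.upper()
        and parsed.hostname == rule.host
        and parsed.path.startswith(rule.path_prefix)
    )


def on_request_paused(event: dict, rule: FinalWriteRule) -> dict:
    """Map a CDP Fetch.requestPaused event to the CDP command to send back."""
    request = event["request"]
    if is_final_write(request["method"], request["url"], rule):
        # Block the submission so no real-world side effect occurs.
        return {
            "method": "Fetch.failRequest",
            "params": {"requestId": event["requestId"], "errorReason": "BlockedByClient"},
        }
    # Every other request passes through untouched, so the live flow keeps working.
    return {"method": "Fetch.continueRequest", "params": {"requestId": event["requestId"]}}


# Usage with a made-up checkout endpoint (assumption, not a real ClawBench task):
rule = FinalWriteRule(method="POST", host="shop.example.com", path_prefix="/api/checkout/submit")
browse = {"requestId": "r1", "request": {"method": "GET", "url": "https://shop.example.com/cart"}}
submit = {"requestId": "r2", "request": {"method": "POST", "url": "https://shop.example.com/api/checkout/submit"}}
print(on_request_paused(browse, rule)["method"])
print(on_request_paused(submit, rule)["method"])
```

Keeping the rule narrow (method, host, and path together) is what would let everything before the final submission run against the real site.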

This fills a gap that is hard to cover with static snapshots or simulators: the tasks are exactly what users do on the web every day, with full live DOM dynamics, login states, and anti-bot variability, but without real-world side effects.

## Links
- Paper: https://arxiv.org/abs/2604.08523
- Repo: https://github.com/reacher-z/ClawBench
- HF dataset: https://huggingface.co/datasets/NAIL-Group/ClawBench
- Site: https://claw-bench.com

If this fits AgentBench's web-environment coverage or works as a referenced comparison, I'm happy to contribute. No action is required.

## Disclosure
I am affiliated with the ClawBench project.
