Hi AgentBench team,
Thanks for AgentBench (ICLR'24) — it has been a foundational reference for the multi-environment agent evaluation community.
I'm flagging a potentially complementary benchmark for your radar:
ClawBench — live-production browser agent evaluation
Where AgentBench covers 8 environments (OS, DB, KG, card game, etc.) with static or simulated tasks, ClawBench focuses specifically on the browser environment and runs on live production websites.
- 153 everyday web tasks on 144 live production websites across 15 life categories (food delivery, job search, e-commerce, travel, and others; for example Grubhub, Indeed, Amazon, United)
- 7 frontier models evaluated; the top model passes 33.3% of tasks
- Unique mechanism: a submission-interception layer, built from a Chrome extension plus CDP (Chrome DevTools Protocol), blocks only the final write request. Agents can therefore execute real flows (checkout, job applications, forms) end-to-end on the live site without real-world side effects: no orders are placed and no applications are submitted
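To illustrate the interception idea in the last bullet, here is a minimal Python sketch. It is not ClawBench's actual extension or CDP code: the `Request` and `SubmissionRule` types, the glob pattern, and the example URLs are all hypothetical, and it only models the decision of which request to hold back.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase

# HTTP methods that can change server-side state.
WRITE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}


@dataclass(frozen=True)
class Request:
    """Hypothetical stand-in for an outgoing browser request."""
    method: str
    url: str


@dataclass(frozen=True)
class SubmissionRule:
    """Hypothetical per-task rule: a glob for the final submission endpoint."""
    url_pattern: str


def should_block(request: Request, rule: SubmissionRule) -> bool:
    # Block only write requests that hit the task's final submission endpoint.
    is_write = request.method.upper() in WRITE_METHODS
    return is_write and fnmatchcase(request.url, rule.url_pattern)


def run_flow(requests, rule):
    """Forward requests in order until the final submission is intercepted.

    Returns (forwarded_requests, blocked_request_or_None).
    """
    forwarded = []
    for request in requests:
        if should_block(request, rule):
            return forwarded, request  # stop here: nothing reaches the server
        forwarded.append(request)
    return forwarded, None


if __name__ == "__main__":
    rule = SubmissionRule(url_pattern="https://shop.example/api/orders*")
    flow = [
        Request("GET", "https://shop.example/cart"),
        Request("POST", "https://shop.example/api/cart/add"),  # intermediate write, allowed
        Request("POST", "https://shop.example/api/orders"),    # final write, blocked
    ]
    forwarded, blocked = run_flow(flow, rule)
    print(len(forwarded), blocked)
```

In this sketch, intermediate writes such as adding an item to the cart still go through, and only the request matching the submission rule is withheld, which is the property the bullet describes.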
ClawBench fills a gap that is hard to cover with static snapshots or simulators: the tasks are what users do on the web every day, with full live DOM dynamics, login states, and anti-bot variability, but without real-world side effects.
Links
If this fits within AgentBench's web-environment coverage or as a referenced comparison, I'm happy to contribute. No action is required.
Disclosure
I am affiliated with the ClawBench project.