
Comparison benchmark: ClawBench (live-site browser agent eval with submission-interception) #221

@reacher-z

Description


Hi AgentBench team,

Thanks for AgentBench (ICLR'24) — it has been a foundational reference for the multi-environment agent evaluation community.

I'm flagging a potentially complementary benchmark for your radar:

ClawBench — live-production browser agent evaluation

Where AgentBench covers 8 environments (OS, DB, KG, card game, etc.) with static or simulated tasks, ClawBench focuses specifically on the browser environment and runs on live production websites.

  • 153 everyday web tasks on 144 live production websites across 15 life categories (food delivery, job search, e-commerce, travel, etc. — Grubhub, Indeed, Amazon, United, and so on)
  • 7 frontier models evaluated; the top model passes 33.3% of tasks
  • Unique mechanism: a Chrome extension plus a CDP (Chrome DevTools Protocol) submission-interception layer blocks only the final write request. Agents can therefore execute real flows (checkout, job applications, forms) end-to-end on the live site without triggering real-world side effects: no actual orders placed, no real applications submitted

This fills a gap that is hard to cover with static snapshots or simulators: the tasks are exactly what users do on the web every day, with full live DOM dynamics, login states, and anti-bot variability, but without the externality problem.
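To make the interception idea concrete, here is a minimal sketch of the request-filtering decision such a layer would need. Everything here is illustrative: the method set, URL patterns, and function name are assumptions for exposition, not ClawBench's actual implementation.

```python
# Hypothetical sketch of a submission-interception filter: decide whether
# an outgoing request is a "final write" that must be blocked, while all
# reads and intermediate navigation pass through to the live site.

from urllib.parse import urlparse

# HTTP methods that can mutate server state (standard semantics)
WRITE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}

# Illustrative URL-path substrings marking a final submission (assumption)
FINAL_WRITE_PATTERNS = ("/checkout", "/order", "/apply", "/submit")

def should_block(method: str, url: str) -> bool:
    """Return True if the request looks like a final write that should be
    intercepted rather than forwarded to the production website."""
    if method.upper() not in WRITE_METHODS:
        return False  # GET/HEAD and other reads always pass through
    path = urlparse(url).path.lower()
    return any(pattern in path for pattern in FINAL_WRITE_PATTERNS)
```

In a real CDP-based setup, a predicate like this would run inside a `Fetch.requestPaused` event handler: blocked requests would be answered with `Fetch.failRequest` (or a synthetic response), and everything else forwarded with `Fetch.continueRequest`, so the agent sees a fully live site right up to the final write.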

Links

If this fits in AgentBench's web-environment coverage or as a referenced comparison, happy to contribute. No action required — just flagging for your radar.

Disclosure

I am affiliated with the ClawBench project.
