PinchBench

PinchBench measures how well LLMs perform as the brain of an OpenClaw agent. Instead of synthetic tests, we throw real tasks at agents: scheduling meetings, writing code, triaging email, researching topics, and managing files.

What makes it different:

  • Tool usage — Can the model call the right tools with the right parameters?
  • Multi-step reasoning — Can it chain together actions to complete complex tasks?
  • Real-world messiness — Can it handle ambiguous instructions and incomplete information?
  • Practical outcomes — Did it actually create the file, send the email, or schedule the meeting?
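To make the tool-usage criterion concrete, here is a minimal sketch of how a single tool call might be scored against an expected call. The function name, scoring weights, and dict structure are illustrative assumptions, not PinchBench's actual API.

```python
# Hypothetical scoring sketch — names and structures are illustrative,
# not the real PinchBench implementation.

def score_tool_call(actual: dict, expected: dict) -> float:
    """Return 1.0 for a fully correct call, partial credit for the right
    tool with some wrong parameters, and 0.0 for the wrong tool."""
    if actual.get("tool") != expected["tool"]:
        return 0.0
    params = expected.get("params", {})
    if not params:
        return 1.0
    matched = sum(
        1 for key, value in params.items()
        if actual.get("params", {}).get(key) == value
    )
    # Half credit for the right tool, half scaled by parameter accuracy.
    return 0.5 + 0.5 * (matched / len(params))

# Example: right tool, but one of two parameters is wrong.
call = {"tool": "calendar.create_event",
        "params": {"title": "Standup", "time": "10:00"}}
want = {"tool": "calendar.create_event",
        "params": {"title": "Standup", "time": "09:00"}}
print(score_tool_call(call, want))  # → 0.75
```

A real harness would also check the practical outcome (did the event actually land on the calendar?), but this captures the "right tool, right parameters" check in isolation.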

Results are collected on a public leaderboard at pinchbench.com.

Built with 🦀 as part of the OpenClaw ecosystem.

© 2026 | Brendan O'Leary

The views here are mine alone: not my employer's, not anyone else's.