PinchBench measures how well large language models perform as the brain of an OpenClaw agent. Instead of synthetic tests, it runs agents through real tasks: scheduling meetings, writing code, triaging email, researching topics, and managing files.
What makes it different:
- Tool usage — Can the model call the right tools with the right parameters?
- Multi-step reasoning — Can it chain together actions to complete complex tasks?
- Real-world messiness — Can it handle ambiguous instructions and incomplete information?
- Practical outcomes — Did it actually create the file, send the email, or schedule the meeting?
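As a rough illustration of the first two checks, here is a minimal sketch of how a task might grade a single tool call. All names here (`ToolCall`, `Task`, `grade`) are hypothetical and not the actual PinchBench API:

```rust
// Hypothetical sketch: grading whether an agent called the right tool
// with the right parameters. Names are illustrative, not PinchBench's API.

#[derive(Debug, PartialEq)]
struct ToolCall {
    tool: String,
    // (parameter name, value) pairs the agent supplied
    args: Vec<(String, String)>,
}

struct Task {
    prompt: &'static str,
    expected: ToolCall,
}

/// True when the agent picked the expected tool with the expected parameters.
fn grade(task: &Task, observed: &ToolCall) -> bool {
    *observed == task.expected
}

fn main() {
    let task = Task {
        prompt: "Schedule a meeting with Sam for Friday at 3pm",
        expected: ToolCall {
            tool: "calendar.create_event".into(),
            args: vec![
                ("attendee".into(), "Sam".into()),
                ("time".into(), "Friday 15:00".into()),
            ],
        },
    };
    // An agent response that matches the expectation passes the task.
    let observed = ToolCall {
        tool: "calendar.create_event".into(),
        args: vec![
            ("attendee".into(), "Sam".into()),
            ("time".into(), "Friday 15:00".into()),
        ],
    };
    assert!(grade(&task, &observed));
    println!("task passed: {}", task.prompt);
}
```

Real tasks also check outcomes (did the file exist afterward, was the event on the calendar), which exact-match grading like this cannot capture on its own.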
Results are published on a public leaderboard at pinchbench.com.
Built with 🦀 as part of the OpenClaw ecosystem.