PinchBench measures how well large language models perform as the brain of an OpenClaw agent. Instead of synthetic tests, it runs agents through real tasks: scheduling meetings, writing code, triaging email, researching topics, and managing files.
What makes it different:
- Tool usage — Can the model call the right tools with the right parameters?
- Multi-step reasoning — Can it chain together actions to complete complex tasks?
- Real-world messiness — Can it handle ambiguous instructions and incomplete information?
- Practical outcomes — Did it actually create the file, send the email, or schedule the meeting?
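As a rough illustration of the first two checks, here is a minimal sketch of how a task might grade a single tool call. All names here (`ToolCall`, `Task`, `grade`) are hypothetical and not the actual PinchBench API:

```rust
// Hypothetical sketch: grading whether an agent called the right tool
// with the right parameters. Names are illustrative, not PinchBench's API.

#[derive(Debug, PartialEq)]
struct ToolCall {
    tool: String,
    // (parameter name, value) pairs the agent supplied
    args: Vec<(String, String)>,
}

struct Task {
    prompt: &'static str,
    expected: ToolCall,
}

/// True when the agent picked the expected tool with the expected parameters.
fn grade(task: &Task, observed: &ToolCall) -> bool {
    *observed == task.expected
}

fn main() {
    let task = Task {
        prompt: "Schedule a meeting with Sam for Friday at 3pm",
        expected: ToolCall {
            tool: "calendar.create_event".into(),
            args: vec![
                ("attendee".into(), "Sam".into()),
                ("time".into(), "Friday 15:00".into()),
            ],
        },
    };
    // An agent response that matches the expectation passes the task.
    let observed = ToolCall {
        tool: "calendar.create_event".into(),
        args: vec![
            ("attendee".into(), "Sam".into()),
            ("time".into(), "Friday 15:00".into()),
        ],
    };
    assert!(grade(&task, &observed));
    println!("task passed: {}", task.prompt);
}
```

Real tasks also check outcomes (did the file exist afterward, was the event on the calendar), which exact-match grading like this cannot capture on its own.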
Results are published on a public leaderboard at pinchbench.com.
Built with 🦀 as part of the OpenClaw ecosystem.