Measuring Reasoning with Evalchemy

January 30, 2025

If you can't measure it, you can't improve it. At Open Thoughts, we are on a mission to build the best open reasoning datasets (and therefore, the best open reasoning models). We are sharing everything publicly along the way, including the tools we are using to get there. Today we are releasing reasoning benchmarks in our model evaluation tool, Evalchemy.

Model evaluations are the key feedback signal in our experimental loop. Measuring the effectiveness of a particular data curation strategy tells us what works and what doesn't. These evaluations need to be reliable, repeatable, easy to use, and fast. This is why we built Evalchemy.

Evalchemy is a unified interface for evaluating post-trained LLMs. Built on EleutherAI's popular lm-evaluation-harness, it adds additional benchmarks and support for evaluating more API-based models.

As part of the Open Thoughts project, Evalchemy now includes the common reasoning benchmarks AIME24, AMC23, MATH500, LiveCodeBench, and GPQA-Diamond. The coding evaluations HumanEvalPlus, MBPPPlus, BigCodeBench, MultiPL-E, and CRUXEval have also joined the expanding list of available benchmarks.

| Method | Model Name | AIME2024 | MATH500 | GPQA-Diamond |
|---|---|---|---|---|
| Evalchemy Eval | DeepSeek-R1-Distill-Qwen-7B | 60.0 | 88.2 | 46.9 |
| R1 Report | DeepSeek-R1-Distill-Qwen-7B | 55.5 | 83.3 | 49.1 |
| Evalchemy Eval | gpt-4o-2024-08-06 | 10.0 | 75.8 | 46.5 |
| OpenAI Report | gpt-4o | 9.3 | 60.3 | 50.6 |
| Evalchemy Eval | o1-mini | 63.0 | 85.6 | 60.0 |
| OpenAI Report | o1-mini | - | 90.0 | 60.0 |
| Evalchemy Eval | DeepSeek-R1 | 86.7 | 91.6 | 71.2 |
| R1 Report | DeepSeek-R1 | 79.8 | 97.3 | 71.5 |

In the table above, we show our evaluation results on reasoning benchmarks for popular models, compared against the publicly reported numbers.
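Because Evalchemy builds on lm-evaluation-harness, runs like these can be scripted through the harness's Python API. The sketch below is illustrative only: the task name "aime24" and the exact entry point are assumptions, and Evalchemy's own task names and command-line interface may differ.

```python
# Illustrative sketch only: assumes lm-evaluation-harness is installed and
# that a reasoning task is registered under the name "aime24" (Evalchemy's
# actual task names and entry points may differ).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # local HuggingFace model backend
    model_args="pretrained=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    tasks=["aime24"],
    batch_size=1,
)

# Aggregate metrics for each task are stored under results["results"].
print(results["results"])
```

In the same spirit, hosted models such as gpt-4o can be evaluated by pointing the harness at one of its API backends instead of a local checkpoint.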

We are continuously improving Evalchemy. If there is a benchmark you would like to see added, please raise an issue on GitHub, or even better, open a pull request, as we encourage contributions from the community.