If you can't measure it, you can't improve it. At Open Thoughts, we are on a mission to build the best open reasoning datasets (and therefore, the best open reasoning models). We are sharing everything publicly along the way, including the tools we use to get there. Today we are releasing reasoning benchmarks in our model evaluation tool, Evalchemy.
Model evaluations are the key feedback signal in our experimental loop. Measuring the effectiveness of a particular data curation strategy tells us what works and what doesn't. These evaluations need to be reliable, repeatable, easy to use, and fast, which is why we built Evalchemy.
Evalchemy is a unified interface for evaluating post-trained LLMs. Built on EleutherAI's popular lm-evaluation-harness, it adds additional benchmarks and support for evaluating more API-based models.
As part of the Open Thoughts project, Evalchemy now includes the common reasoning benchmarks AIME24, AMC23, MATH500, LiveCodeBench, and GPQA-Diamond. The coding evaluations HumanEvalPlus, MBPPPlus, BigCodeBench, MultiPL-E, and CRUXEval have also joined the expanding list of available benchmarks.
| Source | Model | AIME24 | MATH500 | GPQA-Diamond |
|---|---|---|---|---|
| Evalchemy Eval | DeepSeek-R1-Distill-Qwen-7B | 60.0 | 88.2 | 46.9 |
| R1 Report | DeepSeek-R1-Distill-Qwen-7B | 55.5 | 83.3 | 49.1 |
| Evalchemy Eval | gpt-4o-2024-08-06 | 10.0 | 75.8 | 46.5 |
| OpenAI Report | gpt-4o | 9.3 | 60.3 | 50.6 |
| Evalchemy Eval | o1-mini | 63.0 | 85.6 | 60.0 |
| OpenAI Report | o1-mini | - | 90.0 | 60.0 |
| Evalchemy Eval | DeepSeek-R1 | 86.7 | 91.6 | 71.2 |
| R1 Report | DeepSeek-R1 | 79.8 | 97.3 | 71.5 |
The table above compares our Evalchemy evaluation results on these reasoning benchmarks for popular models against the publicly reported numbers.
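If you want to reproduce numbers like these, the sketch below shows roughly how an Evalchemy run is launched for a Hugging Face model. It assumes the lm-evaluation-harness-style command-line interface that Evalchemy builds on; the exact entry point, flags, and task identifiers (e.g., `GPQADiamond`) may differ, so consult the Evalchemy README before running.

```bash
# Minimal sketch of an Evalchemy run on reasoning benchmarks.
# The entry point, flags, and task names follow the lm-evaluation-harness
# conventions Evalchemy builds on; verify exact identifiers in the README.
python -m eval.eval \
    --model hf \
    --model_args "pretrained=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B" \
    --tasks AIME24,MATH500,GPQADiamond \
    --batch_size 2 \
    --output_path logs
```

API-based models can be evaluated in the same way by swapping the `--model` backend and `--model_args` for the corresponding API model.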
We are continuously improving Evalchemy. If there is a benchmark you would like to see added, please raise an issue on GitHub, or even better, open a pull request; we welcome contributions from the community.