If you can't measure it, you can't improve it. At Open Thoughts, we are on a mission to build the best open reasoning datasets (and therefore, the best open reasoning models). We are sharing everything publicly along the way, including the tools we use to get there. Today we are releasing reasoning benchmarks in our model evaluation tool, Evalchemy.
Model evaluations are the key feedback signal in our experimental loop. Measuring the effectiveness of a particular data curation strategy tells us what works and what doesn't. These evaluations need to be reliable, repeatable, easy to use, and fast, which is why we built Evalchemy.
Evalchemy is a unified interface for evaluating post-trained LLMs. Built on EleutherAI's popular lm-evaluation-harness, it adds additional benchmarks and support for evaluating more API-based models.
As part of the Open Thoughts project, Evalchemy now includes the common reasoning benchmarks AIME24, AMC23, MATH500, LiveCodeBench, and GPQA-Diamond. The coding evaluations HumanEvalPlus, MBPPPlus, BigCodeBench, MultiPL-E, and CRUXEval have also joined the expanding list of available benchmarks.
| Source | Model | AIME24 | MATH500 | GPQA-Diamond |
|---|---|---|---|---|
| Evalchemy Eval | DeepSeek-R1-Distill-Qwen-7B | 60.0 | 88.2 | 46.9 |
| R1 Report | DeepSeek-R1-Distill-Qwen-7B | 55.5 | 83.3 | 49.1 |
| Evalchemy Eval | gpt-4o-2024-08-06 | 10.0 | 75.8 | 46.5 |
| OpenAI Report | gpt-4o | 9.3 | 60.3 | 50.6 |
| Evalchemy Eval | o1-mini | 63.0 | 85.6 | 60.0 |
| OpenAI Report | o1-mini | - | 90.0 | 60.0 |
| Evalchemy Eval | DeepSeek-R1 | 86.7 | 91.6 | 71.2 |
| R1 Report | DeepSeek-R1 | 79.8 | 97.3 | 71.5 |
The table above compares our Evalchemy evaluation results on these reasoning benchmarks for popular models against the publicly reported numbers.
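If you want to reproduce numbers like these, the sketch below shows roughly how an Evalchemy run is launched for a Hugging Face model. It assumes the lm-evaluation-harness-style command-line interface that Evalchemy builds on; the exact entry point, flags, and task identifiers (e.g., `GPQADiamond`) may differ, so consult the Evalchemy README before running.

```bash
# Minimal sketch of an Evalchemy run on reasoning benchmarks.
# The entry point, flags, and task names follow the lm-evaluation-harness
# conventions Evalchemy builds on; verify exact identifiers in the README.
python -m eval.eval \
    --model hf \
    --model_args "pretrained=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B" \
    --tasks AIME24,MATH500,GPQADiamond \
    --batch_size 2 \
    --output_path logs
```

API-based models can be evaluated in the same way by swapping the `--model` backend and `--model_args` for the corresponding API model.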
We are continuously improving Evalchemy. If there is a benchmark you would like to see added, please raise an issue on GitHub, or even better, open a pull request; we welcome contributions from the community.