Scaling up Open Reasoning with OpenThinker-32B

February 12, 2025

We release OpenThinker-32B, a state-of-the-art open-data reasoning model. We show that powerful reasoning models can be trained by scaling data, verifying reasoning traces, and scaling model size. OpenThinker-32B outperforms existing open-data reasoning models on a host of reasoning benchmarks including math, code, and science.

| Model | Dataset Size | Open Data | AIME24 | AIME25 I | MATH500 | GPQA-D | LCBv2 |
|---|---|---|---|---|---|---|---|
| LIMO-32B | 0.8k | ✓ | 56.7 | 49.3 | 86.6 | 58.1 | 60.0 |
| s1-32B | 1k | ✓ | 36.0 | 25.3 | 84.8 | 50.5 | 40.9 |
| s1.1-32B | 1k | ✓ | 64.7 | 49.3 | 89.0 | 60.1 | 65.5 |
| R1-Distill-32B | 800k | ✗ | 76.7 | 55.9 | 89.4 | 57.6 | 71.2 |
| OpenThinker-32B | 114k | ✓ | 66.0 | 53.3 | 90.6 | 61.6 | 68.9 |

All reported numbers are computed using our open-source evaluation framework Evalchemy. Note that R1-Distill-32B is a closed-data model: it was finetuned from Qwen2.5-32B-Instruct on an 800k-sample dataset that reportedly contains 600k reasoning samples.

Data Curation

We train OpenThinker-32B on the same OpenThoughts-114k dataset as our earlier model, OpenThinker-7B. Using DeepSeek-R1, we collected reasoning traces and solution attempts for a curated mix of 173k questions. We are now releasing this raw data as the OpenThoughts-Unverified-173k dataset. The final step in the pipeline filters out samples whose reasoning traces fail verification. The full code used to construct the dataset is available in the open-thoughts GitHub repository.
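As a rough illustration, this filtering step amounts to keeping only the samples whose traces pass verification. The snippet below is a sketch rather than the actual pipeline code: the split name and the passes_verification predicate are placeholders (the real checks are described in the Verification section).

from datasets import load_dataset

# Load the raw, unverified data (split name assumed to be "train").
unverified = load_dataset("open-thoughts/OpenThoughts-Unverified-173k", split="train")

def passes_verification(sample) -> bool:
    # Placeholder predicate: in the real pipeline, code samples are checked
    # against test cases and math samples are graded by an LLM judge.
    return bool(sample.get("verified", False))

# Keep only the verified samples, yielding the final 114k-style subset.
verified = unverified.filter(passes_verification)
print(len(unverified), "->", len(verified))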

OpenThoughts-114k Data Curation Diagram

As requested by the community, we have updated the final OpenThoughts-114k dataset to contain a "metadata" configuration that includes separate columns for:

  • problem
  • ground_truth_solution
  • test_cases (code only)
  • starter_code (code only)
  • deepseek_reasoning
  • deepseek_solution
  • domain
  • source

The additional metadata will make it easier to use this dataset in new ways such as filtering, swapping out domains, checking verification, and changing the reasoning trace templating.

from datasets import load_dataset

load_dataset("open-thoughts/OpenThoughts-114k", "metadata", split="train")
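The metadata columns also make domain-level slicing straightforward. A minimal sketch, assuming the domain column uses a label such as "code" (inspect the actual values with ds.unique("domain")):

from datasets import load_dataset

ds = load_dataset("open-thoughts/OpenThoughts-114k", "metadata", split="train")

# Keep only the code samples; "code" is an assumed label for the domain column.
code_only = ds.filter(lambda x: x["domain"] == "code")
print(len(code_only), "code samples")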

We are also excited to see the community use the problems and ground truth solutions for RL on top of the OpenThinker models, which DeepScaleR has shown to work particularly well at a smaller scale.

Verification

To obtain the final OpenThoughts-114k dataset, we verify the answers and eliminate incorrect responses. As shown below, keeping reasoning traces that fail verification can hurt performance, though the unverified model still compares well against other 32B reasoning models. Verification maintains the quality of the R1 annotations while scaling up the diversity and size of the training prompt set. On the other hand, unverified data can be scaled more easily, which makes it worth exploring further as well.

| Model | Dataset Size | AIME24 | AIME25 I | MATH500 | GPQA-D | LCBv2 |
|---|---|---|---|---|---|---|
| OpenThinker-7B | 114k | 31.3 | 30.7 | 84.4 | 38.9 | 41.8 |
| OpenThinker-7B-Unverified | 173k | 34.0 | 29.33 | 83.0 | 39.4 | 43.8 |
| OpenThinker-32B | 114k | 66.7 | 53.3 | 90.6 | 61.6 | 68.9 |
| OpenThinker-32B-Unverified | 173k | 60.7 | 44.0 | 90.0 | 60.6 | 69.2 |

Reasoning traces for code problems are verified by checking the solution attempt against existing test cases. Inspired by the challenges we faced during code execution, we implemented a code execution framework in Curator that lets users scalably and securely execute code and verify it against expected outputs. Math verification is determined by an LLM judge, which is given the ground-truth solution and the DeepSeek-R1 solution attempt. We found that using an LLM judge instead of a stricter parsing engine (Math-Verify) for verification during data generation results in a higher yield and leads to higher-performing downstream models.

| Data Generation Verifier | Dataset Size | Evaluation Verifier | AIME25 I | MATH500 |
|---|---|---|---|---|
| LLM judge (OpenThinker-7B) | 114k | Hendrycks (default) | 31.3 | 84.4 |
| LLM judge (OpenThinker-7B) | 114k | Math-Verify | 44.0 | 89.0 |
| Math-Verify | 83k | Hendrycks (default) | 23.0 | 55.0 |
| Math-Verify | 83k | Math-Verify | 22.7 | 82.2 |
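As a concrete illustration of the math-verification path described above, the sketch below shows an LLM judge comparing a ground-truth solution with a DeepSeek-R1 solution attempt. The prompt wording, the default judge model, and the OpenAI-compatible client are assumptions for illustration only; the actual verification is implemented with Curator in the open-thoughts repository.

from openai import OpenAI

client = OpenAI()  # any OpenAI-compatible endpoint

JUDGE_PROMPT = """You are a grading assistant. Compare the solution attempt to the
ground truth solution and answer YES if the final answers match, otherwise NO.

Ground truth solution:
{ground_truth}

Solution attempt:
{attempt}
"""

def llm_judge_verify(ground_truth: str, attempt: str, model: str = "gpt-4o-mini") -> bool:
    # Ask the judge model for a YES/NO verdict at temperature 0.
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            ground_truth=ground_truth, attempt=attempt)}],
        temperature=0.0,
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")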

Training

We finetune Qwen2.5-32B-Instruct on OpenThoughts-114k for 3 epochs with a 16k context length using LLaMA-Factory. Our full training configuration is provided in our repository. OpenThinker-32B was trained on four 8xH100 P5 nodes over 90 hours, totaling 2,880 H100 hours, on Toyota Research Institute's AWS SageMaker cluster. OpenThinker-32B-Unverified was trained on 96 4xA100 nodes (64 GB per GPU) over 30 hours, totaling 11,520 A100 hours, on the Leonardo Supercomputer.
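For reference, the key settings stated above and the compute totals can be summarized as follows. This is only a summary of the prose, not the actual LLaMA-Factory configuration file, and the 16,384-token interpretation of "16k" is an assumption.

# Summary of the training setup described above (not the real config file).
train_summary = {
    "base_model": "Qwen/Qwen2.5-32B-Instruct",
    "dataset": "open-thoughts/OpenThoughts-114k",
    "num_train_epochs": 3,
    "context_length": 16_384,  # assuming 16k = 16,384 tokens
}

# Compute totals quoted above.
h100_hours = 4 * 8 * 90    # four 8xH100 nodes for 90 hours  = 2,880
a100_hours = 96 * 4 * 30   # ninety-six 4xA100 nodes for 30 hours = 11,520
print(h100_hours, a100_hours)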

Evaluation

We evaluate all models using our open-source evaluation library Evalchemy. For AIME24 and AIME25, we average the results of five runs to compute accuracy. Our evaluation configuration uses a temperature of 0.7, restricts model responses to 32,768 tokens, adds no additional system or user prompts, and uses no special decoding strategy (e.g., budget forcing).
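The decoding settings above can be expressed as a simple sampling configuration, and the AIME scores are just the mean over five runs. The per-run numbers below are hypothetical and only illustrate the averaging; Evalchemy's actual configuration format may differ.

from statistics import mean

# Decoding settings described above, as a generic sampling config.
sampling_config = {
    "temperature": 0.7,
    "max_tokens": 32_768,
    "extra_prompts": None,       # no additional system or user prompts
    "decoding_strategy": None,   # no budget forcing or other special decoding
}

# AIME24/AIME25 accuracy is reported as the mean over five runs.
run_accuracies = [0.66, 0.70, 0.63, 0.67, 0.64]  # hypothetical per-run scores
print(f"reported accuracy: {mean(run_accuracies):.3f}")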

When we launched the OpenThoughts project, we set a goal of creating an open-data model that matches the performance of DeepSeek-R1-Distill-Qwen-32B. That gap is now almost closed. We are excited by the community's rapid progress over the last few weeks in building open-data reasoning models and look forward to continuing to build on each other's insights.

Citation

@misc{openthoughts,
  author = {Team, OpenThoughts},
  month = feb,
  title = {{Open Thoughts}},
  howpublished = {https://open-thoughts.ai},
  year = {2025}
}