Outperforming DeepSeek-R1-32B with OpenThinker2

April 3, 2025

Today, we are releasing OpenThinker2-32B and OpenThinker2-7B, two new state-of-the-art open-data reasoning models, along with their training dataset, OpenThoughts2-1M. Our models are trained solely with SFT on our curated data, without any RL.

| Model | Data | AIME24 | AIME25 | AMC23 | MATH500 | GPQA-D | LCBv2 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| OpenThinker2-32B | ✅ | 76.7 | 58.7 | 94.0 | 90.8 | 64.1 | 72.5 |
| OpenThinker-32B | ✅ | 68.0 | 49.3 | 95.5 | 90.6 | 63.5 | 68.6 |
| R1-Distill-32B | ❌ | 74.7 | 50.0 | 96.5 | 90.0 | 65.8 | 72.3 |
| Light-R1-32B | ✅ | 74.7 | 58.0 | 96.0 | 90.4 | 62.0 | 56.0 |
| S1.1-32B | ✅ | 59.3 | 42.7 | 91.5 | 87.4 | 62.0 | 58.7 |

When we launched the Open Thoughts project, our goal was to build an SFT dataset in the open and train a reasoning model at the level of DeepSeek-R1-Distill-Qwen-32B. We have now achieved that goal: averaged over our reasoning evaluations, OpenThinker2-32B outperforms DeepSeek-R1-Distill-Qwen-32B.

We used two approaches to create OpenThoughts2-1M by adding to OpenThoughts-114K:

  • Leveraging existing reasoning data from the open source community
  • Sourcing and generating new code and math reasoning data

| Model | Data | AIME24 | AIME25 | AMC23 | MATH500 | GPQA-D | LCBv2 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| OpenThinker2-7B | ✅ | 50.0 | 33.3 | 89.5 | 88.4 | 49.3 | 55.6 |
| OpenThinker-7B | ✅ | 31.3 | 23.3 | 74.5 | 83.2 | 42.9 | 38.0 |
| R1-Distill-7B | ❌ | 57.3 | 33.3 | 92.0 | 89.6 | 47.3 | 48.4 |
| OlympicCoder-7B | ✅ | 20.7 | 15.3 | 63.0 | 74.8 | 25.3 | 55.4 |
| OpenR1-Math-7B | ✅ | 48.7 | 34.7 | 88.5 | 87.8 | 21.2 | 9.5 |

Leveraging Existing Reasoning Data

The open source community has released a flurry of reasoning datasets in the last two months. We aimed to build on OpenThoughts-114K by adding external data to achieve greater diversity and scale. We finetuned Qwen2.5-7B-Instruct on GeneralThought-430K, OpenR1-Math, Llama-Nemotron-Post-Training-Dataset-v1, SYNTHETIC-1, and KodCode-V1, and measured downstream performance on our reasoning evaluation suite. Of these datasets, OpenR1-Math performed the best overall, so we include it in OpenThoughts2-1M.

| Dataset | Rows | AIME24 | AIME25 | AMC23 | MATH500 | GPQA-D | LCBv2 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| OpenThoughts-114K | 114k | 31.3 | 28.0 | 72.0 | 84.4 | 42.9 | 41.8 |
| GeneralThought-430K | 430k | 17.3 | 14.7 | 55.0 | 75.2 | 37.4 | 23.1 |
| OpenR1-Math-Raw | 669k | 46.7 | 30.7 | 80.5 | 86.2 | 44.4 | 16.8 |
| Nemotron | 1M | 8.7 | 1.3 | 38.0 | 61.0 | 32.8 | 18.5 |
| SYNTHETIC-1 | 900k | 18.7 | 16.7 | 61.0 | 77.2 | 37.9 | 29.0 |
| KodCode-V1 | 484k | 13.3 | 8.0 | 48.5 | 69.4 | 35.5 | 35.9 |

Generating New Reasoning Data

To further build on the OpenThoughts-114K and OpenR1-Math mix, we generated additional math and code reasoning data. To do this, we tried 26 different approaches for sourcing and generating math and code questions. For each strategy, we sampled 5,000 questions, distilled them with DeepSeek-R1, and finetuned Qwen2.5-7B-Instruct on the resulting data.
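The per-strategy step above can be sketched as a small loop (a minimal sketch; `fake_teacher` is a hypothetical stand-in for the DeepSeek-R1 distillation call, and the finetuning step is out of scope):

```python
import random


def sample_and_distill(question_pool, teacher, n=5000, seed=0):
    """Sample n questions from one strategy's pool and pair each with a
    teacher-generated reasoning trace."""
    rng = random.Random(seed)
    questions = rng.sample(question_pool, min(n, len(question_pool)))
    return [(q, teacher(q)) for q in questions]


# Hypothetical stand-in for the actual DeepSeek-R1 API call.
def fake_teacher(question):
    return f"<think>reasoning about: {question}</think> final answer"
```

The resulting (question, trace) pairs form the SFT data for that strategy's finetuning run.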

To determine the best data sources, we measure the downstream performance of each finetuned model on relevant reasoning benchmarks. For code sources, we measure LiveCodeBenchV2 and HumanEval. For math sources, we measure MATH500, AMC23, AIME24, and GPQA-Diamond.
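One simple way to turn per-benchmark scores into a ranking of sources, consistent with the procedure described above, is to average each finetuned model's accuracy over the relevant benchmarks (the scores below are made-up illustrative numbers, not our actual results):

```python
def rank_sources(results, benchmarks, top_k=3):
    """Rank data sources by mean accuracy over the chosen benchmarks."""
    def mean_score(source):
        return sum(results[source][b] for b in benchmarks) / len(benchmarks)
    return sorted(results, key=mean_score, reverse=True)[:top_k]


# Illustrative (made-up) scores for three hypothetical math sources.
results = {
    "source_a": {"AIME24": 40.0, "MATH500": 85.0},
    "source_b": {"AIME24": 30.0, "MATH500": 80.0},
    "source_c": {"AIME24": 45.0, "MATH500": 88.0},
}
print(rank_sources(results, ["AIME24", "MATH500"], top_k=2))
# → ['source_c', 'source_a']
```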

In these experiments, the top-performing math datasets are synthetic datasets from AutoMathInstruct, TigerLab, and Nvidia. We constructed AutoMathInstruct by searching for math-related data within AutoMathText and using gpt-4o-mini to form related questions. The top-performing code datasets are a mix of human coding questions (e.g. Code-290k-ShareGPT-Vicuna) and synthetic coding questions (e.g. CodeFeedback-Filtered-Instruction).

Using 30k questions from each of the top 5 code sources and 12.5k questions from each of the top 4 math sources, on top of our OpenThoughts-114K + OpenR1-Math mix, we create our final OpenThoughts2-1M dataset.
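Under the stated counts, assembling the new questions amounts to a fixed-size draw from each selected source (a sketch; the sampling details of the released pipeline may differ):

```python
import random


def assemble_mix(code_sources, math_sources,
                 per_code=30_000, per_math=12_500, seed=0):
    """Draw a fixed number of questions from each selected source pool."""
    rng = random.Random(seed)
    mix = []
    for pool in code_sources:
        mix += rng.sample(pool, min(per_code, len(pool)))
    for pool in math_sources:
        mix += rng.sample(pool, min(per_math, len(pool)))
    return mix
```

With 5 code sources at 30k each and 4 math sources at 12.5k each, this contributes 200k new questions on top of the OpenThoughts-114K + OpenR1-Math base.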

OpenThoughts2-1M

OpenThoughts2 is a combination of OpenThoughts-114k, verified reasoning traces from OpenR1-Math, and the questions from our best math and code sources. This is visualized in the diagram below. Our full OpenThoughts2-1M dataset is released on HuggingFace. We will soon be adding the data generation code for OpenThoughts2-1M to our GitHub repository.

OpenThoughts2-1M Data Curation Diagram

Evaluation Details

We evaluate OpenThinker2 on a set of popular reasoning benchmarks, running each benchmark multiple times (5× for AIME24, AIME25, and AMC23; 3× for LiveCodeBenchV2 and GPQA-Diamond) and reporting the average accuracy. We set the temperature to 0.7 and the maximum token length to 32,768 during sampling. Our training datasets are decontaminated by removing samples with over 90% indel similarity to evaluation problems. All evaluations are conducted using our open-source framework Evalchemy, which we detailed in our previous post on reasoning evaluations.
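Decontamination at a 90% similarity threshold can be approximated with the standard library (a sketch: `difflib.SequenceMatcher.ratio` computes 2·M/T over matching blocks, which closely tracks normalized indel similarity, though our released pipeline may use a different implementation):

```python
from difflib import SequenceMatcher


def decontaminate(train_samples, eval_problems, threshold=0.90):
    """Drop training samples too similar to any evaluation problem."""
    kept = []
    for sample in train_samples:
        contaminated = any(
            SequenceMatcher(None, sample, prob).ratio() >= threshold
            for prob in eval_problems
        )
        if not contaminated:
            kept.append(sample)
    return kept
```

For example, an exact copy of an evaluation problem scores 1.0 and is dropped, while an unrelated question falls well below the threshold and survives.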

Conclusion

OpenThoughts2-1M is a combination of OpenThoughts-114k, verified reasoning traces from OpenR1-Math, and our newly generated data. We finetune Qwen2.5-7B-Instruct and Qwen2.5-32B-Instruct on OpenThoughts2-1M to yield OpenThinker2-7B and OpenThinker2-32B. When compared with other reasoning models created from the same base, OpenThinker2-32B outperforms all other open-data models. Since all OpenThinker models have been trained only with SFT, we expect that RL post-training can further improve their performance.

We are excited for the research community to continue building together on these new reasoning models and datasets. If you have any questions or want to collaborate, feel free to raise an issue on our GitHub or reach out to us on X.

Citation

@misc{openthoughts,
  author = {Team, OpenThoughts},
  month = jan,
  title = {{Open Thoughts}},
  howpublished = {https://open-thoughts.ai},
  year = {2025}
}
