InternLM-Math-Plus
State-of-the-art bilingual open-sourced math reasoning LLMs.
A **solver**, **prover**, **verifier**, **augmentor**.
[GitHub](https://github.com/InternLM/InternLM-Math) [Demo](https://huggingface.co/spaces/internlm/internlm2-math-7b)
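For a quick start, here is a minimal sketch of querying the 7B chat model through Hugging Face Transformers. The repository id and the `chat` helper follow the usual InternLM2 remote-code interface and are assumptions here; adjust them to the checkpoint and hardware you actually use.

```python
# Minimal sketch (assumed repo id and InternLM2-style remote-code `chat` helper).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "internlm/internlm2-math-plus-7b"  # assumption: adjust to your checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # assumes a GPU with enough memory
    trust_remote_code=True,
).cuda().eval()

# Chain-of-thought style query; the same model is used as a prover, verifier,
# or augmentor by changing the instruction.
question = "Find the value of x that satisfies 2x + 3 = 11. Please reason step by step."
response, _history = model.chat(tokenizer, question, history=[])
print(response)
```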
News
- [2024.05.24] We release the updated version InternLM2-Math-Plus in 4 sizes with state-of-the-art performance: 1.8B, 7B, 20B, and 8x22B. We significantly improve informal math reasoning performance (chain-of-thought and code-interpreter) and formal math reasoning performance (LEAN 4 translation and LEAN 4 theorem proving).
- [2024.02.10] We add tech reports and citation reference.
- [2024.01.31] We add MiniF2F results with evaluation codes!
- [2024.01.29] We add checkpoints from ModelScope. Update results about majority voting and Code Interpreter. Tech report is on the way!
- [2024.01.26] We add checkpoints from OpenXLab, which makes downloading easier for users in China!
Performance
Formal Math Reasoning
We evaluate the performance of InternLM2-Math-Plus on the formal math reasoning benchmark MiniF2F-test. The evaluation setting is the same as Llemma's, with LEAN 4.
| Models | MiniF2F-test |
|---|---|
| ReProver | 26.5 |
| LLMStep | 27.9 |
| GPT-F | 36.6 |
| HTPS | 41.0 |
| Llemma-7B | 26.2 |
| Llemma-34B | 25.8 |
| InternLM2-Math-7B-Base | 30.3 |
| InternLM2-Math-20B-Base | 29.5 |
| InternLM2-Math-Plus-1.8B | 38.9 |
| InternLM2-Math-Plus-7B | 43.4 |
| InternLM2-Math-Plus-20B | 42.6 |
| InternLM2-Math-Plus-Mixtral8x22B | 37.3 |
Informal Math Reasoning
We evaluate the performance of InternLM2-Math-Plus on the informal math reasoning benchmarks MATH and GSM8K. InternLM2-Math-Plus-1.8B outperforms MiniCPM-2B in the smallest size setting. InternLM2-Math-Plus-7B outperforms Deepseek-Math-7B-RL, the previous state-of-the-art open-source math reasoning model. InternLM2-Math-Plus-Mixtral8x22B achieves 68.5 on MATH (with Python) and 91.8 on GSM8K.
| Model | MATH | MATH-Python | GSM8K |
|---|---|---|---|
| MiniCPM-2B | 10.2 | - | 53.8 |
| InternLM2-Math-Plus-1.8B | 37.0 | 41.5 | 58.8 |
| InternLM2-Math-7B | 34.6 | 50.9 | 78.1 |
| Deepseek-Math-7B-RL | 51.7 | 58.8 | 88.2 |
| InternLM2-Math-Plus-7B | 53.0 | 59.7 | 85.8 |
| InternLM2-Math-20B | 37.7 | 54.3 | 82.6 |
| InternLM2-Math-Plus-20B | 53.8 | 61.8 | 87.7 |
| Mixtral8x22B-Instruct-v0.1 | 41.8 | - | 78.6 |
| Eurux-8x22B-NCA | 49.0 | - | - |
| InternLM2-Math-Plus-Mixtral8x22B | 58.1 | 68.5 | 91.8 |
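For context on the MATH-Python column: in the code-interpreter setting the model writes a short Python program and the executed output is taken as its final answer, instead of finishing the derivation purely in text. The snippet below is a hedged illustration of the kind of program produced; the use of sympy is an assumption for the example, not a statement about the actual evaluation harness.

```python
# Illustration of a code-interpreter style solution: the program's printed
# output is treated as the model's final answer.
from sympy import symbols, solve

# Example problem: "The sum of two consecutive integers is 41. What is the larger one?"
n = symbols("n", integer=True)
solutions = solve(n + (n + 1) - 41, n)  # 2n + 1 = 41  ->  n = 20
larger = solutions[0] + 1
print(larger)  # 21
```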
We also evaluate models on MathBench-A. InternLM2-Math-Plus-Mixtral8x22B has performance comparable to Claude 3 Opus.
| Model | Arithmetic | Primary | Middle | High | College | Average |
|---|---|---|---|---|---|---|
| GPT-4o-0513 | 77.7 | 87.7 | 76.3 | 59.0 | 54.0 | 70.9 |
| Claude 3 Opus | 85.7 | 85.0 | 58.0 | 42.7 | 43.7 | 63.0 |
| Qwen-Max-0428 | 72.3 | 86.3 | 65.0 | 45.0 | 27.3 | 59.2 |
| Qwen-1.5-110B | 70.3 | 82.3 | 64.0 | 47.3 | 28.0 | 58.4 |
| Deepseek-V2 | 82.7 | 89.3 | 59.0 | 39.3 | 29.3 | 59.9 |
| Llama-3-70B-Instruct | 70.3 | 86.0 | 53.0 | 38.7 | 34.7 | 56.5 |
| InternLM2-Math-Plus-Mixtral8x22B | 77.5 | 82.0 | 63.6 | 50.3 | 36.8 | 62.0 |
| InternLM2-Math-20B | 58.7 | 70.0 | 43.7 | 24.7 | 12.7 | 42.0 |
| InternLM2-Math-Plus-20B | 65.8 | 79.7 | 59.5 | 47.6 | 24.8 | 55.5 |
| Llama3-8B-Instruct | 54.7 | 71.0 | 25.0 | 19.0 | 14.0 | 36.7 |
| InternLM2-Math-7B | 53.7 | 67.0 | 41.3 | 18.3 | 8.0 | 37.7 |
| Deepseek-Math-7B-RL | 68.0 | 83.3 | 44.3 | 33.0 | 23.0 | 50.3 |
| InternLM2-Math-Plus-7B | 61.4 | 78.3 | 52.5 | 40.5 | 21.7 | 50.9 |
| MiniCPM-2B | 49.3 | 51.7 | 18.0 | 8.7 | 3.7 | 26.3 |
| InternLM2-Math-Plus-1.8B | 43.0 | 43.3 | 25.4 | 18.9 | 4.7 | 27.1 |
Citation and Tech Report
@misc{ying2024internlmmath,
      title={InternLM-Math: Open Math Large Language Models Toward Verifiable Reasoning},
      author={Huaiyuan Ying and Shuo Zhang and Linyang Li and Zhejian Zhou and Yunfan Shao and Zhaoye Fei and Yichuan Ma and Jiawei Hong and Kuikun Liu and Ziyi Wang and Yudong Wang and Zijian Wu and Shuaibin Li and Fengzhe Zhou and Hongwei Liu and Songyang Zhang and Wenwei Zhang and Hang Yan and Xipeng Qiu and Jiayu Wang and Kai Chen and Dahua Lin},
      year={2024},
      eprint={2402.06332},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}