# InternLM-Math

State-of-the-art bilingual open-sourced math reasoning LLMs.
A **solver**, **prover**, **verifier**, **augmentor**.

[GitHub](https://github.com/InternLM/InternLM-Math) | [Demo](https://huggingface.co/spaces/internlm/internlm2-math-7b) | [Checkpoints](https://huggingface.co/internlm/internlm2-math-7b) | [OpenXLab](https://openxlab.org.cn/models/detail/OpenLMLab/InternLM2-Math-7B) | [ModelScope](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-math-7b/summary)
## News
- [2024.01.29] We have added checkpoints on ModelScope. The tech report is on the way!
- [2024.01.26] We have added checkpoints on OpenXLab, which makes downloading easier for users in China!
## Introduction
- 7B and 20B Chinese and English math LMs with performance better than ChatGPT. The InternLM2-Math models are continue-pretrained from InternLM2-Base with ~100B high-quality math-related tokens and SFT-ed with ~2M bilingual math supervised examples. We apply minhash and exact number match to decontaminate possible test set leakage (a minimal sketch of this check follows the list).
- Lean as a supported language for math problem solving and math theorem proving. We are exploring the combination of Lean 3 with InternLM-Math for verifiable math reasoning. InternLM-Math can generate Lean code for simple math reasoning tasks like GSM8K, or suggest possible proof tactics based on Lean states.
- Can also be viewed as a reward model, supporting Outcome/Process/Lean reward modeling. We supervise InternLM2-Math with various types of reward modeling data so that it can also verify chain-of-thought processes. We also add the ability to convert a chain-of-thought process into Lean 3 code.
- A math LM augment helper and code interpreter. InternLM2-Math can help augment math reasoning problems and solve them using the code interpreter, which lets you generate synthetic data more quickly!
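The decontamination step mentioned above (minhash plus exact number match) can be sketched as follows. This is a minimal illustration under assumed parameters (word 5-gram shingles, 64 permutations, a 0.8 similarity threshold), not the exact pipeline used for InternLM2-Math.

```python
import hashlib
import re

NUM_PERM = 64  # number of simulated hash permutations (illustrative)

def shingles(text, n=5):
    """Word n-grams used as the document's feature set."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(text):
    """One minimum per salted hash; matching positions estimate Jaccard similarity."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16) for s in shingles(text))
        for seed in range(NUM_PERM)
    ]

def extract_numbers(text):
    return re.findall(r"-?\d+(?:\.\d+)?", text)

def is_contaminated(train_doc, test_problem, threshold=0.8):
    sig_a = minhash_signature(train_doc)
    sig_b = minhash_signature(test_problem)
    jaccard_est = sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_PERM
    # Exact number match: identical, non-empty number sequences are suspicious.
    same_numbers = extract_numbers(train_doc) == extract_numbers(test_problem) != []
    return jaccard_est >= threshold or same_numbers
```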

## Models
InternLM2-Math-Base-7B and InternLM2-Math-Base-20B are pretrained checkpoints. InternLM2-Math-7B and InternLM2-Math-20B are SFT checkpoints.
## Performance

### Pretrain Performance
We evaluate pretrained checkpoints with greedy decoding and few-shot CoT (a sketch of this setup follows the table). Details of pretraining will be introduced in the tech report.
| Model | GSM8K | MATH |
|---|---|---|
| Llama2-7B | 11.8 | 3.2 |
| Llemma-7B | 36.4 | 18.0 |
| InternLM2-Base-7B | 36.5 | 8.6 |
| InternLM2-Math-Base-7B | 49.2 | 21.5 |
| Minerva-8B | 16.2 | 14.1 |
| InternLM2-Base-20B | 54.6 | 13.7 |
| InternLM2-Math-Base-20B | 63.7 | 27.3 |
| Llemma-34B | 51.5 | 25.0 |
| Minerva-62B | 52.4 | 27.6 |
| Minerva-540B | 58.8 | 33.6 |
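For concreteness, the evaluation setup described above can be sketched as below. The repo id, few-shot prefix, and answer-extraction pattern are placeholder assumptions; the exact prompts will be described in the tech report.

```python
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder few-shot prefix; in practice it holds several worked CoT examples.
FEW_SHOT_PROMPT = "Question: 2+4=?\nAnswer: 2 plus 4 makes 6. The answer is 6.\n\n"

tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-math-base-7b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "internlm/internlm2-math-base-7b", trust_remote_code=True,
    torch_dtype=torch.float16, device_map="auto",
).eval()

def greedy_cot_answer(question):
    prompt = FEW_SHOT_PROMPT + f"Question: {question}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=512, do_sample=False)  # greedy decoding
    completion = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    match = re.search(r"The answer is (-?\d+(?:\.\d+)?)", completion)
    return match.group(1) if match else None
```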
### SFT Performance
All performance numbers are based on greedy decoding with CoT. We notice that performance on Hungary has high variance between our different checkpoints, while the other results are very stable. This may be due to the small number of problems in the Hungary benchmark.
| Model | Model Type | GSM8K | MATH | Hungary |
|---|---|---|---|---|
| Qwen-7B-Chat | General | 51.7 | 11.6 | - |
| DeepSeek-7B-Chat | General | 63.0 | 15.8 | 28.5 |
| InternLM2-Chat-7B | General | 70.7 | 23.0 | - |
| ChatGLM3-6B | General | 53.8 | 20.4 | 32 |
| MetaMath-Mistral-7B | Mathematics | 77.7 | 28.2 | 29 |
| MetaMath-Llemma-7B | Mathematics | 69.2 | 30.0 | - |
| InternLM2-Math-7B | Mathematics | 78.1 | 34.6 | 55 |
| InternLM2-Chat-20B | General | 79.6 | 31.9 | - |
| MetaMath-Llemma-34B | Mathematics | 75.8 | 34.8 | - |
| InternLM2-Math-20B | Mathematics | 82.6 | 37.7 | 66 |
| Qwen-72B | General | 78.9 | 35.2 | 52 |
| DeepSeek-67B | General | 84.1 | 32.6 | 58 |
| ChatGPT (GPT-3.5) | General | 80.8 | 34.1 | 41 |
| GPT4 (First version) | General | 92.0 | 42.5 | 68 |
## Inference

```python
from modelscope import snapshot_download, AutoTokenizer, AutoModelForCausalLM
import torch

model_dir = snapshot_download("Shanghai_AI_Laboratory/internlm2-math-20b")
tokenizer = AutoTokenizer.from_pretrained(model_dir, device_map="auto", trust_remote_code=True)
# Set `torch_dtype=torch.float16` to load the model in float16; otherwise it is loaded as float32, which might cause an OOM error.
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, torch_dtype=torch.float16)
model = model.eval()
response, history = model.chat(tokenizer, "1+1=", history=[], meta_instruction="")
print(response)
```
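The same checkpoints are also hosted on Hugging Face (see the checkpoint link at the top). A sketch of the equivalent transformers-based loading, assuming the 7B repo id from that link:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-math-7b", trust_remote_code=True)
# float16 to avoid OOM, mirroring the ModelScope example above.
model = AutoModelForCausalLM.from_pretrained(
    "internlm/internlm2-math-7b", trust_remote_code=True,
    torch_dtype=torch.float16, device_map="auto",
).eval()
response, history = model.chat(tokenizer, "1+1=", history=[], meta_instruction="")
print(response)
```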
## Special usages
We list some of the instructions used in our SFT; you can use them to prompt the model. Other prompts may also work, but the following are recommended. InternLM2-Math may combine several of the following abilities, but this is not guaranteed.
Examples of these abilities (screenshots omitted):
- Translate a proof problem to Lean
- Use Lean 3 to solve a GSM8K problem
- Generate a problem based on Lean 3 code
- Play the 24-point game
- Augment a harder math problem
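As a rough illustration of the Lean 3 target format for GSM8K-style arithmetic, a statement like the following (hypothetical, not an actual model output):

```lean
-- GSM8K-style fact: 3 bags of 4 apples, minus 2 eaten, leaves 10.
-- `dec_trivial` discharges the decidable arithmetic goal in core Lean 3.
example : 3 * 4 - 2 = 10 := dec_trivial
```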

| Description | Query |
|---|---|
| Solving question via chain-of-thought | {Question} |
| Solving question via Lean 3 | {Question}\nSolve this via Lean 3 |
| Outcome reward model | Given a question and an answer, check is it correct?\nQuestion:{Question}\nAnswer:{COT} |
| Process reward model | Given a question and an answer, check correctness of each step.\nQuestion:{Question}\nAnswer:{COT} |
| Reward model | Given a question and two answers, which one is better? \nQuestion:{Question}\nAnswer 1:{COT}\nAnswer 2:{COT} |
| Convert chain-of-thought to Lean 3 | Convert this answer into Lean 3. Question:{Question}\nAnswer:{COT} |
| Convert Lean 3 to chain-of-thought | Convert this lean 3 code into a natural language problem with answers:\n{LEAN Code} |
| Translate question and chain-of-thought answer to a proof statement | Convert this question and answer into a proof format.\nQuestion:{Question}\nAnswer:{COT} |
| Translate proof problem to Lean 3 | Convert this natural language statement into a Lean 3 theorem statement:{Theorem} |
| Translate Lean 3 to proof problem | Convert this Lean 3 theorem statement into natural language:{STATEMENT} |
| Suggest a tactic based on Lean state | Given the Lean 3 tactic state, suggest a next tactic:\n{LEAN State} |
| Rephrase Problem | Describe this problem in another way. {Question} |
| Augment Problem | Please augment a new problem based on: {Question} |
| Augment a harder Problem | Increase the complexity of the problem: {Question} |
| Change specific numbers | Change specific numbers: {Question} |
| Introduce fractions or percentages | Introduce fractions or percentages: {Question} |
| Code Interpreter | lagent |
| In-context Learning | Question:{Question}\nAnswer:{COT}\n…Question:{Question}\nAnswer:{COT} |
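A minimal sketch of issuing one of these instructions, reusing `model` and `tokenizer` from the inference example above; the theorem text is a placeholder, and the query string follows the table row verbatim.

```python
theorem = "The product of two odd numbers is odd."
query = f"Convert this natural language statement into a Lean 3 theorem statement:{theorem}"
response, _ = model.chat(tokenizer, query, history=[], meta_instruction="")
print(response)
```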
## Fine-tune and others
Please refer to InternLM.
## Known issues
Our model is still under development and will be upgraded. There are some possible issues with InternLM-Math; if you find that some abilities perform poorly, feel free to open an issue.
- Skipping calculation steps.
- Performing badly on Chinese fill-in-the-blank problems and English multiple-choice problems due to SFT data composition.
- Tending to invoke the code interpreter when facing Chinese problems, due to SFT data composition.
- The reward model mode can be better leveraged with assigned token probabilities.
- Code-switching due to SFT data composition.
- Some Lean abilities are only adapted to GSM8K-like problems (e.g., converting chain-of-thought to Lean 3), and Lean-related performance is not guaranteed.
## Citation and Tech Report
To be appended.