# Model Card for CodeFuse-CodeLlama-34B-4bits

[中文] [English]

## Model Description
CodeFuse-CodeLlama-34B-4bits is the 4-bit quantized version of CodeFuse-CodeLlama-34B, a 34B code LLM fine-tuned on multiple code tasks (600k instructions/answers) over the base model CodeLlama-34b-Python.

After undergoing 4-bit quantization, the CodeFuse-CodeLlama-34B-4bits model can be loaded on either a single A10 (24GB VRAM) or an RTX 4090 (24GB VRAM). Moreover, the quantized model still achieves an impressive accuracy of 73.8% on the HumanEval pass@1 metric.
## News and Updates

2023-09-28: The CodeFuse-CodeLlama-34B 4-bit technical documentation has been released. If you are interested, please follow the link to view it on the CodeFuse WeChat official account: https://mp.weixin.qq.com/s/QLycLdgPGQjF7JE_YF466Q

2023-09-26: We are pleased to announce the release of the 4-bit quantized version of CodeFuse-CodeLlama-34B. Despite the quantization process, the model still achieves a remarkable 73.8% accuracy (greedy decoding) on the HumanEval pass@1 metric.

2023-09-11: CodeFuse-CodeLlama-34B has achieved 74.4% pass@1 (greedy decoding) on HumanEval, which is the SOTA result for open-sourced LLMs at present.
## Code Community

Homepage: https://github.com/codefuse-ai (Please give us your support with a Star + Fork + Watch)
## Performance

### Code
| Model                            | HumanEval (pass@1) | Date    |
|:---------------------------------|:------------------:|:-------:|
| **CodeFuse-CodeLlama-34B**        | **74.4%**          | 2023.9  |
| **CodeFuse-CodeLlama-34B-4bits**  | **73.8%**          | 2023.9  |
| WizardCoder-Python-34B-V1.0       | 73.2%              | 2023.8  |
| GPT-4 (zero-shot)                 | 67.0%              | 2023.3  |
| PanGu-Coder2 15B                  | 61.6%              | 2023.8  |
| CodeLlama-34b-Python              | 53.7%              | 2023.8  |
| CodeLlama-34b                     | 48.8%              | 2023.8  |
| GPT-3.5 (zero-shot)               | 48.1%              | 2022.11 |
| OctoCoder                         | 46.2%              | 2023.8  |
| StarCoder-15B                     | 33.6%              | 2023.5  |
| LLaMA 2 70B (zero-shot)           | 29.9%              | 2023.7  |
### GPU Memory Usage
We measured the GPU memory usage after loading the model, as well as the memory usage when encoding 2048/1024 tokens and generating 1024/2048 tokens. The results are presented in the table below.
| Precision | Idle Model | Encoding 2048 tokens and Generating 1024 tokens | Encoding 1024 tokens and Generating 2048 tokens |
|:----------|:----------:|:-----------------------------------------------:|:-----------------------------------------------:|
| bfloat16  | 64.89GB    | 69.31GB                                          | 66.41GB                                          |
| int4      | 19.09GB    | 22.19GB                                          | 20.78GB                                          |
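The figures above are measured values. As a rough way to reproduce comparable numbers, the sketch below reads the peak memory seen by PyTorch's CUDA caching allocator around loading and generation; note that allocator peaks can differ from the totals reported by `nvidia-smi`, and the helper name here is illustrative only, not part of this repository.

```python
import torch

def peak_gpu_gb(reset: bool = False) -> float:
    """Peak CUDA memory allocated by PyTorch, in GB; optionally reset the counter."""
    gb = torch.cuda.max_memory_allocated() / 1024 ** 3
    if reset:
        torch.cuda.reset_peak_memory_stats()
    return gb

# Hypothetical usage with the model loaded as in the Quickstart below:
# model, tokenizer = load_model_tokenizer(model_dir)
# print(f"idle model: {peak_gpu_gb(reset=True):.2f} GB")
# generated = model.generate(input_ids=input_ids, max_new_tokens=1024)
# print(f"encode + generate: {peak_gpu_gb():.2f} GB")
```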
## Requirements
- python>=3.8
- pytorch>=2.0.0
- transformers==4.32.0
- auto_gptq==0.4.2
- Sentencepiece
- CUDA 11.4
## Inference String Format
The inference string is a concatenated string formed by combining conversation data (human and bot contents) in the training data format. It is used as input during the inference process.
Here is an example format of the concatenated string:
"""
<|role_start|>huma<|role_ed|>Huma 1st roud iput
<|role_start|>bot<|role_ed|>Bot 1st roud output</s>
<|role_start|>huma<|role_ed|>Huma 2d roud iput
<|role_start|>bot<|role_ed|>Bot 2d roud output</s>
...
...
...
<|role_start|>huma<|role_ed|>Huma th roud iput
<|role_start|>bot<|role_ed|>{Bot output to be gereated}</s>
"""
When applying inference, always make your input string end with "<|role_start|>bot<|role_end|>" to ask the model to generate answers.
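For multi-turn conversations, the concatenation can be produced with a small helper such as the hypothetical sketch below; the function name and the history representation are illustrative, while the `<|role_start|>`/`<|role_end|>` markers, the `</s>` terminator, and the trailing bot header follow the format above.

```python
def build_prompt(history, human_input):
    """Concatenate finished (human, bot) turns plus the new human input
    into the inference string format, ending with the bot role header."""
    prompt = ""
    for human, bot in history:
        prompt += f"<|role_start|>human<|role_end|>{human}"
        prompt += f"<|role_start|>bot<|role_end|>{bot}</s>"
    prompt += f"<|role_start|>human<|role_end|>{human_input}"
    prompt += "<|role_start|>bot<|role_end|>"  # the model generates from here
    return prompt

# First round: no history yet.
# build_prompt([], "Please write a QuickSort program in Python\n")
```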
## Quickstart
```bash
git clone https://www.modelscope.cn/codefuse-ai/CodeFuse-CodeLlama-34B-4bits.git
```

```bash
pip install -r requirements.txt
```

```python
import os
import torch
import time
from modelscope import AutoTokenizer, snapshot_download
from auto_gptq import AutoGPTQForCausalLM

os.environ["TOKENIZERS_PARALLELISM"] = "false"

def load_model_tokenizer(model_path):
    """
    Load the model and tokenizer based on the given model name or local path of the downloaded model.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_path,
                                              trust_remote_code=True,
                                              use_fast=False,
                                              legacy=False)
    tokenizer.padding_side = "left"
    tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids("<unk>")   # pad with the LLaMA "<unk>" token
    tokenizer.eos_token_id = tokenizer.convert_tokens_to_ids("</s>")    # "</s>" is the terminator shown in the format above
    model = AutoGPTQForCausalLM.from_quantized(model_path,
                                               inject_fused_attention=False,
                                               inject_fused_mlp=False,
                                               use_safetensors=False,
                                               use_cuda_fp16=True,
                                               disable_exllama=False,
                                               device_map='auto'   # Support multi-gpus
                                               )
    return model, tokenizer


def inference(model, tokenizer, prompt):
    """
    Use the given model and tokenizer to generate an answer for the specified prompt.
    """
    st = time.time()
    prompt = prompt if prompt.endswith('\n') else f'{prompt}\n'
    inputs = f"<|role_start|>human<|role_end|>{prompt}<|role_start|>bot<|role_end|>"

    input_ids = tokenizer.encode(inputs,
                                 return_tensors="pt",
                                 padding=True,
                                 add_special_tokens=False).to("cuda")
    with torch.no_grad():
        generated_ids = model.generate(
            input_ids=input_ids,
            top_p=0.95,
            temperature=0.1,
            do_sample=True,
            max_new_tokens=512,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id
        )
    print(f'generated tokens num is {len(generated_ids[0][input_ids.size(1):])}')
    outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    print(f'generate text is {outputs[0][len(inputs): ]}')
    latency = time.time() - st
    print('latency is {} seconds'.format(latency))


if __name__ == "__main__":
    model_dir = snapshot_download('codefuse-ai/CodeFuse-CodeLlama-34B-4bits', revision='v1.0.0')

    prompt = 'Please write a QuickSort program in Python'

    model, tokenizer = load_model_tokenizer(model_dir)
    inference(model, tokenizer, prompt)
```
**The current inference example code is based on [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ). If you want to achieve higher inference speed, it is recommended to combine it with [TensorRT-LLM (Early Access)](https://developer.nvidia.com/tensorrt-llm-early-access).**
<br>
## Consistency Check
Here, SHA256 values are provided for the model-related files for consistency checks during download.
| File | SHA256 |
|-------------------------------:|:--------------------------------:|
|config.json | bd1b92f942549f76d7e02e65fd346b39903943912d6d6a2ff8ff345e43e1115b |
|generation_config.json | b625bd13a52d0685313c32919324b9bdc9e75a4f1338ca5c28226d1693e130a3 |
|gptq_model-4bit-64g.bin | 79441bad1d5ab852d0238ed7e113b9912f31189cf9181d7119dd297c4beb454a |
|pytorch_model.bin.index.json | 9a714170172282cfbcaa120af13c0df08b06d040ff24dab30229d8a010821d3d |
|quantize_config.json | 3c1744a928e9d6c3f9a2cbb1bb5a89539077e7d456948bf5aee0deed6a7b8028 |
|special_tokens_map.json | ff3b4a612c4e447acb02d40071bddd989fe0da87eb5b7fe0dbadfc4f74de7531 |
|tokenizer.json | f7b50bcf6d6672eade5e43514d48e9c1e4e63a56aef7b14acdaca94ce93436f7 |
|tokenizer.model | 9e556afd44213b6bd1be2b850ebbbd98f5481437a8021afaf58ee7fb1818d347 |
|tokenizer_config.json | c12441e82f2dce0baff87cf5948e82d6e9b51cc0b5266369c30c319fb771eeb2 |
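As one way to run this check, the sketch below streams each downloaded file through Python's standard `hashlib` and compares the digest against the table above; the `expected` mapping and the directory path are placeholders you would fill in yourself.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# expected = {"config.json": "bd1b92f9...", "tokenizer.model": "9e556afd..."}  # values from the table above
# model_dir = Path("CodeFuse-CodeLlama-34B-4bits")                             # local download directory
# for name, digest in expected.items():
#     print(name, "OK" if sha256_of(model_dir / name) == digest else "MISMATCH")
```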
<br>
<br>
<a id="chiese"></a>
## Model Description
CodeFuse-CodeLlama-34B-4bits is the 4-bit quantized version of the CodeFuse-CodeLlama-34B model, which is a code LLM obtained by fine-tuning the base model CodeLlama-34b-Python on multiple code tasks with QLoRA; the model input length is 4K.
After 4-bit quantization, CodeFuse-CodeLlama-34B-4bits can be loaded on a single A10 (24GB VRAM) or an RTX 4090 (24GB VRAM), and the quantized model still achieves 73.8% on the HumanEval pass@1 metric.
<br>
## News

2023-09-28: The CodeFuse-CodeLlama-34B 4-bit technical documentation has been published; if you are interested, see the CodeFuse WeChat official account: https://mp.weixin.qq.com/s/QLycLdgPGQjF7JE_YF466Q

2023-09-26: The 4-bit quantized version of CodeFuse-CodeLlama-34B has been released; the quantized model reaches 73.8% on HumanEval pass@1 (greedy decoding).

2023-09-11: CodeFuse-CodeLlama-34B has been released, reaching 74.4% on HumanEval pass@1 (greedy decoding), the current open-source SOTA.
<br>
## Code Community
**Homepage**: https://github.com/codefuse-ai (**Please support us with a Star + Fork + Watch**)
+ If you want to fine-tune the model yourself, visit ✨[MFTCoder](https://github.com/codefuse-ai/MFTCoder)✨✨
+ If you want to deploy the model yourself, visit ✨[FasterTransformer4CodeFuse](https://github.com/codefuse-ai/FasterTransformer4CodeFuse)✨✨
+ If you want to see a demo of the model, visit ✨[CodeFuse Demo](https://github.com/codefuse-ai/codefuse)✨✨
<br>
## Performance (Code)

| Model | HumanEval (pass@1) | Date |
|:--------------------------------|:-----------------:|:-------:|
| **CodeFuse-CodeLlama-34B** | **74.4%** | 2023.9 |
|**CodeFuse-CodeLlama-34B-4bits** | **73.8%** | 2023.9 |
| WizardCoder-Python-34B-V1.0 | 73.2% | 2023.8 |
| GPT-4 (zero-shot) | 67.0% | 2023.3 |
| PanGu-Coder2 15B | 61.6% | 2023.8 |
| CodeLlama-34b-Python | 53.7% | 2023.8 |
| CodeLlama-34b | 48.8% | 2023.8 |
| GPT-3.5 (zero-shot) | 48.1% | 2022.11 |
| OctoCoder | 46.2% | 2023.8 |
| StarCoder-15B | 33.6% | 2023.5 |
| LLaMA 2 70B (zero-shot) | 29.9% | 2023.7 |
<br>
## GPU Memory Usage
We measured the GPU memory usage after loading the model, as well as when inputting 2048/1024 tokens and outputting 1024/2048 tokens, as shown in the table below.

| Precision | Idle Model | Input 2048 tokens + Output 1024 tokens | Input 1024 tokens + Output 2048 tokens |
|:--------------------------------|:-------------------|:------------------------:|:------------:|
|bfloat16 | 64.89GB | 69.31GB | 66.41GB |
|int4 | 19.09GB | 22.19GB | 20.78GB |
<br>
## Requirements
* python>=3.8
* pytorch>=2.0.0
* transformers==4.32.0
* auto_gptq==0.4.2
* Sentencepiece
* CUDA 11.4
<br>
## Inference Data Format
The inference data is a string concatenated from conversation data (human and bot contents) in the training data format; the prompt should be concatenated the same way at inference time:

```python
"""
<|role_start|>human<|role_end|>Human 1st round input
<|role_start|>bot<|role_end|>Bot 1st round output</s>
<|role_start|>human<|role_end|>Human 2nd round input
<|role_start|>bot<|role_end|>Bot 2nd round output</s>
...
...
...
<|role_start|>human<|role_end|>Human nth round input
<|role_start|>bot<|role_end|>{Bot output to be generated}</s>
"""
```

At inference time, make sure the concatenated prompt string ends with "<|role_start|>bot<|role_end|>" to guide the model to generate an answer.
<br>
## Quickstart

```bash
git clone https://www.modelscope.cn/codefuse-ai/CodeFuse-CodeLlama-34B-4bits.git
```

```bash
pip install -r requirements.txt
```

```python
import os
import torch
import time
from modelscope import AutoTokenizer, snapshot_download
from auto_gptq import AutoGPTQForCausalLM

os.environ["TOKENIZERS_PARALLELISM"] = "false"

def load_model_tokenizer(model_path):
    """
    Load the model and tokenizer based on the given model name or local path of the downloaded model.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_path,
                                              trust_remote_code=True,
                                              use_fast=False,
                                              legacy=False)
    tokenizer.padding_side = "left"
    tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids("<unk>")   # pad with the LLaMA "<unk>" token
    tokenizer.eos_token_id = tokenizer.convert_tokens_to_ids("</s>")    # "</s>" is the terminator shown in the format above
    model = AutoGPTQForCausalLM.from_quantized(model_path,
                                               inject_fused_attention=False,
                                               inject_fused_mlp=False,
                                               use_safetensors=False,
                                               use_cuda_fp16=True,
                                               disable_exllama=False,
                                               device_map='auto'   # Support multi-gpus
                                               )
    return model, tokenizer


def inference(model, tokenizer, prompt):
    """
    Use the given model and tokenizer to generate an answer for the specified prompt.
    """
    st = time.time()
    prompt = prompt if prompt.endswith('\n') else f'{prompt}\n'
    inputs = f"<|role_start|>human<|role_end|>{prompt}<|role_start|>bot<|role_end|>"

    input_ids = tokenizer.encode(inputs,
                                 return_tensors="pt",
                                 padding=True,
                                 add_special_tokens=False).to("cuda")
    with torch.no_grad():
        generated_ids = model.generate(
            input_ids=input_ids,
            top_p=0.95,
            temperature=0.1,
            do_sample=True,
            max_new_tokens=512,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id
        )
    print(f'generated tokens num is {len(generated_ids[0][input_ids.size(1):])}')
    outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    print(f'generate text is {outputs[0][len(inputs): ]}')
    latency = time.time() - st
    print('latency is {} seconds'.format(latency))


if __name__ == "__main__":
    model_dir = snapshot_download('codefuse-ai/CodeFuse-CodeLlama-34B-4bits', revision='v1.0.0')

    prompt = '请用Python实现一个快速排序算法'

    model, tokenizer = load_model_tokenizer(model_dir)
    inference(model, tokenizer, prompt)
```
The current inference example code is based on AutoGPTQ. If you want higher inference speed, it is recommended to combine it with TensorRT-LLM (Early Access).
## Consistency Check
SHA256 values of the model-related files are provided here for consistency checks during download.
| File | SHA256 |
|-------------------------------:|:--------------------------------:|
|config.json | bd1b92f942549f76d7e02e65fd346b39903943912d6d6a2ff8ff345e43e1115b |
|generation_config.json | b625bd13a52d0685313c32919324b9bdc9e75a4f1338ca5c28226d1693e130a3 |
|gptq_model-4bit-64g.bin | 79441bad1d5ab852d0238ed7e113b9912f31189cf9181d7119dd297c4beb454a |
|pytorch_model.bin.index.json | 9a714170172282cfbcaa120af13c0df08b06d040ff24dab30229d8a010821d3d |
|quantize_config.json | 3c1744a928e9d6c3f9a2cbb1bb5a89539077e7d456948bf5aee0deed6a7b8028 |
|special_tokens_map.json | ff3b4a612c4e447acb02d40071bddd989fe0da87eb5b7fe0dbadfc4f74de7531 |
|tokenizer.json | f7b50bcf6d6672eade5e43514d48e9c1e4e63a56aef7b14acdaca94ce93436f7 |
|tokenizer.model | 9e556afd44213b6bd1be2b850ebbbd98f5481437a8021afaf58ee7fb1818d347 |
|tokenizer_config.json | c12441e82f2dce0baff87cf5948e82d6e9b51cc0b5266369c30c319fb771eeb2 |
## Join Us
We are the AI Native team of the Platform Technology Business Group at Ant Group, responsible for the intelligentization of Ant Group's platform engineering. Since its founding more than three years ago, the team has supported the intelligent upgrade of operations for Ant Group's cloud-computing infrastructure. The team's mission is to build algorithm services and platforms with a broad user base through world-class technical innovation and impact, supporting the delivery of internal and external products and businesses. Upholding an innovation-driven spirit, the team advances its technical influence while supporting business delivery: over the past three years it has published more than 20 papers at top conferences such as ICLR, NeurIPS, KDD, and ACL, won Ant Group's highest technology award T-Star twice and the group-level SuperMA award once. The open-source project CodeFuse has received 4K stars (as of February 2024), and our models have been downloaded more than 1.5 million times in total on Hugging Face and ModelScope.

We are looking for outstanding talents to join our team! If you want to grow your career in an environment full of energy, innovation, and a culture of excellence, check out our campus and experienced-hire openings and join us to create the next industry milestone.

Campus recruitment: https://hrrecommend.antgroup.com/guide.html?code=8uoP5mlus5DqQYbEEqcE2FD5JZH21MwvMUIb9mb6X3osXPuBraG54SyM8GL7

Experienced hires: https://talent.antgroup.com/off-campus-position?positionId=1933830