# Model Card for CodeFuse-CodeLlama-34B
## Model Description

CodeFuse-CodeLlama-34B is a 34B Code-LLM finetuned with QLoRA on multiple code tasks (600k instructions/answers) on top of the base model CodeLlama-34b-Python.
The context length of finetuning is 4K, while the model can be finetuned with a 16K context if necessary.
## News and Updates

CodeFuse-CodeLlama34B-MFT has achieved 74.4% pass@1 on HumanEval, which is SOTA at present.
## Code Community

Homepage: https://github.com/codefuse-ai (Please give us your support with a Star + Fork + Watch)
## Performance

| Model                       | HumanEval (pass@1) | Date    |
|-----------------------------|--------------------|---------|
| CodeFuse-CodeLlama-34B      | 74.4%              | 2023.9  |
| WizardCoder-Python-34B-V1.0 | 73.2%              | 2023.8  |
| GPT-4 (zero-shot)           | 67.0%              | 2023.3  |
| PanGu-Coder2 15B            | 61.6%              | 2023.8  |
| CodeLlama-34b-Python        | 53.7%              | 2023.8  |
| CodeLlama-34b               | 48.8%              | 2023.8  |
| GPT-3.5 (zero-shot)         | 48.1%              | 2022.11 |
| OctoCoder                   | 46.2%              | 2023.8  |
| StarCoder-15B               | 33.6%              | 2023.5  |
| LLaMA 2 70B (zero-shot)     | 29.9%              | 2023.7  |
## Requirements

- python>=3.8
- pytorch>=2.0.0
- transformers==4.32.0
- Sentencepiece
- CUDA 11.4
## Inference String Format

The inference string is a concatenated string formed by combining conversation data (system, human and bot contents) in the training data format. It is used as input during the inference process. Here is an example format of the concatenated string:

```
"""
<|role_start|>system<|role_end|>System instruction
<|role_start|>human<|role_end|>Human 1st round input
<|role_start|>bot<|role_end|>Bot 1st round output</s>
<|role_start|>human<|role_end|>Human 2nd round input
<|role_start|>bot<|role_end|>Bot 2nd round output</s>
...
...
...
<|role_start|>human<|role_end|>Human nth round input
<|role_start|>bot<|role_end|>{Bot output to be generated}</s>
"""
```
When applying inference, always make your input string end with "<|role_start|>bot<|role_end|>" to ask the model to generate answers.
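
For illustration, here is a minimal sketch of how one might assemble such a string from a multi-turn chat history. The helper name `build_inference_string` and the message format are our own conventions for this sketch, not part of the released code:

```python
# Hypothetical helper (not part of the released code): concatenate
# (role, content) pairs into the inference string format shown above.
# Roles are "system", "human" or "bot".
def build_inference_string(messages):
    parts = []
    for role, content in messages:
        parts.append(f"<|role_start|>{role}<|role_end|>{content}")
        if role == "bot":
            parts.append("</s>")  # completed bot turns end with the EOS token
    # End with the bot start tag so the model generates the next answer.
    parts.append("<|role_start|>bot<|role_end|>")
    return "".join(parts)

# Example: a single-turn request, equivalent to the Quickstart prompt below.
prompt = build_inference_string([
    ("human", "write a python function of quick sort."),
])
```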
## Quickstart

```bash
pip install -r requirements.txt
```
```python
import torch
from modelscope import AutoTokenizer, AutoModelForCausalLM, snapshot_download

# Download the model snapshot and load the tokenizer and model.
model_dir = snapshot_download('codefuse-ai/CodeFuse-CodeLlama-34B', revision='v1.0.0')
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True, use_fast=False, legacy=False)
tokenizer.padding_side = "left"
tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids("<unk>")
tokenizer.eos_token_id = tokenizer.convert_tokens_to_ids("</s>")
model = AutoModelForCausalLM.from_pretrained(model_dir, trust_remote_code=True,
                                             device_map='auto',
                                             torch_dtype=torch.bfloat16)

HUMAN_ROLE_START_TAG = "<|role_start|>human<|role_end|>"
BOT_ROLE_START_TAG = "<|role_start|>bot<|role_end|>"

# Build the inference string; it must end with the bot start tag.
text = f"{HUMAN_ROLE_START_TAG}write a python function of quick sort.{BOT_ROLE_START_TAG}"
inputs = tokenizer(text, return_tensors='pt', padding=True, add_special_tokens=False).to("cuda")
outputs = model.generate(
    inputs=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_new_tokens=512,
    top_p=0.95,
    temperature=0.1,
    do_sample=True,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id
)
# Decode only the newly generated tokens, skipping the prompt.
gen_text = tokenizer.batch_decode(outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(gen_text)
```
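
For multi-turn use, the same pattern extends naturally: close the first answer with `</s>` and append the next human turn. Below is a minimal sketch reusing the `tokenizer`, `model` and tag variables from above; the follow-up question is only an illustrative example:

```python
# Sketch: second-round inference. The first bot answer is terminated
# with </s> before the next human turn is appended.
first_answer = gen_text[0]
text2 = (f"{HUMAN_ROLE_START_TAG}write a python function of quick sort."
         f"{BOT_ROLE_START_TAG}{first_answer}</s>"
         f"{HUMAN_ROLE_START_TAG}now add type hints to it.{BOT_ROLE_START_TAG}")
inputs2 = tokenizer(text2, return_tensors='pt', padding=True, add_special_tokens=False).to("cuda")
outputs2 = model.generate(
    inputs=inputs2["input_ids"],
    attention_mask=inputs2["attention_mask"],
    max_new_tokens=512,
    top_p=0.95,
    temperature=0.1,
    do_sample=True,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id
)
print(tokenizer.batch_decode(outputs2[:, inputs2["input_ids"].shape[1]:], skip_special_tokens=True))
```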
## MD5

We notice that the files may be corrupted during the transfer process. Please check the MD5 values before use.
| Model File                       | MD5 Value                        |
|----------------------------------|----------------------------------|
| pytorch_model-00001-of-00007.bin | 8d544b1bcb3449934184d4141137329c |
| pytorch_model-00002-of-00007.bin | 9d5dbb30911e48a42fb6d0fcabb322a4 |
| pytorch_model-00003-of-00007.bin | b0d4aecee0457d9332005a187e1fffed |
| pytorch_model-00004-of-00007.bin | 5c7e002de5eab77d0194a2b0f6de0c24 |
| pytorch_model-00005-of-00007.bin | d22a511aa26b5b17117b665a877490ab |
| pytorch_model-00006-of-00007.bin | a5c28ac277fac07d16dd66537e54d109 |
| pytorch_model-00007-of-00007.bin | a967e2c6195477b7407089c0bffa2d53 |
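
As a convenience, the check can be scripted with Python's standard `hashlib` module. This is a minimal sketch; it assumes the shard files sit in the current working directory:

```python
import hashlib

# Expected checksums, copied from the table above.
EXPECTED = {
    "pytorch_model-00001-of-00007.bin": "8d544b1bcb3449934184d4141137329c",
    "pytorch_model-00002-of-00007.bin": "9d5dbb30911e48a42fb6d0fcabb322a4",
    "pytorch_model-00003-of-00007.bin": "b0d4aecee0457d9332005a187e1fffed",
    "pytorch_model-00004-of-00007.bin": "5c7e002de5eab77d0194a2b0f6de0c24",
    "pytorch_model-00005-of-00007.bin": "d22a511aa26b5b17117b665a877490ab",
    "pytorch_model-00006-of-00007.bin": "a5c28ac277fac07d16dd66537e54d109",
    "pytorch_model-00007-of-00007.bin": "a967e2c6195477b7407089c0bffa2d53",
}

def md5sum(path, chunk_size=1 << 20):
    # Stream the file in 1 MiB chunks so multi-GB shards fit in memory.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

for name, expected in EXPECTED.items():
    actual = md5sum(name)
    status = "OK" if actual == expected else f"MISMATCH (got {actual})"
    print(f"{name}: {status}")
```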
## Join Us

We are the AI Native team within the Platform Technology Business Group at Ant Group, responsible for the intelligentization of Ant Group's platform engineering. Established for over three years, the team has supported the intelligent upgrade of operations and maintenance for Ant Group's cloud computing infrastructure. Our mission is to build algorithm services and platforms with a broad user base through world-class technological innovation and impact, supporting the implementation of internal and external products and businesses. With innovation in our genes, we advance technological influence while supporting business implementation. Over the past three years we have published more than 20 papers at top conferences such as ICLR, NeurIPS, KDD and ACL; our innovative business results have won the Ant Group highest technology award T-Star twice and the Ant Group highest award SuperMA once. The open-source project CodeFuse has received 4K stars (as of February 2024), and our models have been downloaded more than 1.5 million times in total on Hugging Face and ModelScope.

We are looking for top talent to join our team! If you would like to develop your career in an environment full of energy, innovation and a culture of excellence, please take a look at our experienced-hire and campus-recruitment openings and join us to create the next industry milestone.

Campus recruitment: https://hrrecommend.antgroup.com/guide.html?code=8uoP5mlus5DqQYbEEqcE2FD5JZH21MwvMUIb9mb6X3osXPuBraG54SyM8GL7

Experienced hires: https://talent.antgroup.com/off-campus-position?positionId=1933830