CodeFuse-CodeLlama-34B

Anonymous user · July 31, 2024

Technical Information

Official website
https://github.com/codefuse-ai
Open-source repository
https://modelscope.cn/models/codefuse-ai/CodeFuse-CodeLlama-34B
License
other

Details

Model Card for CodeFuse-CodeLlama-34B

[中文] [English]

Model Description

CodeFuse-CodeLlama-34B is a 34B Code-LLM finetuned via QLoRA on multiple code tasks (600k instructions/answers) on top of the base model CodeLlama-34b-Python. The finetuning context length is 4K, and the model can be finetuned with a 16k context if necessary.

News and Updates

CodeFuse-CodeLlama-34B has achieved 74.4% pass@1 on HumanEval, which is SOTA at present.


Code Community

Homepage: https://github.com/codefuse-ai (Please give us your support with a Star + Fork + Watch)

  • If you wish to finetune the model yourself, you can visit MFTCoder
  • If you wish to deploy the model yourself, you can visit FasterTransformer4CodeFuse
  • If you wish to see a demo of the model, you can visit CodeFuse Demo

Performance

Model                        HumanEval(pass@1)  Date
CodeFuse-CodeLlama-34B       74.4%              2023.9
WizardCoder-Python-34B-V1.0  73.2%              2023.8
GPT-4 (zero-shot)            67.0%              2023.3
PanGu-Coder2 15B             61.6%              2023.8
CodeLlama-34b-Python         53.7%              2023.8
CodeLlama-34b                48.8%              2023.8
GPT-3.5 (zero-shot)          48.1%              2022.11
OctoCoder                    46.2%              2023.8
StarCoder-15B                33.6%              2023.5
LLaMA 2 70B (zero-shot)      29.9%              2023.7


Requirements

  • python>=3.8
  • pytorch>=2.0.0
  • transformers==4.32.0
  • Sentencepiece
  • CUDA 11.4
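Since version mismatches (especially the pinned transformers release) are a common failure mode, the pinned requirements can be checked at runtime. A stdlib-only sketch follows; `check_requirements` and `version_tuple` are hypothetical helper names, and in a real environment you would pass `platform.python_version()`, `torch.__version__`, and `transformers.__version__` instead of the placeholder literals:

```python
# Stdlib-only version check (illustrative sketch; in practice pass the real
# platform.python_version(), torch.__version__ and transformers.__version__).
def version_tuple(v: str) -> tuple:
    """Parse '2.1.0+cu118' -> (2, 1, 0), ignoring local build suffixes."""
    return tuple(int(part) for part in v.split("+")[0].split(".")[:3])

def check_requirements(python_v: str, torch_v: str, transformers_v: str) -> list:
    """Return a list of human-readable problems; empty means the pins are met."""
    problems = []
    if version_tuple(python_v) < (3, 8):
        problems.append("python>=3.8 required")
    if version_tuple(torch_v) < (2, 0, 0):
        problems.append("pytorch>=2.0.0 required")
    if version_tuple(transformers_v) != (4, 32, 0):
        problems.append("transformers==4.32.0 required (pinned)")
    return problems

print(check_requirements("3.10.12", "2.1.0+cu118", "4.32.0"))  # -> []
```

Tuple comparison avoids the pitfalls of comparing version strings lexicographically (e.g. "2.10" vs "2.2").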

Inference String Format

The inference string is a concatenation of conversation data (system, human, and bot contents) in the training data format. It is used as the input during inference. Here is an example of the concatenated string:

"""
<|role_start|>system<|role_end|>System instruction
<|role_start|>human<|role_end|>Human 1st round input
<|role_start|>bot<|role_end|>Bot 1st round output</s>
<|role_start|>human<|role_end|>Human 2nd round input
<|role_start|>bot<|role_end|>Bot 2nd round output</s>
...
...
...
<|role_start|>human<|role_end|>Human nth round input
<|role_start|>bot<|role_end|>{Bot output to be generated}</s>
"""

When running inference, always make your input string end with "<|role_start|>bot<|role_end|>" to ask the model to generate an answer.
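The concatenation rule above can be sketched as a small helper. The role tags come from the format shown; `build_prompt` is a hypothetical name, not part of the official repo, and turns are joined without separators to match the Quickstart example below:

```python
# Build the inference prompt from (role, content) turns, following the
# training format: each turn is wrapped in role tags, each completed bot
# turn ends with </s>, and the prompt ends with the bot role tag so the
# model continues with an answer.
ROLE_START = "<|role_start|>"
ROLE_END = "<|role_end|>"

def build_prompt(turns):
    """turns: list of (role, content) pairs, role in {'system', 'human', 'bot'}."""
    parts = []
    for role, content in turns:
        segment = f"{ROLE_START}{role}{ROLE_END}{content}"
        if role == "bot":
            segment += "</s>"  # close each completed bot turn
        parts.append(segment)
    # Trailing bot tag asks the model to generate the next answer.
    parts.append(f"{ROLE_START}bot{ROLE_END}")
    return "".join(parts)

prompt = build_prompt([("human", "write a python function of quick sort.")])
print(prompt)
# <|role_start|>human<|role_end|>write a python function of quick sort.<|role_start|>bot<|role_end|>
```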

Quickstart

pip install -r requirements.txt
import torch
from modelscope import AutoTokenizer, AutoModelForCausalLM, snapshot_download


model_dir = snapshot_download('codefuse-ai/CodeFuse-CodeLlama-34B', revision='v1.0.0')
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True, use_fast=False, legacy=False)
tokenizer.padding_side = "left"
tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids("<unk>")
tokenizer.eos_token_id = tokenizer.convert_tokens_to_ids("</s>")
model = AutoModelForCausalLM.from_pretrained(model_dir, trust_remote_code=True, 
                                             device_map='auto', 
                                             torch_dtype=torch.bfloat16)

HUMAN_ROLE_START_TAG = "<|role_start|>human<|role_end|>"
BOT_ROLE_START_TAG = "<|role_start|>bot<|role_end|>"

text = f"{HUMAN_ROLE_START_TAG}write a python function of quick sort.{BOT_ROLE_START_TAG}"
inputs = tokenizer(text, return_tensors='pt', padding=True, add_special_tokens=False).to("cuda")
outputs = model.generate(
        inputs=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=512,
        top_p=0.95,
        temperature=0.1,
        do_sample=True,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id
    )
gen_text = tokenizer.batch_decode(outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(gen_text)

MD5

We have noticed that the model files may be corrupted during the transfer process. Please check the MD5 values before use.

Model File                          MD5 Value
pytorch_model-00001-of-00007.bin    8d544b1bcb3449934184d4141137329c
pytorch_model-00002-of-00007.bin    9d5dbb30911e48a42fb6d0fcabb322a4
pytorch_model-00003-of-00007.bin    b0d4aecee0457d9332005a187e1fffed
pytorch_model-00004-of-00007.bin    5c7e002de5eab77d0194a2b0f6de0c24
pytorch_model-00005-of-00007.bin    d22a511aa26b5b17117b665a877490ab
pytorch_model-00006-of-00007.bin    a5c28ac277fac07d16dd66537e54d109
pytorch_model-00007-of-00007.bin    a967e2c6195477b7407089c0bffa2d53

Model Description

CodeFuse-CodeLlama-34B-MFT is a large code model finetuned from the base model CodeLlama-34b-Python via QLoRA on multiple code tasks. Finetuning used a 4k context, which can be extended to 16k if necessary.

News

CodeFuse-CodeLlama-34B-MFT reaches 74.4% pass@1 on HumanEval, the current open-source SOTA.


Code Community

Homepage: https://github.com/codefuse-ai (please support us with a Star + Fork + Watch)

Performance (Code)

Model                        HumanEval(pass@1)  Date
CodeFuse-CodeLlama-34B       74.4%              2023.9
WizardCoder-Python-34B-V1.0  73.2%              2023.8
GPT-4 (zero-shot)            67.0%              2023.3
PanGu-Coder2 15B             61.6%              2023.8
CodeLlama-34b-Python         53.7%              2023.8
CodeLlama-34b                48.8%              2023.8
GPT-3.5 (zero-shot)          48.1%              2022.11
OctoCoder                    46.2%              2023.8
StarCoder-15B                33.6%              2023.5
LLaMA 2 70B (zero-shot)      29.9%              2023.7

Requirements

  • python>=3.8
  • pytorch>=2.0.0
  • transformers==4.32.0
  • CUDA 11.4

Inference Data Format

The inference data is the string concatenation of conversation data in the training format; it is also how the input prompt should be concatenated at inference time:

"""
<|role_start|>system<|role_end|>System instruction
<|role_start|>human<|role_end|>User input of round 1
<|role_start|>bot<|role_end|>Model output of round 1</s>
<|role_start|>human<|role_end|>User input of round 2
<|role_start|>bot<|role_end|>Model output of round 2</s>
...
...
...
<|role_start|>human<|role_end|>User input of round n
<|role_start|>bot<|role_end|>{Content for the model to generate now}</s>
"""

At inference time, make sure the concatenated prompt string ends with "<|role_start|>bot<|role_end|>" to guide the model to generate an answer.

Quickstart

import torch
from modelscope import AutoTokenizer, AutoModelForCausalLM, snapshot_download


model_dir = snapshot_download('codefuse-ai/CodeFuse-CodeLlama-34B', revision='v1.0.0')
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True, use_fast=False, legacy=False)
tokenizer.padding_side = "left"
tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids("<unk>")
tokenizer.eos_token_id = tokenizer.convert_tokens_to_ids("</s>")
model = AutoModelForCausalLM.from_pretrained(model_dir, trust_remote_code=True, 
                                             device_map='auto', 
                                             torch_dtype=torch.bfloat16)

HUMAN_ROLE_START_TAG = "<|role_start|>human<|role_end|>"
BOT_ROLE_START_TAG = "<|role_start|>bot<|role_end|>"

text = f"{HUMAN_ROLE_START_TAG}write a python function of quick sort.{BOT_ROLE_START_TAG}"
inputs = tokenizer(text, return_tensors='pt', padding=True, add_special_tokens=False).to("cuda")
outputs = model.generate(
        inputs=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=512,
        top_p=0.95,
        temperature=0.1,
        do_sample=True,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id
    )
gen_text = tokenizer.batch_decode(outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(gen_text)

MD5

We have found that the model files may be corrupted during transfer; please check the file MD5 values before use.

Model File                          MD5 Value
pytorch_model-00001-of-00007.bin    8d544b1bcb3449934184d4141137329c
pytorch_model-00002-of-00007.bin    9d5dbb30911e48a42fb6d0fcabb322a4
pytorch_model-00003-of-00007.bin    b0d4aecee0457d9332005a187e1fffed
pytorch_model-00004-of-00007.bin    5c7e002de5eab77d0194a2b0f6de0c24
pytorch_model-00005-of-00007.bin    d22a511aa26b5b17117b665a877490ab
pytorch_model-00006-of-00007.bin    a5c28ac277fac07d16dd66537e54d109
pytorch_model-00007-of-00007.bin    a967e2c6195477b7407089c0bffa2d53

Join Us

We are the AI Native team of the Platform Technology Business Group at Ant Group, responsible for the intelligentization of the group's platform engineering. Since its founding more than three years ago, the team has supported the intelligent upgrade of operations and maintenance for Ant Group's cloud computing infrastructure. Our mission is to build algorithm services and platforms with a broad user base through world-class technical innovation and impact, supporting both internal and external products and businesses. Driven by innovation, the team has published more than 20 papers at top conferences such as ICLR, NeurIPS, KDD, and ACL while delivering business results, winning Ant Group's highest technology award, T-Star, twice and the group's highest award, SuperMA, once. The open-source project CodeFuse has earned 4K stars (as of February 2024), and our models have been downloaded more than 1.5 million times in total on Hugging Face and ModelScope.

We are looking for top talent in the industry to join our team! If you would like to grow your career in an environment full of energy, innovation, and a culture of excellence, check out our openings for experienced and campus hires, join us, and create the next industry milestone together.

Campus recruitment: https://hrrecommend.antgroup.com/guide.html?code=8uoP5mlus5DqQYbEEqcE2FD5JZH21MwvMUIb9mb6X3osXPuBraG54SyM8GL7

Experienced hires: https://talent.antgroup.com/off-campus-position?positionId=1933830

