Qwen2-7B-Instruct-GPTQ-Int8

Introduction

Qwen2 is the new series of Qwen large language models. For Qwen2, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters, including a Mixture-of-Experts model. This repo contains the instruction-tuned 7B Qwen2 model.

Compared with the state-of-the-art open-source language models, including the previously released Qwen1.5, Qwen2 has generally surpassed most open-source models and demonstrated competitiveness against proprietary models across a series of benchmarks targeting language understanding, language generation, multilingual capability, coding, mathematics, reasoning, etc.

Qwen2-7B-Instruct-GPTQ-Int8 supports a context length of up to 131,072 tokens, enabling the processing of extensive inputs. Please refer to the Processing Long Texts section below for detailed instructions on how to deploy Qwen2 for handling long texts.

For more details, please refer to our blog, GitHub, and Documentation.
Note: If you encounter RuntimeError: probability tensor contains either `inf`, `nan` or element < 0 during inference with transformers, we recommend deploying this model with vLLM.
Model Details

Qwen2 is a language model series including decoder language models of different model sizes. For each size, we release the base language model and the aligned chat model. It is based on the Transformer architecture with SwiGLU activation, attention QKV bias, group query attention, etc. Additionally, we have an improved tokenizer adaptive to multiple natural languages and codes.
Training details

We pretrained the models with a large amount of data, and we post-trained the models with both supervised finetuning and direct preference optimization.
Requirements

The code of Qwen2 has been in the latest Hugging Face transformers, and we advise you to install transformers>=4.37.0, or you might encounter the following error:

KeyError: 'qwen2'
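As a quick check, you can verify the installed version programmatically, for example with the following minimal sketch (it assumes only that transformers and the packaging package are importable):

from packaging import version
import transformers

# Qwen2 support requires transformers >= 4.37.0; older versions raise KeyError: 'qwen2'
assert version.parse(transformers.__version__) >= version.parse("4.37.0"), transformers.__version__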
Quickstart

Here we provide a code snippet with apply_chat_template to show you how to load the tokenizer and model and how to generate contents.

from modelscope import AutoModelForCausalLM, AutoTokenizer
device = "cuda"  # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained(
    "qwen/Qwen2-7B-Instruct-GPTQ-Int8",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("qwen/Qwen2-7B-Instruct-GPTQ-Int8")

prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
# Build the chat-formatted prompt string with the model's chat template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)
# Strip the prompt tokens so only the newly generated tokens are decoded
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
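If you would rather see tokens printed as they are generated, you can pass a streamer to generate. The sketch below is an optional variant: it reuses model, tokenizer, and model_inputs from the snippet above and uses transformers' TextStreamer.

from transformers import TextStreamer

# Print decoded tokens to stdout as soon as they are produced,
# skipping the prompt and any special tokens
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512,
    streamer=streamer
)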
Processing Long Texts

To handle extensive inputs exceeding 32,768 tokens, we utilize YARN, a technique for enhancing model length extrapolation, ensuring optimal performance on lengthy texts.

For deployment, we recommend using vLLM. You can enable the long-context capabilities by following these steps:

1. Install vLLM:

pip install "vllm>=0.4.3"
Or you can install vLLM from source.

2. After downloading the model weights, modify the config.json file by including the below snippet:

{
    "architectures": [
        "Qwen2ForCausalLM"
    ],
    // ...
    "vocab_size": 152064,

    // adding the following snippets
    "rope_scaling": {
        "factor": 4.0,
        "original_max_position_embeddings": 32768,
        "type": "yarn"
    }
}

This snippet enables YARN to support longer contexts.
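If you prefer to apply this change with a script instead of editing the file by hand, a minimal sketch is shown below; the path is a placeholder for wherever you downloaded the weights.

import json

# Placeholder path: point this at the config.json inside your downloaded weights directory
config_path = "path/to/weights/config.json"

with open(config_path) as f:
    config = json.load(f)

# Add the YARN rope scaling settings shown in the snippet above
config["rope_scaling"] = {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
}

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)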
3. Deploy the model. For instance, you can launch an OpenAI-compatible API server with:

python -m vllm.entrypoints.openai.api_server --served-model-name Qwen2-7B-Instruct-GPTQ-Int8 --model path/to/weights

Then you can access the Chat API by:
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "Qwen2-7B-Instruct-GPTQ-Int8",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Your Long Input Here."}
    ]
    }'
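The same endpoint can also be called from Python. The sketch below uses the openai client package (version 1.x assumed) pointed at the local vLLM server; the api_key value is arbitrary because vLLM does not check it by default.

from openai import OpenAI

# Point the OpenAI-compatible client at the local vLLM server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="Qwen2-7B-Instruct-GPTQ-Int8",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Your Long Input Here."}
    ]
)
print(completion.choices[0].message.content)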
For further usage instructions of vLLM, please refer to our GitHub.

We advise adding the rope_scaling configuration only when processing long contexts is required.

Benchmark and Speed
To compare the generation performance between bfloat16 (bf16) and quantized models such as GPTQ-Int8, GPTQ-Int4, and AWQ, please consult our Benchmark of Quantized Models. This benchmark provides insights into how different quantization techniques affect model performance.

For those interested in understanding the inference speed and memory consumption when deploying these models with either transformers or vLLM, we have compiled an extensive Speed Benchmark.

Citation

If you find our work helpful, feel free to give us a cite.
@article{qwen2,
  title={Qwen2 Technical Report},
  year={2024}
}