Qwen2-7B-Instruct-GGUF

Introduction

Qwen2 is the new series of Qwen large language models. For Qwen2, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters, including a Mixture-of-Experts model. This repo contains the instruction-tuned 7B Qwen2 model.

Compared with state-of-the-art open-source language models, including the previously released Qwen1.5, Qwen2 has generally surpassed most open-source models and demonstrated competitiveness against proprietary models across a series of benchmarks targeting language understanding, language generation, multilingual capability, coding, mathematics, reasoning, etc.

For more details, please refer to our blog, GitHub, and Documentation.

In this repo, we provide the `fp16` model and quantized models in the GGUF format, including `q5_0`, `q5_k_m`, `q6_k` and `q8_0`.

Model Details

Qwen2 is a language model series including decoder language models of different model sizes. For each size, we release the base language model and the aligned chat model. It is based on the Transformer architecture with SwiGLU activation, attention QKV bias, group query attention, etc. Additionally, we have an improved tokenizer adaptive to multiple natural languages and code.

Training details

We pretrained the models with a large amount of data, and we post-trained the models with both supervised finetuning and direct preference optimization.

Requirements

We advise you to clone `llama.cpp` and install it following the official guide. We follow the latest version of llama.cpp. In the following demonstration, we assume that you are running commands under the repository `llama.cpp`.

How to use

Cloning the repo may be inefficient, so you can manually download the GGUF file that you need or use the `modelscope` CLI (`pip install modelscope`) as shown below:

    modelscope download --model=qwen/Qwen2-7B-Instruct-GGUF --local_dir . qwen2-7b-instruct-q5_k_m.gguf

To run Qwen2, you can use `llama-cli` (previously `main`) or `llama-server` (previously `server`).

We recommend using `llama-server` as it is simple and compatible with the OpenAI API. For example:

    ./llama-server -m qwen2-7b-instruct-q5_k_m.gguf -ngl 28 -fa

(Note: `-ngl 28` refers to offloading 28 layers to GPUs, and `-fa` refers to the use of flash attention.)

Then it is easy to access the deployed service with the OpenAI API:

    import openai

    # Point the client at the local llama-server instead of the OpenAI service.
    client = openai.OpenAI(
        base_url="http://localhost:8080/v1",  # "http://<Your api-server IP>:port"
        api_key="sk-no-key-required",
    )

    completion = client.chat.completions.create(
        model="qwen",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "tell me something about michael jordan"},
        ],
    )
    print(completion.choices[0].message.content)

If you choose to use `llama-cli`, note that the `-cml` option for the ChatML template has been removed. Instead, you should use `--in-prefix` and `--in-suffix` to tackle this problem.

    ./llama-cli -m qwen2-7b-instruct-q5_k_m.gguf \
      -n 512 -co -i -if -f prompts/chat-with-qwen.txt \
      --in-prefix "<|im_start|>user\n" \
      --in-suffix "<|im_end|>\n<|im_start|>assistant\n" \
      -ngl 24 -fa
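
The `--in-prefix` and `--in-suffix` values in the `llama-cli` command wrap each user turn in ChatML markers. As a rough illustration of the resulting prompt layout, here is a minimal Python sketch. The helper name `format_chatml` is hypothetical, and the template is inferred only from the prefix and suffix strings shown in that command; the authoritative chat template is the one shipped with the model.

```python
def format_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML-style prompt.

    Assumption: each turn is "<|im_start|>{role}\n{content}<|im_end|>\n",
    which is what the --in-prefix / --in-suffix strings imply.
    """
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = format_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "tell me something about michael jordan"},
])
print(prompt)
```

With this layout, the text between the user message and the model's reply is exactly the `--in-suffix` string, which is why the two flags can stand in for the removed `-cml` option.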

Evaluation

We implement perplexity evaluation using wikitext following the practice of `llama.cpp` with `./llama-perplexity` (previously `./perplexity`). Below, we report the PPL of GGUF models of different sizes and different quantization levels.

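| Size     | fp16  | q8_0  | q6_k  | q5_k_m | q5_0  | q4_k_m | q4_0  | q3_k_m | q2_k  | iq1_m |
|----------|-------|-------|-------|--------|-------|--------|-------|--------|-------|-------|
| 0.5B     | 15.11 | 15.13 | 15.14 | 15.24  | 15.40 | 15.36  | 16.28 | 15.70  | 16.74 | -     |
| 1.5B     | 10.43 | 10.43 | 10.45 | 10.50  | 10.56 | 10.61  | 10.79 | 11.08  | 13.04 | -     |
| 7B       | 7.93  | 7.94  | 7.96  | 7.97   | 7.98  | 8.02   | 8.19  | 8.20   | 10.58 | -     |
| 57B-A14B | 6.81  | 6.81  | 6.83  | 6.84   | 6.89  | 6.99   | 7.02  | 7.43   | -     | -     |
| 72B      | 5.58  | 5.58  | 5.59  | 5.59   | 5.60  | 5.61   | 5.66  | 5.68   | 5.91  | 6.75  |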
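
To make the PPL figures easier to interpret, the following Python sketch shows the standard definition of perplexity (the exponential of the mean negative log-likelihood per token) and how one might compare a quantized model against `fp16`. It only illustrates the definition: the helper names and the sample log-probabilities are hypothetical, and `llama-perplexity` evaluates the dataset in context-sized chunks rather than from a ready-made list like this.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_increase(ppl_quantized, ppl_reference):
    """Fractional PPL increase of a quantized model over a reference model."""
    return (ppl_quantized - ppl_reference) / ppl_reference


# Hypothetical log-probabilities: every token predicted with probability 0.25,
# which is as uncertain as a uniform choice among 4 tokens.
uniform_logprobs = [math.log(0.25)] * 8
ppl_uniform = perplexity(uniform_logprobs)

# Values taken from the 7B row of the table: fp16 = 7.93, q5_k_m = 7.97, q2_k = 10.58.
inc_q5_k_m = relative_increase(7.97, 7.93)
inc_q2_k = relative_increase(10.58, 7.93)
print(f"q5_k_m: {inc_q5_k_m:+.1%}, q2_k: {inc_q2_k:+.1%}")
```

For the 7B model this gives an increase of about 0.5% for `q5_k_m` and about 33% for `q2_k` relative to `fp16`, which is one way to weigh file size against quality when picking a quantization level.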

Citation

If you find our work helpful, feel free to cite it.

    @article{qwen2,
      title={Qwen2 Technical Report},
      year={2024}
    }