## Introduction

Qwen1.5 is the beta version of Qwen2, a transformer-based decoder-only language model pretrained on a large amount of data. In comparison with the previously released Qwen, the improvements include:

* 8 model sizes, including 0.5B, 1.8B, 4B, 7B, 14B, 32B and 72B dense models, and an MoE model of 14B with 2.7B activated;
* Significant performance improvement in Chat models;
* Multilingual support of both base and chat models;
* Stable support of 32K context length for models of all sizes;
* No need of `trust_remote_code` (see the quickstart sketch after this list).

For more details, please refer to our blog post and GitHub repo.
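Because no remote code is required, a plain `transformers` quickstart is enough to try the aligned chat models. The following is a minimal sketch, assuming the `Qwen/Qwen1.5-1.8B-Chat` checkpoint (the chat counterpart of the base model quantized further down) and an environment with `transformers>=4.37.0` plus `accelerate` for `device_map="auto"`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# No trust_remote_code is needed: the "qwen2" architecture ships with transformers>=4.37.0.
model_id = "Qwen/Qwen1.5-1.8B-Chat"  # assumed chat checkpoint for this sketch
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Build a chat prompt with the model's chat template.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Give me a short introduction to large language models."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer([text], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)

# Strip the prompt tokens before decoding the reply.
response = tokenizer.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```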
## Model Details

Qwen1.5 is a language model series including decoder language models of different model sizes. For each size, we release the base language model and the aligned chat model. It is based on the Transformer architecture with SwiGLU activation, attention QKV bias, group query attention, mixture of sliding window attention and full attention, etc. Additionally, we have an improved tokenizer adaptive to multiple natural languages and codes. For the beta version, temporarily we did not include GQA (except for 32B) and the mixture of SWA and full attention.
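These architecture choices can be checked programmatically from the Hugging Face config of each checkpoint. The snippet below is a small illustration, assuming the `Qwen/Qwen1.5-1.8B` base model used later in the quantization example; the field names follow the standard `Qwen2Config` in `transformers`:

```python
from transformers import AutoConfig

# Load the config of the 1.8B base model (any Qwen1.5 size works the same way).
config = AutoConfig.from_pretrained("Qwen/Qwen1.5-1.8B")

print(config.model_type)               # "qwen2"
print(config.hidden_act)               # "silu", i.e. the SwiGLU activation
print(config.num_attention_heads)      # number of attention heads
print(config.num_key_value_heads)      # equals num_attention_heads here: no GQA for this size
print(config.max_position_embeddings)  # supported context length
```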
## Requirements

The code of Qwen1.5 has been merged into the latest Hugging Face transformers, and we advise you to install transformers>=4.37.0, or you might encounter the error `KeyError: 'qwen2'`.
The example below quantizes the base model to 4-bit GPTQ with AutoGPTQ and then runs inference with the quantized checkpoint:

```python
from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import logging

logging.basicConfig(
    format="%(asctime)s %(levelname)s [%(name)s] %(message)s", level=logging.INFO, datefmt="%Y-%m-%d %H:%M:%S"
)

pretrained_model_dir = "Qwen/Qwen1.5-1.8B"
quantized_model_dir = "local"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
examples = [
    tokenizer(
        "auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."
    )
]

quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize model to 4-bit
    group_size=128,  # it is recommended to set the value to 128
    desc_act=False,  # set to False can significantly speed up inference, but the perplexity may be slightly worse
)

# load un-quantized model; by default, the model will always be loaded into CPU memory
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)

# quantize model; the examples should be a list of dicts whose keys can only be "input_ids" and "attention_mask"
model.quantize(examples)

# save quantized model
model.save_quantized(quantized_model_dir)

# save quantized model using safetensors
model.save_quantized(quantized_model_dir, use_safetensors=True)

# push quantized model to Hugging Face Hub.
# to use use_auth_token=True, login first via huggingface-cli login,
# or pass an explicit token with: use_auth_token="hf_xxxxxxx"
# (uncomment the following three lines to enable this feature)
# repo_id = f"YourUserName/{quantized_model_dir}"
# commit_message = f"AutoGPTQ model for {pretrained_model_dir}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
# model.push_to_hub(repo_id, commit_message=commit_message, use_auth_token=True)

# alternatively you can save and push at the same time
# (uncomment the following three lines to enable this feature)
# repo_id = f"YourUserName/{quantized_model_dir}"
# commit_message = f"AutoGPTQ model for {pretrained_model_dir}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
# model.push_to_hub(repo_id, save_dir=quantized_model_dir, use_safetensors=True, commit_message=commit_message, use_auth_token=True)

# load quantized model to the first GPU
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0")

# download quantized model from Hugging Face Hub and load to the first GPU
# model = AutoGPTQForCausalLM.from_quantized(repo_id, device="cuda:0", use_safetensors=True, use_triton=False)

# inference with model.generate
print(tokenizer.decode(model.generate(**tokenizer("auto_gptq is", return_tensors="pt").to(model.device))[0]))

# or you can also use pipeline
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
print(pipeline("auto-gptq is")[0]["generated_text"])
```
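One practical note on the `model.quantize(examples)` step above: a single short sentence gives GPTQ very little calibration signal. The sketch below, which reuses the `tokenizer` defined in the script, builds a slightly larger `examples` list to substitute at that step; the texts are placeholders rather than an official calibration set, and each entry must still contain only `input_ids` and `attention_mask`:

```python
# A slightly larger (still toy) calibration set for GPTQ.
calibration_texts = [
    "Qwen1.5 is a transformer-based decoder-only language model.",
    "GPTQ quantizes the weights group by group down to 4 bits.",
    "The quick brown fox jumps over the lazy dog.",
]

# Each tokenizer(...) output is dict-like with "input_ids" and "attention_mask",
# which is exactly the format expected by model.quantize(examples) in the script above.
examples = [tokenizer(text) for text in calibration_texts]
```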