INT4 Weight-only Quantization and Deployment (W4A16)

LMDeploy adopts the AWQ algorithm for 4-bit weight-only quantization. Backed by a high-performance CUDA kernel, inference with the 4-bit quantized model is up to 2.4x faster than FP16.

LMDeploy supports the following NVIDIA GPUs for W4A16 inference:
Turing (sm75): 20 series, T4
Ampere (sm80, sm86): 30 series, A10, A16, A30, A100
Ada Lovelace (sm89): 40 series

Before proceeding with quantization and inference, please ensure that lmdeploy is installed:

pip install lmdeploy[all]

This article comprises the following sections: Inference, Evaluation, and Service.
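If you want to confirm that the installation succeeded and that your GPU falls within the supported ranges above, a minimal check is sketched below. It assumes PyTorch is available (lmdeploy depends on it) and that the package exposes a __version__ attribute:

import lmdeploy
import torch

# print the installed lmdeploy version
print("lmdeploy:", lmdeploy.__version__)
# compute capability, e.g. (7, 5) for Turing, (8, 0)/(8, 6) for Ampere, (8, 9) for Ada Lovelace
print("compute capability:", torch.cuda.get_device_capability())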
Inference
Please download the internlm2-chat-20b-4bits model as follows:

git-lfs install
git clone --depth=1 https://www.modelscope.cn/Shanghai_AI_Laboratory/internlm2-chat-20b-4bits.git
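Alternatively, if you prefer a programmatic download instead of git, the ModelScope Python SDK provides snapshot_download. This is an optional substitute for the commands above, assuming the modelscope package is installed:

# pip install modelscope
from modelscope import snapshot_download

# downloads the 4-bit weights and returns the local directory,
# which can be passed to pipeline() in place of "./internlm2-chat-20b-4bits"
model_dir = snapshot_download('Shanghai_AI_Laboratory/internlm2-chat-20b-4bits')
print(model_dir)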
With the following code, you can perform batched offline inference with the quantized model:

from lmdeploy import pipeline, TurbomindEngineConfig

# model_format='awq' tells the TurboMind engine to load the AWQ 4-bit weights
engine_config = TurbomindEngineConfig(model_format='awq')
pipe = pipeline("./internlm2-chat-20b-4bits", backend_config=engine_config)
# the pipeline accepts a batch of prompts
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)

For more information about the pipeline parameters, please refer to here.
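As an illustration of those parameters, the engine config accepts additional fields and each call can take a GenerationConfig. The specific values below are arbitrary examples, not recommendations:

from lmdeploy import GenerationConfig, TurbomindEngineConfig, pipeline

# engine-level options: model_format selects the AWQ 4-bit weights;
# session_len and cache_max_entry_count are illustrative values
engine_config = TurbomindEngineConfig(model_format='awq',
                                      session_len=4096,
                                      cache_max_entry_count=0.5)
# sampling options applied per request
gen_config = GenerationConfig(max_new_tokens=256, top_p=0.8, temperature=0.7)

pipe = pipeline("./internlm2-chat-20b-4bits", backend_config=engine_config)
response = pipe(["Hi, pls intro yourself", "Shanghai is"], gen_config=gen_config)
print(response)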
Evaluation

Please refer to this guide about model evaluation with LMDeploy.
Service
LMDeploy's api_server enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup:

lmdeploy serve api_server ./internlm2-chat-20b-4bits --backend turbomind --model-format awq
The default port of api_server is 23333. After the server is launched, you can communicate with the server in the terminal through api_client:

lmdeploy serve api_client http://0.0.0.0:23333
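Because the RESTful APIs are OpenAI-compatible, any HTTP client can talk to the launched service as well. The sketch below uses the requests library; the model name in the payload is an assumption, so list the actually served names via the /v1/models endpoint first:

import requests

base_url = "http://0.0.0.0:23333"

# list the model names registered on the server
print(requests.get(f"{base_url}/v1/models").json())

# OpenAI-compatible chat completion; replace "internlm2-chat-20b-4bits"
# with a name returned by /v1/models if it differs
payload = {
    "model": "internlm2-chat-20b-4bits",
    "messages": [{"role": "user", "content": "Hi, pls intro yourself"}],
}
print(requests.post(f"{base_url}/v1/chat/completions", json=payload).json())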
You can overview and try out the api_server APIs online through the Swagger UI at http://0.0.0.0:23333, or you can also read the API specification from here.