LMDeploy supports LLM inference with 4-bit weights. The minimum requirement for NVIDIA graphics cards is sm80, e.g. A10, A100, and the GeForce 30/40 series.

Before proceeding with the inference of `internlm-chat-20b-4bit`, please ensure that lmdeploy is installed:

```shell
pip install 'lmdeploy>=0.0.11'
```
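To confirm the sm80 requirement on your machine, you can query the CUDA compute capability, for example with PyTorch. This check is a suggestion of ours, not part of lmdeploy; sm80 corresponds to compute capability (8, 0):

```python
import torch

# sm80 == compute capability (8, 0). A100 is sm80; A10 and the
# GeForce 30 series are sm86; the GeForce 40 series is sm89.
major, minor = torch.cuda.get_device_capability(0)
print(f"GPU 0 is sm{major}{minor}")
assert (major, minor) >= (8, 0), "4-bit inference requires sm80 or newer"
```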
## Inference
Please download the `internlm-chat-20b-4bit` model as follows:

```shell
git-lfs install
git clone --depth=1 https://www.modelscope.cn/Shanghai_AI_Laboratory/internlm-chat-20b-4bit.git
```
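The large weight shards are fetched through git-lfs, and a common failure mode is ending up with tiny pointer stubs instead of real files. A small sanity check (our own heuristic; no file names inside the repository are assumed):

```python
import pathlib

# LFS pointer stubs are tiny text files; real weight shards are GBs.
for p in pathlib.Path("./internlm-chat-20b-4bit").rglob("*"):
    if p.is_file() and p.stat().st_size < 200:
        if p.read_bytes()[:40].startswith(b"version https://git-lfs"):
            print("LFS pointer not fetched:", p)
```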
As demonstrated in the commands below, first convert the model's layout with `turbomind.deploy`, and then you can interact with the AI assistant in the terminal:

```shell
# Convert the model's layout and store it in the default path, ./workspace.
python3 -m lmdeploy.serve.turbomind.deploy \
    --model-name internlm-chat-20b \
    --model-path ./internlm-chat-20b-4bit \
    --model-format awq \
    --group-size 128
# inference
python3 -m lmdeploy.turbomind.chat ./workspace
```
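The `--model-format awq --group-size 128` pair describes how the 4-bit weights are stored: every group of 128 weights shares one scale and zero point. The sketch below illustrates plain group-wise 4-bit quantization only; real AWQ additionally rescales salient channels using activation statistics, which is not reproduced here:

```python
import numpy as np

def quant_dequant_4bit(w, group_size=128):
    """Asymmetric 4-bit group quantization followed by de-quantization."""
    g = w.reshape(-1, group_size)                    # one row per group
    w_min = g.min(axis=1, keepdims=True)
    w_max = g.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / 15.0                   # 16 levels: 0..15
    zero = np.round(-w_min / scale)
    q = np.clip(np.round(g / scale) + zero, 0, 15)   # the stored 4-bit codes
    return ((q - zero) * scale).reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((4096, 128)).astype(np.float32)
print("max abs reconstruction error:", np.abs(w - quant_dequant_4bit(w)).max())
```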
## Serve with gradio
If you wish to interact with the model via web UI, please initiate the gradio server as indicated below:

```shell
python3 -m lmdeploy.serve.gradio.app ./workspace --server_name {ip_addr} --server_port {port}
```
Subsequently, you can open the website `http://{ip_addr}:{port}` in your browser and interact with the model.

Besides serving with gradio, there are two more serving methods. One is serving with Triton Inference Server (TIS), and the other is an OpenAI-like server named `api_server`. Please refer to the user guide for detailed information if you are interested.
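As a taste of the `api_server` route, an OpenAI-style chat request can be issued over HTTP once the server is running. The host, port, endpoint path, and payload below follow the common OpenAI convention and are assumptions rather than lmdeploy's documented schema; the user guide has the authoritative details:

```python
import requests

# Hypothetical address; endpoint path and payload follow the OpenAI
# chat-completions convention and may differ in lmdeploy's api_server.
resp = requests.post(
    "http://0.0.0.0:23333/v1/chat/completions",
    json={
        "model": "internlm-chat-20b",
        "messages": [{"role": "user", "content": "Hello!"}],
    },
)
print(resp.json())
```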
## Inference Performance
LMDeploy provides scripts for benchmarking `token throughput` and `request throughput`. `token throughput` tests the speed of generating new tokens, given a specified number of prompt tokens and completion tokens, while `request throughput` measures the number of requests processed per minute with real dialogue data.

We conducted benchmarks on `internlm-chat-20b-4bit`. The `token_throughput` was measured by setting 256 prompt tokens and generating 512 tokens in response on an A100-80G.

Note: `session_len` in `workspace/triton_models/weights/config.ini` was changed to 2056 in our test.
| batch | tensor parallel | prompt_tokens | completion_tokens | thr_per_proc (token/s) | rpm (req/min) | mem_per_proc (GB) |
| ----- | --------------- | ------------- | ----------------- | ---------------------- | ------------- | ----------------- |
| 1     | 1               | 256           | 512               | 88.77                  | -             | 15.65             |
| 16    | 1               | 256           | 512               | 792.7                  | 220.23        | 51.46             |
### token throughput

Run the following command:

```shell
python benchmark/profile_generation.py \
  --model-path ./workspace \
  --concurrency 1 8 16 --prompt-tokens 256 512 512 1024 --completion-tokens 512 512 1024 1024 \
  --dst-csv ./token_throughput.csv
```

You will find the `token_throughput` metrics in `./token_throughput.csv`:
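For a quick look at the results, the CSV can be loaded with pandas (assuming pandas is installed; the exact column names are whatever the script writes and may differ from the table headers below):

```python
import pandas as pd

# Load and display the metrics written by profile_generation.py.
df = pd.read_csv("./token_throughput.csv")
print(df.to_string(index=False))
```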
| batch | prompt_tokens | completion_tokens | thr_per_proc (token/s) | thr_per_node (token/s) | rpm (req/min) | mem_per_proc (GB) | mem_per_gpu (GB) | mem_per_node (GB) |
| ----- | ------------- | ----------------- | ---------------------- | ---------------------- | ------------- | ----------------- | ---------------- | ----------------- |
| 1     | 256           | 512               | 88.77                  | 710.12                 | -             | 15.65             | 15.65            | 125.21            |
| 1     | 512           | 512               | 83.89                  | 671.15                 | -             | 15.68             | 15.68            | 125.46            |
| 1     | 512           | 1024              | 80.19                  | 641.5                  | -             | 15.68             | 15.68            | 125.46            |
| 1     | 1024          | 1024              | 72.34                  | 578.74                 | -             | 15.75             | 15.75            | 125.96            |
| 1     | 1             | 2048              | 80.69                  | 645.55                 | -             | 15.62             | 15.62            | 124.96            |
| 8     | 256           | 512               | 565.21                 | 4521.67                | -             | 32.37             | 32.37            | 258.96            |
| 8     | 512           | 512               | 489.04                 | 3912.33                | -             | 32.62             | 32.62            | 260.96            |
| 8     | 512           | 1024              | 467.23                 | 3737.84                | -             | 32.62             | 32.62            | 260.96            |
| 8     | 1024          | 1024              | 383.4                  | 3067.19                | -             | 33.06             | 33.06            | 264.46            |
| 8     | 1             | 2048              | 487.74                 | 3901.93                | -             | 32.12             | 32.12            | 256.96            |
| 16    | 256           | 512               | 792.7                  | 6341.6                 | -             | 51.46             | 51.46            | 411.71            |
| 16    | 512           | 512               | 639.4                  | 5115.17                | -             | 51.93             | 51.93            | 415.46            |
| 16    | 512           | 1024              | 591.39                 | 4731.09                | -             | 51.93             | 51.93            | 415.46            |
| 16    | 1024          | 1024              | 449.11                 | 3592.85                | -             | 52.06             | 52.06            | 416.46            |
| 16    | 1             | 2048              | 620.5                  | 4964.02                | -             | 51                | 51               | 407.96            |
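For reference, the per-proc and per-node columns differ by a constant factor of eight, consistent with one process per GPU on an 8-GPU node, and the throughput itself is simply generated tokens over wall-clock time. A sketch of the arithmetic, where the elapsed time is hypothetical and chosen to reproduce one row above:

```python
# One process per GPU; the x8 gap between thr_per_proc and
# thr_per_node in the table implies an 8-GPU node.
GPUS_PER_NODE = 8

batch, completion_tokens = 16, 512
elapsed_s = 10.33            # hypothetical wall-clock time for the batch

thr_per_proc = batch * completion_tokens / elapsed_s   # ~792.7 token/s
thr_per_node = thr_per_proc * GPUS_PER_NODE            # ~6341.6 token/s
print(f"{thr_per_proc:.1f} token/s per proc, {thr_per_node:.1f} per node")
```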
### request throughput

LMDeploy uses the ShareGPT dataset to test request throughput. Try the following commands, and you will get the `rpm` (request per minute) metric:

```shell
# download the ShareGPT dataset
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
# run benchmark script
python profile_throughput.py \
  ShareGPT_V3_unfiltered_cleaned_split.json \
  ./workspace \
  --concurrency 16
```
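If you want to see what the benchmark replays, the ShareGPT file is a JSON list of dialogues; the field names below match the public dataset release but are worth verifying against your download:

```python
import json

with open("ShareGPT_V3_unfiltered_cleaned_split.json") as f:
    data = json.load(f)

# Expected shape: [{"id": ..., "conversations":
#                   [{"from": "human", "value": ...}, ...]}, ...]
print(f"{len(data)} dialogues loaded")
turn = data[0]["conversations"][0]
print(turn["from"], ":", turn["value"][:80])
```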