InternLM-Chat-20B-4bit


Technical Information

Official website
https://www.shlab.org.cn/
Open-source repository
https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-chat-20b-4bit
License
Apache License 2.0

Details

LMDeploy supports LLM model inference of 4-bit weight, with the minimum requirement for NVIDIA graphics cards being sm80, such as the A10, A100, and GeForce 30/40 series.

Before proceeding with the inference of internlm-chat-20b-4bit, please ensure that lmdeploy is installed.

pip install 'lmdeploy>=0.0.11'
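The sm80 requirement corresponds to compute capability 8.0 or higher. A minimal sketch for checking the local GPU with PyTorch (assuming torch is installed; it is not required by the steps below):

# check_sm80.py -- verify the GPU meets the sm80 requirement for 4-bit inference
import torch

assert torch.cuda.is_available(), "no CUDA device visible"
major, minor = torch.cuda.get_device_capability(0)  # e.g. (8, 0) on A100
print(f"compute capability: sm{major}{minor}")
if (major, minor) < (8, 0):
    print("warning: 4-bit weight inference needs sm80+ (e.g. A10, A100, GeForce 30/40)")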

Inference

Please download the internlm-chat-20b-4bit model as follows,

git-lfs install
git clone --depth=1 https://www.modelscope.cn/Shanghai_AI_Laboratory/internlm-chat-20b-4bit.git

As demonstrated in the command below, first convert the model's layout using turbomind.deploy, and then you can interact with the AI assistant in the terminal:

# Convert the model's layout and store it in the default path, ./workspace.
python3 -m lmdeploy.serve.turbomind.deploy \
    --model-name internlm-chat-20b \
    --model-path ./internlm-chat-20b-4bit \
    --model-format awq \
    --group-size 128

# inference
python3 -m lmdeploy.turbomind.chat ./workspace
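If you prefer to script both steps, here is a minimal Python wrapper around the exact commands above (it uses only the CLI, no lmdeploy Python API):

# run_pipeline.py -- convert the layout, then start the terminal chat
import subprocess

# convert the model's layout into the default path, ./workspace
subprocess.run([
    "python3", "-m", "lmdeploy.serve.turbomind.deploy",
    "--model-name", "internlm-chat-20b",
    "--model-path", "./internlm-chat-20b-4bit",
    "--model-format", "awq",
    "--group-size", "128",
], check=True)

# start the interactive terminal session against the converted workspace
subprocess.run(["python3", "-m", "lmdeploy.turbomind.chat", "./workspace"], check=True)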

Serve with gradio

If you wish to interact with the model via a web UI, please launch the gradio server as indicated below:

python3 -m lmdeploy.serve.gradio.app ./workspace --server_name {ip_addr} --server_port {port}

Subsequently, you can open the website http://{ip_addr}:{port} in your browser and interact with the model.
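For example, to listen on all interfaces at port 6006 (both values are placeholders, substitute your own):

python3 -m lmdeploy.serve.gradio.app ./workspace --server_name 0.0.0.0 --server_port 6006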

Besides serving with gradio, there are two more serving methods: one is serving with Triton Inference Server (TIS), and the other is an OpenAI-like server named api_server.

Please refer to the user guide for detailed information if you are interested.
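Since api_server is described as OpenAI-like, a request against it will look roughly as follows. The address, route, and payload shape here are assumptions for illustration; the user guide has the authoritative API:

# query_api_server.py -- sketch of calling the OpenAI-like api_server
# NOTE: the port, route, and payload fields below are assumptions, not verified
import json
import urllib.request

payload = {
    "model": "internlm-chat-20b",
    "messages": [{"role": "user", "content": "Hello!"}],
}
req = urllib.request.Request(
    "http://localhost:23333/v1/chat/completions",  # assumed address and route
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))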

Inference Performance

LMDeploy provides scripts for benchmarking token throughput and request throughput.

Token throughput tests the speed of generating new tokens, given a specified number of prompt tokens and completion tokens, while request throughput measures the number of requests processed per minute with real dialogue data.

We conducted benchmarks on internlm-chat-20b-4bit. Token throughput was measured by setting 256 prompt tokens and generating 512 tokens in response on an A100-80G.

Note: The session_len in workspace/triton_models/weights/config.ini was changed to 2056 in our test.
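To reproduce that edit programmatically, here is a sketch with Python's configparser (the section name "llama" is an assumption about the config layout; inspect the file first):

# set_session_len.py -- apply the session_len change noted above
import configparser

path = "workspace/triton_models/weights/config.ini"
cfg = configparser.ConfigParser()
cfg.read(path)
cfg["llama"]["session_len"] = "2056"  # section name assumed; check config.ini to confirm
with open(path, "w") as f:
    cfg.write(f)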

| batch | tensor parallel | prompt_tokens | completion_tokens | thr_per_proc (token/s) | rpm (req/min) | mem_per_proc (GB) |
|-------|-----------------|---------------|-------------------|------------------------|---------------|-------------------|
| 1     | 1               | 256           | 512               | 88.77                  | -             | 15.65             |
| 16    | 1               | 256           | 512               | 792.7                  | 220.23        | 51.46             |
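To make the figures concrete: at batch 1, 88.77 token/s per process means a single 512-token reply takes about 512 / 88.77 ≈ 5.8 s, while at batch 16 a full batch of 16 such replies completes in roughly 16 × 512 / 792.7 ≈ 10.3 s.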

token throughput

Run the following command,

python benchmark/profile_generation.py \
  --model-path ./workspace \
  --concurrency 1 8 16 --prompt-tokens 256 512 512 1024 --completion-tokens 512 512 1024 1024 \
  --dst-csv ./token_throughput.csv

You will find the token_throughput metrics in ./token_throughput.csv.

| batch | prompt_tokens | completion_tokens | thr_per_proc (token/s) | thr_per_node (token/s) | rpm (req/min) | mem_per_proc (GB) | mem_per_gpu (GB) | mem_per_node (GB) |
|-------|---------------|-------------------|------------------------|------------------------|---------------|-------------------|------------------|-------------------|
| 1     | 256           | 512               | 88.77                  | 710.12                 | -             | 15.65             | 15.65            | 125.21            |
| 1     | 512           | 512               | 83.89                  | 671.15                 | -             | 15.68             | 15.68            | 125.46            |
| 1     | 512           | 1024              | 80.19                  | 641.5                  | -             | 15.68             | 15.68            | 125.46            |
| 1     | 1024          | 1024              | 72.34                  | 578.74                 | -             | 15.75             | 15.75            | 125.96            |
| 1     | 1             | 2048              | 80.69                  | 645.55                 | -             | 15.62             | 15.62            | 124.96            |
| 8     | 256           | 512               | 565.21                 | 4521.67                | -             | 32.37             | 32.37            | 258.96            |
| 8     | 512           | 512               | 489.04                 | 3912.33                | -             | 32.62             | 32.62            | 260.96            |
| 8     | 512           | 1024              | 467.23                 | 3737.84                | -             | 32.62             | 32.62            | 260.96            |
| 8     | 1024          | 1024              | 383.4                  | 3067.19                | -             | 33.06             | 33.06            | 264.46            |
| 8     | 1             | 2048              | 487.74                 | 3901.93                | -             | 32.12             | 32.12            | 256.96            |
| 16    | 256           | 512               | 792.7                  | 6341.6                 | -             | 51.46             | 51.46            | 411.71            |
| 16    | 512           | 512               | 639.4                  | 5115.17                | -             | 51.93             | 51.93            | 415.46            |
| 16    | 512           | 1024              | 591.39                 | 4731.09                | -             | 51.93             | 51.93            | 415.46            |
| 16    | 1024          | 1024              | 449.11                 | 3592.85                | -             | 52.06             | 52.06            | 416.46            |
| 16    | 1             | 2048              | 620.5                  | 4964.02                | -             | 51                | 51               | 407.96            |
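The per-node columns are consistently eight times the per-process columns (for example 8 × 88.77 ≈ 710.12), which suggests the node figures assume an 8-GPU A100 node. A small sketch for pulling the best per-process throughput out of the CSV (the column names are assumptions based on the table headers above; adjust them to match the actual file):

# read_throughput_csv.py -- best token throughput per batch size
# NOTE: column names are assumed to mirror the table above; check the CSV header
import csv
from collections import defaultdict

best = defaultdict(float)
with open("./token_throughput.csv") as f:
    for row in csv.DictReader(f):
        b = int(row["batch"])
        best[b] = max(best[b], float(row["thr_per_proc (token/s)"]))

for b, thr in sorted(best.items()):
    print(f"batch={b}: best {thr:.2f} token/s per process")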

request throughput

LMDeploy uses the ShareGPT dataset to test request throughput. Run the commands below, and you will get the rpm (requests per minute) metric.

# download the ShareGPT dataset
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

# run benchmark script
python profile_throughput.py \
 ShareGPT_V3_unfiltered_cleaned_split.json \
 ./workspace \
 --concurrency 16
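To sanity-check the download before benchmarking, here is a minimal peek at the dataset (the id/conversations/from/value field names follow the common ShareGPT layout; treat them as assumptions if your copy differs):

# peek_sharegpt.py -- count dialogues and show the first turn of the first one
import json

with open("ShareGPT_V3_unfiltered_cleaned_split.json") as f:
    data = json.load(f)

print(f"{len(data)} dialogues loaded")
turn = data[0]["conversations"][0]            # each record holds a list of turns
print(turn["from"], ":", turn["value"][:80])  # speaker tag and a text snippet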

