Imp

A very small man can cast a very large shadow.
          ——George R.R. Martin, A Clash of Kings

[Technical report (coming soon)]  [Demo]  [Github]
Introduction
The Imp project aims to provide a family of strong multimodal small language models (MSLMs). Our imp-v1-3b is a strong MSLM with only 3B parameters. As shown in the Table below, imp-v1-3b significantly outperforms the counterparts of similar model sizes, and even achieves slightly better performance than the strong LLaVA-7B model on various multimodal benchmarks.
We release our model weights and provide an example below to run our model. A detailed technical report and the corresponding training/evaluation code will be released soon on our GitHub repo. We will persistently improve our model and release the next versions to further improve model performance :)

How to use
You can use the following code for model inference. The format of the text instruction is similar to LLaVA.
pip install modelscope
pip install -q pillow accelerate einops
import torch
from modelscope import AutoModelForCausalLM, AutoTokenizer
from PIL import Image

torch.set_default_device("cuda")

# Create model
model = AutoModelForCausalLM.from_pretrained(
    "AI-ModelScope/imp-v1-3b",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("AI-ModelScope/imp-v1-3b", trust_remote_code=True)

# Set inputs
text = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWhat are the colors of the bus in the image? ASSISTANT:"
image = Image.open("images/bus.jpg")

input_ids = tokenizer(text, return_tensors='pt').input_ids
image_tensor = model.image_preprocess(image)

# Generate the answer
output_ids = model.generate(
    input_ids,
    max_new_tokens=100,
    images=image_tensor,
    use_cache=True)[0]
print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
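For repeated queries, the inference steps above can be wrapped in a small helper. The sketch below is illustrative only: the ask_image function and its parameters are not part of the released API; it simply reuses the tokenizer, model.image_preprocess, and model.generate calls from the example, with the same prompt template, and assumes model and tokenizer are already loaded as shown.

# Illustrative helper (not part of the released API): wraps the inference
# steps shown above so that several image/question pairs can be queried.
def ask_image(image_path, question, max_new_tokens=100):
    prompt = (
        "A chat between a curious user and an artificial intelligence assistant. "
        "The assistant gives helpful, detailed, and polite answers to the user's questions. "
        f"USER: <image>\n{question} ASSISTANT:"
    )
    image = Image.open(image_path)
    input_ids = tokenizer(prompt, return_tensors='pt').input_ids
    image_tensor = model.image_preprocess(image)
    output_ids = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        images=image_tensor,
        use_cache=True)[0]
    return tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip()

print(ask_image("images/bus.jpg", "What are the colors of the bus in the image?"))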
Model evaluation
We conduct evaluation on 9 commonly-used benchmarks, including 5 academic VQA benchmarks and 4 popular MLLM benchmarks, to compare our Imp model with LLaVA (7B) and existing MSLMs of similar model sizes.
| Models | Size | VQAv2 | GQA | VizWiz | SQA(IMG) | TextVQA | POPE | MME(P) | MMB | MM-Vet |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| LLaVA-v1.5-lora | 7B | 79.10 | | 47.80 | 68.40 | 58.20 | 86.40 | | 66.10 | 30.2 |
| TinyGPT-V | 3B | - | 33.60 | 24.80 | - | - | - | - | - | - |
| LLaVA-Phi | 3B | 71.40 | - | 35.90 | 68.40 | 48.60 | 85.00 | 1335.1 | 59.80 | 28.9 |
| MobileVLM | 3B | - | 59.00 | - | 61.00 | 47.50 | 84.90 | 1288.9 | 59.60 | - |
| MC-LLaVA-3b | 3B | 64.24 | 49.60 | 24.88 | - | 38.59 | 80.59 | - | - | - |
| Imp-v1 (ours) | 3B | | 58.55 | | | | | 1434.0 | | |
Examples
License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
About us
This project is maintained by the MILVLG@Hangzhou Dianzi University (HDU), led by Prof. Zhou Yu and Jun Yu, and is mainly developed by Zhenwei Shao and Xuecheng Ouyang. We hope our model may serve as a strong baseline to inspire future research on MSLMs, as well as their derivative applications on mobile devices and robots.
Citation
If you use our model or refer to our work in your studies, please cite:
@misc{imp2024,
  author = {Shao, Zhenwei and Ouyang, Xuecheng and Yu, Zhou and Yu, Jun},
  title = {Imp-v1: An Empirical Study of Multimodal Small Language Models},
  year = {2024},
  url = {https://huggingface.co/MILVLG/imp-v1-3b}
}