
Technical Information

Open-source URL
https://modelscope.cn/models/AI-ModelScope/imp-v1-3b
License
Apache License 2.0

Details

Imp

A very small man can cast a very large shadow.

——George R.R. Martin, A Clash of Kings

[Technical report (coming soon)]  [Demo]  [Github]

Introduction

The Imp project aims to provide a family of strong multimodal small language models (MSLMs). Our imp-v1-3b is a strong MSLM with only 3B parameters, which is built upon a small yet powerful SLM Phi-2 (2.7B) and a powerful visual encoder SigLIP (0.4B), and trained on the LLaVA-v1.5 training set.
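For intuition, this follows the LLaVA-style recipe: SigLIP patch features are projected into the language model's embedding space and spliced into the token sequence at the <image> position. The sketch below illustrates that composition; the class name, dimensions, and two-layer MLP projector are assumptions for illustration, not the released implementation:

import torch
import torch.nn as nn

class ImpStyleMSLM(nn.Module):
    # Simplified LLaVA-style composition (illustrative only): a visual
    # encoder, a projector, and a small language model.
    def __init__(self, vision_encoder, language_model, vision_dim=1152, llm_dim=2560):
        super().__init__()
        self.vision_encoder = vision_encoder    # e.g., SigLIP (0.4B)
        self.language_model = language_model    # e.g., Phi-2 (2.7B)
        # Two-layer MLP mapping patch features to LLM embeddings
        # (borrowed from LLaVA-v1.5; the real projector may differ).
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))

    def forward(self, pixel_values, embeds_before, embeds_after):
        patch_features = self.vision_encoder(pixel_values)   # (B, N, vision_dim)
        image_embeds = self.projector(patch_features)        # (B, N, llm_dim)
        # Splice the projected image tokens in at the <image> placeholder.
        inputs_embeds = torch.cat([embeds_before, image_embeds, embeds_after], dim=1)
        return self.language_model(inputs_embeds=inputs_embeds)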

As shown in the table below, imp-v1-3b significantly outperforms counterparts of similar model size, and even achieves slightly better performance than the strong LLaVA-7B model on various multimodal benchmarks.

We release our model weights and provide an example below to run our model. A detailed technical report and the corresponding training/evaluation code will be released soon on our GitHub repo. We will persistently improve our model and release the next versions to further improve model performance :)

How to use

Install modelscope

pip install modelscope
pip install -q pillow accelerate einops
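A quick import check (optional; these are just the packages installed above) confirms the environment before moving on:

import modelscope, accelerate, einops, PIL
# Print versions to confirm everything resolved correctly.
print(modelscope.__version__, accelerate.__version__, einops.__version__, PIL.__version__)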

You can use the following code for model inference. The format of text instruction is similar to LLaVA.

import torch
from modelscope import AutoModelForCausalLM, AutoTokenizer
from PIL import Image

torch.set_default_device("cuda")

# Create model
model = AutoModelForCausalLM.from_pretrained(
    "AI-ModelScope/imp-v1-3b",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("AI-ModelScope/imp-v1-3b", trust_remote_code=True)

# Set inputs
text = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWhat are the colors of the bus in the image? ASSISTANT:"
image = Image.open("images/bus.jpg")

input_ids = tokenizer(text, return_tensors='pt').input_ids
image_tensor = model.image_preprocess(image)

# Generate the answer
output_ids = model.generate(
    input_ids,
    max_new_tokens=100,
    images=image_tensor,
    use_cache=True)[0]
print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
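The instruction format above follows LLaVA-v1.5, so other questions can reuse the same template. A small helper like the one below (hypothetical, not part of the released code) keeps prompts consistent:

def build_prompt(question: str) -> str:
    # Wrap a user question in the LLaVA-v1.5-style template used above.
    system = ("A chat between a curious user and an artificial intelligence "
              "assistant. The assistant gives helpful, detailed, and polite "
              "answers to the user's questions.")
    return f"{system} USER: <image>\n{question} ASSISTANT:"

text = build_prompt("How many people are on the bus?")

Since the remote code builds on the transformers generation API, model.generate should also accept the usual decoding arguments (do_sample, temperature, etc.); as written above, it decodes greedily.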

Model evaluation

We conduct evaluation on 9 commonly used benchmarks, including 5 academic VQA benchmarks and 4 popular MLLM benchmarks, to compare our Imp model with LLaVA (7B) and existing MSLMs of similar model sizes.

| Models | Size | VQAv2 | GQA | VizWiz | SQA(IMG) | TextVQA | POPE | MME(P) | MMB | MM-Vet |
|--------|------|-------|-----|--------|----------|---------|------|--------|-----|--------|
| LLaVA-v1.5-lora | 7B | 79.10 | 63.00 | 47.80 | 68.40 | 58.20 | 86.40 | 1476.9 | 66.10 | 30.2 |
| TinyGPT-V | 3B | - | 33.60 | 24.80 | - | - | - | - | - | - |
| LLaVA-Phi | 3B | 71.40 | - | 35.90 | 68.40 | 48.60 | 85.00 | 1335.1 | 59.80 | 28.9 |
| MobileVLM | 3B | - | 59.00 | - | 61.00 | 47.50 | 84.90 | 1288.9 | 59.60 | - |
| MC-LLaVA-3b | 3B | 64.24 | 49.60 | 24.88 | - | 38.59 | 80.59 | - | - | - |
| Imp-v1 (ours) | 3B | 79.45 | 58.55 | 50.09 | 69.96 | 59.38 | 88.02 | 1434.0 | 66.49 | 33.1 |

Examples

[example image]

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

About us

This project is maintained by the MILVLG group at Hangzhou Dianzi University (HDU), led by Prof. Zhou Yu and Jun Yu, and is mainly developed by Zhenwei Shao and Xuecheng Ouyang. We hope our model may serve as a strong baseline to inspire future research on MSLMs, as well as their derivative applications on mobile devices and robots.

Citation

If you use our model or refer to our work in your studies, please cite:

@misc{imp2024,
  author = {Shao, Zhenwei and Ouyang, Xuecheng and Yu, Zhou and Yu, Jun},
  title = {Imp-v1: An Empirical Study of Multimodal Small Language Models},
  year = {2024},
  url = {https://huggingface.co/MILVLG/imp-v1-3b}
}

