Open-source repository
https://modelscope.cn/models/iic/ONE-PEACE-4B
License
Apache License 2.0

ONE-PEACE

Paper | Demo | Checkpoints | Datasets | GitHub

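What is ONE-PEACE?

ONE-PEACE is a general representation model covering three modalities: vision, language, and audio. It sets new state-of-the-art results on semantic segmentation, audio-text retrieval, audio classification, and visual grounding, and it achieves competitive results on video classification, image classification, image-text retrieval, and classic multimodal benchmarks. The model also shows new zero-shot capabilities: it aligns modalities that were never paired in our pretraining data, for example audio with images, or audio plus text with images.

The figure below shows the model architecture and pretraining tasks of ONE-PEACE. With a scaling-friendly architecture and modality-agnostic tasks, ONE-PEACE has the potential to extend to an unlimited number of modalities.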

How to use ONE-PEACE

Basic setup

# ModelScope notebooks already ship with modelscope, so no install is needed there
# Otherwise, use one of the ModelScope images:
# GPU: registry.cn-hangzhou.aliyuncs.com/modelscope-repo/modelscope:ubuntu20.04-cuda11.3.0-py38-torch1.11.0-tf1.15.5-1.8.1
# CPU: registry.cn-hangzhou.aliyuncs.com/modelscope-repo/modelscope:ubuntu20.04-py38-torch1.11.0-tf1.15.5-1.8.1
# Requires torch<2.0.0
# pip install modelscope
git clone https://github.com/OFA-Sys/ONE-PEACE
cd ONE-PEACE
pip install -r requirements.txt
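The setup notes require torch<2.0.0. As a minimal sketch of how that constraint could be checked before running the model, the helper below parses a version string and tests the major version. The function names `parse_major_minor` and `satisfies_torch_requirement` are hypothetical and not part of ModelScope or ONE-PEACE.

```python
import re


def parse_major_minor(version: str) -> tuple:
    """Extract (major, minor) from a version string such as '1.11.0+cu113'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def satisfies_torch_requirement(version: str) -> bool:
    """Return True when the version is below 2.0.0 (hypothetical helper)."""
    major, _minor = parse_major_minor(version)
    return major < 2


# Example with literal version strings, so the sketch runs without torch installed.
ok_version = satisfies_torch_requirement("1.11.0+cu113")
too_new = satisfies_torch_requirement("2.0.1")
print(ok_version, too_new)
```

In a real environment the same check could be applied to `torch.__version__` after importing torch.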

Getting started

from modelscope.models import Model
from modelscope.pipelines import pipeline

# Build the embedding pipeline; use_gpu=False runs it on CPU
inference = pipeline('multimodal_embedding', model='damo/ONE-PEACE-4B', model_revision='v1.0.2', use_gpu=False)

# Extract features for each modality
text_features = inference(["bird", "dog", "panda"], data_type='text')
image_features = inference(
    ['https://one-peace-shanghai.oss-cn-shanghai.aliyuncs.com/modelscope_case/dog.JPEG', 'https://one-peace-shanghai.oss-cn-shanghai.aliyuncs.com/modelscope_case/panda.JPEG'],
    data_type='image'
)
audio_features = inference(
    ['https://one-peace-shanghai.oss-cn-shanghai.aliyuncs.com/modelscope_case/bird.flac', 'https://one-peace-shanghai.oss-cn-shanghai.aliyuncs.com/modelscope_case/dog.flac'],
    data_type='audio'
)

# compute similarity: rows are images (or audio clips), columns are text labels
i2t_similarity = image_features @ text_features.T
a2t_similarity = audio_features @ text_features.T
print("Image-to-text similarities:", i2t_similarity)
print("Audio-to-text similarities:", a2t_similarity)
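The similarity matrices computed above can be turned into zero-shot predictions by taking a softmax over the text labels in each row. The sketch below illustrates this with NumPy on made-up placeholder scores, not real model outputs. The helper `best_labels` is hypothetical, and if the pipeline returns torch tensors they would first need converting to arrays.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def best_labels(similarity, labels):
    """Pick the highest-scoring label for each row of a similarity matrix."""
    probs = softmax(np.asarray(similarity, dtype=np.float64), axis=-1)
    indices = probs.argmax(axis=-1)
    return [labels[i] for i in indices], probs


labels = ["bird", "dog", "panda"]
# Placeholder scores: row 0 stands in for the dog image, row 1 for the panda image.
toy_i2t_similarity = np.array([
    [0.10, 0.62, 0.15],
    [0.08, 0.12, 0.70],
])
predicted, probs = best_labels(toy_i2t_similarity, labels)
print(predicted)  # ['dog', 'panda']
```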

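Why is ONE-PEACE the best choice among multimodal representation models?

As a 4B-parameter general representation model, ONE-PEACE achieves leading results on a range of vision, audio, and multimodal tasks. It also has strong multimodal retrieval capabilities and can retrieve across all three modalities: image, text, and audio.

Downstream task results

Vision tasks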

Task	Image Classification	Semantic Segmentation	Object Detection (w/o Object365)	Video Action Recognition
Dataset	ImageNet-1K	ADE20K	COCO	Kinetics 400
Split	val	val	val	val
Metric	Acc.	mIoU^ss / mIoU^ms	AP^box / AP^mask	Top-1 Acc. / Top-5 Acc.
ONE-PEACE	89.8	62.0 / 63.0	60.4 / 52.9	88.1 / 97.8

Audio(-text) tasks

Task	Audio-Text Retrieval	Audio-Text Retrieval	Audio-Text Retrieval	Audio-Text Retrieval	Audio Classification	Audio Classification	Audio Classification	Audio Question Answering
Dataset	AudioCaps	AudioCaps	Clotho	Clotho	ESC-50	FSD50K	VGGSound (Audio Only)	AVQA (Audio + Question)
Split	test	test	evaluation	evaluation	full	eval	test	val
Metric	T2A R@1	A2T R@1	T2A R@1	A2T R@1	Zero-shot Acc.	MAP	Acc.	Acc.
ONE-PEACE	42.5	51.0	22.4	27.1	91.8	69.7	59.6	86.2
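The retrieval columns above report R@1 (recall at rank 1): the fraction of queries whose correct match is the top-ranked candidate. Here is a minimal sketch of that metric, assuming a square similarity matrix in which query i has exactly one correct candidate, at index i. Real benchmarks such as AudioCaps pair several captions with each clip, so their evaluation scripts differ. The function `recall_at_k` and the toy scores are illustrative only.

```python
import numpy as np


def recall_at_k(similarity, k=1):
    """Recall@k for a square matrix where candidate i is the match for query i."""
    sim = np.asarray(similarity, dtype=np.float64)
    # Indices of the k highest-scoring candidates for every query (row).
    top_k = np.argsort(-sim, axis=1)[:, :k]
    targets = np.arange(sim.shape[0])[:, None]
    hits = (top_k == targets).any(axis=1)
    return float(hits.mean())


# Placeholder scores: queries 0 and 2 rank their match first, query 1 ranks it second.
toy_similarity = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.3, 0.5],
    [0.1, 0.2, 0.7],
])
r_at_1 = recall_at_k(toy_similarity, k=1)  # 2 of 3 queries hit
r_at_2 = recall_at_k(toy_similarity, k=2)  # all 3 queries hit
print(r_at_1, r_at_2)
```

Under the same one-to-one assumption, transposing the matrix gives the opposite retrieval direction (for example A2T instead of T2A).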

Vision-language tasks

Task	Image-Text Retrieval (w/o ranking)	Image-Text Retrieval (w/o ranking)	Image-Text Retrieval (w/o ranking)	Image-Text Retrieval (w/o ranking)	Visual Grounding	Visual Grounding	Visual Grounding	VQA	Visual Reasoning
Dataset	COCO	COCO	Flickr30K	Flickr30K	RefCOCO	RefCOCO+	RefCOCOg	VQAv2	NLVR2
Split	test	test	test	test	val / testA / testB	val / testA / testB	val-u / test-u	test-dev / test-std	dev / test-P
Metric	I2T R@1	T2I R@1	I2T R@1	T2I R@1	Acc@0.5	Acc@0.5	Acc@0.5	Acc.	Acc.
ONE-PEACE	84.1	65.4	97.6	89.6	92.58 / 94.18 / 89.26	88.77 / 92.21 / 83.23	89.22 / 89.27	82.6 / 82.5	87.8 / 88.3

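Multimodal retrieval

As shown in the figures below, we use example cases to demonstrate ONE-PEACE's audio-to-image, audio+image-to-image, and audio+text-to-image retrieval.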

a2i

at2i

ai2i
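The source does not say how the two query modalities are combined in the audio+text and audio+image cases. One plausible approach, shown here purely as an assumption, is to sum the L2-normalized embeddings of the query parts and rank gallery images by cosine similarity. The vectors below are tiny hand-made placeholders, not ONE-PEACE embeddings, and `retrieve` is a hypothetical helper.

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors to unit length along the last axis."""
    x = np.asarray(x, dtype=np.float64)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def retrieve(query_parts, gallery, top_k=1):
    """Rank gallery rows against the sum of normalized query embeddings.

    Summing the parts is an assumption made for illustration only.
    """
    combined = np.sum([l2_normalize(part) for part in query_parts], axis=0)
    query = l2_normalize(combined)
    scores = l2_normalize(gallery) @ query  # cosine similarity per gallery item
    order = np.argsort(-scores)[:top_k]
    return order.tolist(), scores


audio_vec = np.array([1.0, 0.0, 0.0])  # placeholder audio embedding
text_vec = np.array([0.0, 1.0, 0.0])   # placeholder text embedding
gallery = np.array([
    [1.0, 0.0, 0.0],  # image 0: matches the audio only
    [0.7, 0.7, 0.0],  # image 1: matches audio and text together
    [0.0, 0.0, 1.0],  # image 2: unrelated
])

audio_only, _ = retrieve([audio_vec], gallery)
audio_plus_text, _ = retrieve([audio_vec, text_vec], gallery)
print(audio_only, audio_plus_text)  # [0] [1]
```

In this toy setup, adding the text embedding shifts the top result from image 0 to image 1, which is the behavior a composed query is meant to produce.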

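Model limitations and possible bias

The model is trained mainly on open-source English data, so its representations for Chinese may be less reliable.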
