ONE-PEACE
Paper | Demo | Checkpoints | Datasets | GitHub
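What is ONE-PEACE?
ONE-PEACE is a general representation model across the vision, audio, and language modalities. It achieves new SOTA results on semantic segmentation, audio-text retrieval, audio classification, and visual grounding, and leading results on video classification, image classification, image-text retrieval, and classic multimodal benchmarks.
The model also shows emergent zero-shot capabilities: it aligns modalities that were never paired in our pretraining data, such as audio with image, or audio + text with image.
The figure below shows the model architecture and pretraining tasks of ONE-PEACE. With a scaling-friendly architecture and modality-agnostic tasks, ONE-PEACE has the potential to extend to unlimited modalities.
How to use ONE-PEACE
Basic setup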
# ModelScope notebooks already ship with modelscope, so no installation is needed there
# Otherwise, use one of the ModelScope images:
# GPU: registry.cn-hangzhou.aliyuncs.com/modelscope-repo/modelscope:ubuntu20.04-cuda11.3.0-py38-torch1.11.0-tf1.15.5-1.8.1
# CPU: registry.cn-hangzhou.aliyuncs.com/modelscope-repo/modelscope:ubuntu20.04-py38-torch1.11.0-tf1.15.5-1.8.1
# Requires torch<2.0.0
# pip install modelscope
git clone https://github.com/OFA-Sys/ONE-PEACE
cd ONE-PEACE
pip install -r requirements.txt
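The `torch<2.0.0` requirement above can be checked before loading the model. The following is a minimal sketch; the helper name `torch_version_ok` is hypothetical and not part of ModelScope or ONE-PEACE. It only parses a version string, so it runs even where torch is not installed.

```python
def torch_version_ok(version: str) -> bool:
    """Return True if the given torch version string satisfies torch<2.0.0."""
    # Drop a local build suffix such as "+cu113" before parsing.
    public_version = version.split("+")[0]
    major = int(public_version.split(".")[0])
    return major < 2


# Example version strings (illustrative only).
print(torch_version_ok("1.11.0+cu113"))
print(torch_version_ok("2.0.1"))
```

In a real environment you would pass `torch.__version__` to the helper after `import torch`.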
Getting started
from modelscope.models import Model
from modelscope.pipelines import pipeline

# Build the embedding pipeline (set use_gpu=True if a GPU is available)
inference = pipeline('multimodal_embedding', model='damo/ONE-PEACE-4B', model_revision='v1.0.2', use_gpu=False)

# Extract text, image, and audio features
text_features = inference(["bird", "dog", "panda"], data_type='text')
image_features = inference(
    ['https://one-peace-shanghai.oss-cn-shanghai.aliyuncs.com/modelscope_case/dog.JPEG', 'https://one-peace-shanghai.oss-cn-shanghai.aliyuncs.com/modelscope_case/panda.JPEG'],
    data_type='image'
)
audio_features = inference(
    ['https://one-peace-shanghai.oss-cn-shanghai.aliyuncs.com/modelscope_case/bird.flac', 'https://one-peace-shanghai.oss-cn-shanghai.aliyuncs.com/modelscope_case/dog.flac'],
    data_type='audio'
)

# compute similarity
i2t_similarity = image_features @ text_features.T
a2t_similarity = audio_features @ text_features.T
print("Image-to-text similarities:", i2t_similarity)
print("Audio-to-text similarities:", a2t_similarity)
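The similarity matrices printed above have one row per image (or audio clip) and one column per text label. As a minimal sketch of how such a matrix could be turned into a predicted label per row, the example below uses plain Python lists so it runs without the model. The numbers in `toy_i2t` are made up for illustration and are not real model output, and the helper names `softmax` and `best_match` are hypothetical.

```python
import math


def softmax(row, temperature=1.0):
    """Convert one row of similarity scores into probabilities."""
    highest = max(row)  # subtract the max for numerical stability
    exps = [math.exp((x - highest) / temperature) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def best_match(similarity, labels):
    """For each row, return the highest-scoring label and its softmax probability."""
    results = []
    for row in similarity:
        probs = softmax(row)
        idx = max(range(len(row)), key=lambda j: row[j])
        results.append((labels[idx], probs[idx]))
    return results


labels = ["bird", "dog", "panda"]
# Hypothetical image-to-text similarities for two images (not real model output).
toy_i2t = [[0.10, 0.35, 0.12],
           [0.08, 0.11, 0.40]]
matches = best_match(toy_i2t, labels)
for label, prob in matches:
    print(label, round(prob, 3))
```

Assuming the pipeline returns torch tensors, as the `@` and `.T` operations in the snippet suggest, `i2t_similarity.tolist()` would give the nested list this sketch expects.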
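Why is ONE-PEACE the best choice for a multimodal representation model?
As a 4B-parameter general representation model, ONE-PEACE achieves leading results on a range of vision, audio, and multimodal tasks.
ONE-PEACE also has strong multimodal retrieval capabilities and supports mutual retrieval among image, text, and audio.
Downstream task results
Vision tasks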
| Task | Image Classification | Semantic Segmentation | Object Detection (w/o Object365) | Video Action Recognition |
| Dataset | ImageNet-1K | ADE20K | COCO | Kinetics 400 |
| Split | val | val | val | val |
| Metric | Acc. | mIoU (ss) / mIoU (ms) | AP (box) / AP (mask) | Top-1 Acc. / Top-5 Acc. |
| ONE-PEACE | 89.8 | 62.0 / 63.0 | 60.4 / 52.9 | 88.1 / 97.8 |
Audio(-text) tasks
| Task | Audio-Text Retrieval | Audio-Text Retrieval | Audio-Text Retrieval | Audio-Text Retrieval | Audio Classification | Audio Classification | Audio Classification | Audio Question Answering |
| Dataset | AudioCaps | AudioCaps | Clotho | Clotho | ESC-50 | FSD50K | VGGSound (Audio Only) | AVQA (Audio + Question) |
| Split | test | test | evaluation | evaluation | full | eval | test | val |
| Metric | T2A R@1 | A2T R@1 | T2A R@1 | A2T R@1 | Zero-shot Acc. | MAP | Acc. | Acc. |
| ONE-PEACE | 42.5 | 51.0 | 22.4 | 27.1 | 91.8 | 69.7 | 59.6 | 86.2 |
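The retrieval columns above report R@1, the fraction of queries whose correct match is ranked first. As a minimal sketch of how Recall@k can be computed from a similarity matrix, the example below assumes a simplified setup in which query `i` has exactly one correct candidate, at index `i`. Real benchmarks such as AudioCaps pair each clip with several captions, so this is not the evaluation script behind the table; `recall_at_k` and the toy scores are hypothetical.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose correct candidate appears in the top k.

    similarity[i][j] is the score between query i and candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Made-up text-to-audio similarities for three queries (illustration only).
toy_similarity = [
    [0.9, 0.1, 0.0],  # query 0: correct candidate ranked first
    [0.2, 0.3, 0.8],  # query 1: correct candidate ranked second
    [0.1, 0.2, 0.7],  # query 2: correct candidate ranked first
]
r1 = recall_at_k(toy_similarity, k=1)
r2 = recall_at_k(toy_similarity, k=2)
print(r1, r2)
```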
Vision-language tasks
| Task | Image-Text Retrieval (w/o ranking) | Image-Text Retrieval (w/o ranking) | Image-Text Retrieval (w/o ranking) | Image-Text Retrieval (w/o ranking) | Visual Grounding | Visual Grounding | Visual Grounding | VQA | Visual Reasoning |
| Dataset | COCO | COCO | Flickr30K | Flickr30K | RefCOCO | RefCOCO+ | RefCOCOg | VQAv2 | NLVR2 |
| Split | test | test | test | test | val / testA / testB | val / testA / testB | val-u / test-u | test-dev / test-std | dev / test-P |
| Metric | I2T R@1 | T2I R@1 | I2T R@1 | T2I R@1 | Acc@0.5 | Acc@0.5 | Acc@0.5 | Acc. | Acc. |
| ONE-PEACE | 84.1 | 65.4 | 97.6 | 89.6 | 92.58 / 94.18 / 89.26 | 88.77 / 92.21 / 83.23 | 89.22 / 89.27 | 82.6 / 82.5 | 87.8 / 88.3 |
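The visual grounding columns report Acc@0.5: a prediction counts as correct when the intersection over union (IoU) between the predicted box and the ground-truth box is at least 0.5. The sketch below illustrates the metric under the assumption that boxes are given as `(x1, y1, x2, y2)` corner coordinates. The function names and boxes are hypothetical, and this is not the evaluation code used for the table.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def acc_at_iou(predictions, targets, threshold=0.5):
    """Fraction of predictions whose IoU with the target reaches the threshold."""
    hits = sum(box_iou(p, t) >= threshold for p, t in zip(predictions, targets))
    return hits / len(predictions)


# Made-up boxes: the first pair overlaps heavily, the second barely.
predictions = [(0, 0, 10, 10), (0, 0, 10, 10)]
targets = [(0, 0, 10, 8), (8, 8, 20, 20)]
accuracy = acc_at_iou(predictions, targets)
print(accuracy)
```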
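Multimodal retrieval
As shown in the figure below, we use example cases to demonstrate ONE-PEACE's audio-to-image, audio + image-to-image, and audio + text-to-image retrieval.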
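One simple way to build a combined query such as audio + text is to sum the unit-normalized embeddings and rank a gallery of image embeddings by cosine similarity. The sketch below illustrates that idea with toy 3-dimensional vectors. The fusion rule is an assumption made for illustration, since this page does not specify how combined queries are formed, and the helpers `normalize`, `fuse`, and `rank` are hypothetical.

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def fuse(*embeddings):
    """Assumed fusion rule: sum the unit-normalized embeddings, then renormalize."""
    units = [normalize(e) for e in embeddings]
    summed = [sum(values) for values in zip(*units)]
    return normalize(summed)


def rank(query, gallery):
    """Return gallery indices sorted by cosine similarity to the query, best first."""
    q = normalize(query)
    scores = [sum(a * b for a, b in zip(q, normalize(g))) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)


# Toy embeddings (illustration only, not real ONE-PEACE features).
audio_emb = [1.0, 0.0, 0.0]
text_emb = [0.0, 1.0, 0.0]
image_gallery = [
    [1.0, 0.0, 0.0],  # matches the audio only
    [0.7, 0.7, 0.0],  # matches both audio and text
    [0.0, 0.0, 1.0],  # matches neither
]
order = rank(fuse(audio_emb, text_emb), image_gallery)
print(order)
```

With real features, `audio_emb` and `text_emb` would come from the `inference` pipeline with `data_type='audio'` and `data_type='text'`, and the gallery from `data_type='image'`.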


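Model limitations and possible biases
The model is trained mainly on open-source English data, so its representations of Chinese text may be weaker.
Related paper and citation
If you find ONE-PEACE useful, please cite our work: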
@article{wang2023one,
  title={ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities},
  author={Wang, Peng and Wang, Shijie and Lin, Junyang and Bai, Shuai and Zhou, Xiaohuan and Zhou, Jingren and Wang, Xinggang and Zhou, Chang},
  journal={arXiv preprint arXiv:2305.11172},
  year={2023}
}