CogVLM2-Video-Llama3-Base
Read this in English
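Introduction
CogVLM2-Video achieves state-of-the-art performance on multiple video question answering tasks and can understand videos of up to one minute in length. We provide two example videos that demonstrate CogVLM2-Video's video understanding and temporal grounding capabilities.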
Benchmark Performance
The figure below shows the performance of CogVLM2-Video on MVBench, VideoChatGPT-Bench, and the zero-shot VideoQA datasets (MSVD-QA, MSRVTT-QA, ActivityNet-QA).

Here, VCG refers to VideoChatGPT-Bench, ZS refers to the zero-shot VideoQA datasets, and MV-* refers to the main categories in MVBench.
The detailed benchmark results are as follows:
| Models | VCG-AVG | VCG-CI | VCG-DO | VCG-CU | VCG-TU | VCG-CO | ZS-AVG |
|---|---|---|---|---|---|---|---|
| IG-VLM GPT4V | 3.17 | 3.40 | 2.80 | 3.61 | 2.89 | 3.13 | 65.70 |
| ST-LLM | 3.15 | 3.23 | 3.05 | 3.74 | 2.93 | 2.81 | 62.90 |
| ShareGPT4Video | N/A | N/A | N/A | N/A | N/A | N/A | 46.50 |
| VideoGPT+ | 3.28 | 3.27 | 3.18 | 3.74 | 2.83 | 3.39 | 61.20 |
| VideoChat2HDmistral | 3.10 | 3.40 | 2.91 | 3.72 | 2.65 | 2.84 | 57.70 |
| PLLaVA-34B | 3.32 | 3.60 | 3.20 | 3.90 | 2.67 | 3.25 | 68.10 |
| CogVLM2-Video | 3.41 | 3.49 | 3.46 | 3.87 | 2.98 | 3.23 | 66.60 |
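As a quick sanity check on the table above, the VCG-AVG column appears to be the arithmetic mean of the five VideoChatGPT-Bench category scores (CI, DO, CU, TU, CO). This relationship is inferred from the reported numbers and is not stated by the authors. The sketch below recomputes the mean for each model with complete scores; the names `VCG_SCORES`, `VCG_AVG_REPORTED`, and `mean_matches` are hypothetical and exist only for this illustration.

```python
# Category scores (CI, DO, CU, TU, CO) copied from the table above.
# ShareGPT4Video is omitted because its VCG scores are N/A.
VCG_SCORES = {
    "IG-VLM GPT4V": [3.40, 2.80, 3.61, 2.89, 3.13],
    "ST-LLM": [3.23, 3.05, 3.74, 2.93, 2.81],
    "VideoGPT+": [3.27, 3.18, 3.74, 2.83, 3.39],
    "VideoChat2HDmistral": [3.40, 2.91, 3.72, 2.65, 2.84],
    "PLLaVA-34B": [3.60, 3.20, 3.90, 2.67, 3.25],
    "CogVLM2-Video": [3.49, 3.46, 3.87, 2.98, 3.23],
}

# VCG-AVG values as reported in the table above.
VCG_AVG_REPORTED = {
    "IG-VLM GPT4V": 3.17,
    "ST-LLM": 3.15,
    "VideoGPT+": 3.28,
    "VideoChat2HDmistral": 3.10,
    "PLLaVA-34B": 3.32,
    "CogVLM2-Video": 3.41,
}


def mean_matches(scores, reported, tol=0.005):
    """Return True if the mean of `scores` agrees with `reported`
    to within two-decimal rounding (assumed tolerance)."""
    return abs(sum(scores) / len(scores) - reported) < tol


for name, scores in VCG_SCORES.items():
    mean = sum(scores) / len(scores)
    status = "ok" if mean_matches(scores, VCG_AVG_REPORTED[name]) else "mismatch"
    print(f"{name}: mean={mean:.3f} reported={VCG_AVG_REPORTED[name]:.2f} [{status}]")
```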
Performance of CogVLM2-Video on the MVBench dataset:
| Models | AVG | AA | AC | AL | AP | AS | CO | CI | EN | ER | FA | FP | MA | MC | MD | OE | OI | OS | ST | SC | UA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| IG-VLM GPT4V | 43.7 | 72.0 | 39.0 | 40.5 | 63.5 | 55.5 | 52.0 | 11.0 | 31.0 | 59.0 | 46.5 | 47.5 | 22.5 | 12.0 | 12.0 | 18.5 | 59.0 | 29.5 | 83.5 | 45.0 | 73.5 |
| ST-LLM | 54.9 | 84.0 | 36.5 | 31.0 | 53.5 | 66.0 | 46.5 | 58.5 | 34.5 | 41.5 | 44.0 | 44.5 | 78.5 | 56.5 | 42.5 | 80.5 | 73.5 | 38.5 | 86.5 | 43.0 | 58.5 |
| ShareGPT4Video | 51.2 | 79.5 | 35.5 | 41.5 | 39.5 | 49.5 | 46.5 | 51.5 | 28.5 | 39.0 | 40.0 | 25.5 | 75.0 | 62.5 | 50.5 | 82.5 | 54.5 | 32.5 | 84.5 | 51.0 | 54.5 |
| VideoGPT+ | 58.7 | 83.0 | 39.5 | 34.0 | 60.0 | 69.0 | 50.0 | 60.0 | 29.5 | 44.0 | 48.5 | 53.0 | 90.5 | 71.0 | 44.0 | 85.5 | 75.5 | 36.0 | 89.5 | 45.0 | 66.5 |
| VideoChat2HDmistral | 62.3 | 79.5 | 60.0 | 87.5 | 50.0 | 68.5 | 93.5 | 71.5 | 36.5 | 45.0 | 49.5 | 87.0 | 40.0 | 76.0 | 92.0 | 53.0 | 62.0 | 45.5 | 36.0 | 44.0 | 69.5 |
| PLLaVA-34B | 58.1 | 82.0 | 40.5 | 49.5 | 53.0 | 67.5 | 66.5 | 59.0 | 39.5 | 63.5 | 47.0 | 50.0 | 70.0 | 43.0 | 37.5 | 68.5 | 67.5 | 36.5 | 91.0 | 51.5 | 79.0 |
| CogVLM2-Video | 62.3 | 85.5 | 41.5 | 31.5 | 65.5 | 79.5 | 58.5 | 77.0 | 28.5 | 42.5 | 54.0 | 57.0 | 91.5 | 73.0 | 48.0 | 91.0 | 78.0 | 36.0 | 91.5 | 47.0 | 68.5 |
Evaluation and Reproduction
We follow previous work to evaluate our model's performance. For each benchmark, we craft task-specific prompts:
```python
# For MVBench
prompt = f"Carefully watch the video and pay attention to the cause and sequence of events, the detail and movement of objects, and the action and pose of persons. Based on your observations, select the best option that accurately addresses the question.\n " + f"{prompt.replace('Short Answer.', '')}\n" + "Short Answer:"
# For VideoChatGPT-Bench
prompt = f"Carefully watch the video and pay attention to the cause and sequence of events, the detail and movement of objects, and the action and pose of persons. Based on your observations, comprehensively answer the following question. Your answer should be long and cover all the related aspects\n " + f"{prompt.replace('Short Answer.', '')}\n" + "Answer:"
# For Zero-shot VideoQA
prompt = f"The input consists of a sequence of key frames from a video. Answer the question comprehensively including all the possible verbs and nouns that can discribe the events, followed by significant events, characters, or objects that appear throughout the frames.\n " + f"{prompt.replace('Short Answer.', '')}\n" + "Answer:"
```
For the evaluation code, please refer to the evaluation script in PLLaVA.
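The three prompt templates above share the same shape: a benchmark-specific instruction, the question with any `Short Answer.` marker removed, and a closing answer cue. A minimal sketch of a helper that selects the template by benchmark is shown below. The function name `build_prompt`, the dictionary keys, and the separator handling are assumptions made for this illustration and are not part of the released evaluation code.

```python
# Shared opening of the MVBench and VideoChatGPT-Bench instructions.
_WATCH = (
    "Carefully watch the video and pay attention to the cause and sequence of "
    "events, the detail and movement of objects, and the action and pose of "
    "persons. Based on your observations, "
)

# Instruction text per benchmark; the keys are hypothetical labels.
INSTRUCTIONS = {
    "mvbench": _WATCH
    + "select the best option that accurately addresses the question.",
    "videochatgpt": _WATCH
    + "comprehensively answer the following question. "
    + "Your answer should be long and cover all the related aspects",
    # "discribe" is kept verbatim from the prompt listed above.
    "zero_shot_qa": (
        "The input consists of a sequence of key frames from a video. Answer "
        "the question comprehensively including all the possible verbs and "
        "nouns that can discribe the events, followed by significant events, "
        "characters, or objects that appear throughout the frames."
    ),
}

# Closing cue per benchmark, as in the prompts above.
ANSWER_CUES = {
    "mvbench": "Short Answer:",
    "videochatgpt": "Answer:",
    "zero_shot_qa": "Answer:",
}


def build_prompt(benchmark: str, question: str) -> str:
    """Wrap `question` in the task-specific prompt for `benchmark`."""
    if benchmark not in INSTRUCTIONS:
        raise ValueError(f"unknown benchmark: {benchmark!r}")
    # Drop the dataset's own answer marker so the cue is not duplicated.
    cleaned = question.replace("Short Answer.", "")
    # Assumed separator: a single newline between the three parts.
    return INSTRUCTIONS[benchmark] + "\n" + cleaned + "\n" + ANSWER_CUES[benchmark]


if __name__ == "__main__":
    print(build_prompt("mvbench", "What does the person do first? Short Answer."))
```

Keeping the instruction strings in one dictionary makes it easier to confirm that each benchmark is evaluated with exactly the wording listed above.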
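Quick Start
This repository contains the base version of the model, which does not support chat.
You can install the required Python package dependencies and run model inference by following the instructions in our GitHub repository.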
License
This model is released under the CogVLM2 LICENSE. For models built with Meta Llama 3, please also comply with the LLAMA3_LICENSE.
Citation
We will release the technical report soon. Stay tuned.