InternVL2-1B
[GitHub] [Blog] [InternVL 1.0 Paper] [InternVL 1.5 Report]
[Chat Demo] [HF Demo] [Quick Start] [Chinese Explainer] [ModelScope | Tutorials]
Introduction
We are excited to announce the release of InternVL 2.0, the latest addition to the InternVL series of multimodal large language models. InternVL 2.0 features a variety of instruction-tuned models, ranging from 1 billion to 108 billion parameters. This repository contains the instruction-tuned InternVL2-1B model.
InternVL 2.0 surpasses most open-source multimodal large language models and demonstrates performance competitive with proprietary commercial models across a range of capabilities, including document and chart comprehension, infographics QA, scene-text understanding and OCR, scientific and mathematical problem solving, cultural understanding, and integrated multimodal tasks.
InternVL 2.0 is trained with an 8k context window on training data that includes long texts, multiple images, and videos, significantly improving its ability to handle these input types compared to InternVL 1.5. For more details, please refer to our blog and GitHub.
Model Details
InternVL 2.0 is a multimodal large language model series, featuring models of various sizes. For each size, we release instruction-tuned models optimized for multimodal tasks. InternVL2-1B consists of InternViT-300M-448px, an MLP projector, and Qwen2-0.5B-Instruct.
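For a quick look at how these three components are wired together, you can load the released checkpoint and list its top-level modules. This is a minimal sketch that reuses only the loading call from the Quick Start below; the module names it prints come from the repository's remote code and may differ between releases:

import torch
from transformers import AutoModel

# Load the checkpoint (the remote code defines the InternVL chat model class) and
# print its top-level submodules: vision encoder, MLP projector, and language model.
model = AutoModel.from_pretrained(
    'OpenGVLab/InternVL2-1B',
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).eval()
for name, module in model.named_children():
    print(name, '->', type(module).__name__)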
Performance
Image Benchmarks
| Benchmark | PaliGemma-3B | Mini-InternVL-2B-1-5 | InternVL2-2B | InternVL2-1B |
| :-- | :-- | :-- | :-- | :-- |
| Model Size | 2.9B | 2.2B | 2.2B | 0.9B |
| DocVQA (test) | - | 85.0 | 86.9 | 81.7 |
| ChartQA (test) | - | 74.8 | 76.2 | 72.9 |
| InfoVQA (test) | - | 55.4 | 58.9 | 50.9 |
| TextVQA (val) | 68.1 | 70.5 | 73.4 | 70.5 |
| OCRBench | 614 | 654 | 784 | 754 |
| MME (sum) | 1686.1 | 1901.5 | 1876.8 | 1794.4 |
| RealWorldQA | 55.2 | 57.9 | 57.3 | 50.3 |
| AI2D (test) | 68.3 | 69.8 | 74.1 | 64.1 |
| MMMU (val) | 34.9 | 34.6 / 37.4 | 34.3 / 36.3 | 35.4 / 36.7 |
| MMBench-EN (test) | 71.0 | 70.9 | 73.2 | 65.4 |
| MMBench-CN (test) | 63.6 | 66.2 | 70.9 | 60.7 |
| CCBench (dev) | 29.6 | 63.5 | 74.7 | 75.7 |
| MMVet (GPT-4-0613) | - | 39.3 | 44.6 | 37.8 |
| MMVet (GPT-4-Turbo) | 33.1 | 35.5 | 39.5 | 33.3 |
| SEED-Image | 69.6 | 69.8 | 71.6 | 65.6 |
| HallBench (avg) | 32.2 | 37.5 | 37.9 | 33.4 |
| MathVista (testmini) | 28.7 | 41.1 | 46.3 | 37.7 |
| OpenCompass (avg) | 46.6 | 49.8 | 54.0 | 48.3 |
We use both the InternVL and VLMEvalKit repositories for model evaluation. Specifically, the results reported for DocVQA, ChartQA, InfoVQA, TextVQA, MME, AI2D, MMBench, CCBench, MMVet, and SEED-Image were tested using the InternVL repository. OCRBench, RealWorldQA, HallBench, and MathVista were evaluated using VLMEvalKit.
For MMMU, we report both the original scores (left side: evaluated using the InternVL codebase for InternVL series models, and sourced from technical reports or webpages for other models) and the VLMEvalKit scores (right side: collected from the OpenCompass leaderboard).
Please note that evaluating the same model with different testing toolkits, such as InternVL and VLMEvalKit, can result in slight differences, which is normal. Updates to code versions and variations in environment and hardware can also cause minor discrepancies in results.
Video Benchmarks
| Benchmark | VideoChat2-Phi3 | Mini-InternVL-2B-1-5 | InternVL2-2B | InternVL2-1B |
| :-- | :-- | :-- | :-- | :-- |
| Model Size | 4B | 2.2B | 2.2B | 0.9B |
| MVBench | 55.1 | 37.0 | 60.2 | 57.9 |
| MMBench-Video (8f) | - | 0.99 | 0.97 | 0.95 |
| MMBench-Video (16f) | - | 1.04 | 1.03 | 0.98 |
| Video-MME (w/o subs) | - | 42.9 | 45.0 | 42.6 |
| Video-MME (w subs) | - | 44.7 | 47.3 | 44.7 |
- We evaluate our models on MVBench and Video-MME by extracting 16 frames from each video, with each frame resized to a 448x448 image (see the sketch below).
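For reference, this 16-frame protocol corresponds to calling the load_video helper from the Quick Start section below with num_segments=16 and max_num=1; a minimal sketch, where the video path is a placeholder:

# Minimal sketch of the 16-frame evaluation setting: sample 16 frames, one 448x448 tile each.
# 'video.mp4' is a placeholder path; load_video is defined in the Quick Start section below.
pixel_values, num_patches_list = load_video('video.mp4', num_segments=16, max_num=1)
pixel_values = pixel_values.to(torch.bfloat16)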
Limitations: Although we have made efforts to ensure the safety of the model during the training process and to encourage the model to generate text that complies with ethical and legal requirements, the model may still produce unexpected outputs due to its size and probabilistic generation paradigm. For example, the generated responses may contain biases, discrimination, or other harmful content. Please do not propagate such content. We are not responsible for any consequences resulting from the dissemination of harmful information.
Quick Start
We provide example code to run InternVL2-1B using transformers.
We also welcome you to experience the InternVL2 series models in our online demo. Currently, due to limited GPU resources with public IP addresses, we can only deploy models up to a maximum of 26B. We will expand soon and deploy larger models to the online demo.
Please use transformers==4.37.2 to ensure the model works normally.
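For example, the dependency can be pinned at install time (this assumes a Python environment where torch, torchvision, and decord, used by the video example below, are installed separately):

pip install transformers==4.37.2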
import numpy as np
import torch
import torchvision.transforms as T
from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio
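# dynamic_preprocess implements the dynamic-resolution tiling used by InternVL: it chooses the
# tile grid (up to max_num tiles) whose aspect ratio best matches the input image, resizes the
# image to that grid, splits it into image_size x image_size tiles, and optionally appends a
# downscaled thumbnail of the full image as a global view.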
def dynamic_preprocess(image, min_num=1, max_num=6, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=6):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values
path = 'OpenGVLab/InternVL2-1B'
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

# set the max number of tiles in `max_num`
pixel_values = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()

generation_config = dict(
    num_beams=1,
    max_new_tokens=1024,
    do_sample=False,
)
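# Note: generation_config holds the decoding options used by model.chat(): num_beams=1 with
# do_sample=False means deterministic greedy decoding, and max_new_tokens caps the reply length.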
# pure-text conversation
question = 'Hello, who are you?'
response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')

question = 'Can you tell me a story?'
response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')

# single-image single-round conversation
question = '<image>\nPlease describe the image shortly.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f'User: {question}')
print(f'Assistant: {response}')

# single-image multi-round conversation
question = '<image>\nPlease describe the image in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')

question = 'Please write a poem according to the image.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')
# multi-image multi-round conversation, combined images
pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

question = '<image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=None, return_history=True)

question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=history, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')

# multi-image multi-round conversation, separate images
pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]

question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=None, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')

question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=history, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')
# batch inference, single image per sample
pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
responses = model.batch_chat(tokenizer, pixel_values,
                             num_patches_list=num_patches_list,
                             questions=questions,
                             generation_config=generation_config)
for question, response in zip(questions, responses):
    print(f'User: {question}')
    print(f'Assistant: {response}')
# video multi-round conversation
def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
    if bound:
        start, end = bound[0], bound[1]
    else:
        start, end = -100000, 100000
    start_idx = max(first_idx, round(start * fps))
    end_idx = min(round(end * fps), max_frame)
    seg_size = float(end_idx - start_idx) / num_segments
    frame_indices = np.array([
        int(start_idx + (seg_size / 2) + np.round(seg_size * idx))
        for idx in range(num_segments)
    ])
    return frame_indices
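# load_video samples num_segments frames evenly from the clip (optionally restricted to a
# [start, end] bound in seconds), tiles each frame with dynamic_preprocess, and returns the
# concatenated pixel values plus the number of tiles contributed by each frame.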
def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=32):
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    max_frame = len(vr) - 1
    fps = float(vr.get_avg_fps())

    pixel_values_list, num_patches_list = [], []
    transform = build_transform(input_size=input_size)
    frame_indices = get_index(bound, fps, max_frame, first_idx=0, num_segments=num_segments)
    for frame_index in frame_indices:
        img = Image.fromarray(vr[frame_index].asnumpy()).convert('RGB')
        img = dynamic_preprocess(img, image_size=input_size, use_thumbnail=True, max_num=max_num)
        pixel_values = [transform(tile) for tile in img]
        pixel_values = torch.stack(pixel_values)
        num_patches_list.append(pixel_values.shape[0])
        pixel_values_list.append(pixel_values)
    pixel_values = torch.cat(pixel_values_list)
    return pixel_values, num_patches_list
video_path = './examples/red-panda.mp4'
# pixel_values, num_patches_list = load_video(video_path, num_segments=32, max_num=1)
pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
pixel_values = pixel_values.to(torch.bfloat16).cuda()
video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
question = video_prefix + 'What is the red panda doing?'
# Frame1: <image>\nFrame2: <image>\n...\nFrame8: <image>\n{question}
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=None, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')

question = 'Describe this video in detail. Don\'t repeat.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=history, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')
Streaming output
Besides this method, you can also use the following code to get streamed output.
from transformers import TextIteratorStreamer
from threading import Thread

# Initialize the streamer
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True, timeout=10)
# Define the generation configuration
generation_config = dict(num_beams=1, max_new_tokens=1024, do_sample=False, streamer=streamer)
# Start the model chat in a separate thread
thread = Thread(target=model.chat, kwargs=dict(
    tokenizer=tokenizer, pixel_values=pixel_values, question=question,
    history=None, return_history=False, generation_config=generation_config,
))
thread.start()

# Initialize an empty string to store the generated text
generated_text = ''
# Loop through the streamer to get the new text as it is generated
for new_text in streamer:
    if new_text == model.conv_template.sep:
        break
    generated_text += new_text
    print(new_text, end='', flush=True)  # Print each new chunk of generated text on the same line
Finetune
SWIFT from the ModelScope community supports fine-tuning (image/video) of InternVL. Please check this link for more details.
Deployment
LMDeploy
Warning: This model is not yet supported by LMDeploy.
vLLM
TODO
Ollama
TODO
License
This project is released under the MIT license, while Qwen2 is licensed under the Tongyi Qianwen LICENSE.
Citation
If you find this project useful in your research, please consider citing:
@article{chen2023internvl,
  title={InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks},
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2312.14238},
  year={2023}
}
@article{chen2024far,
  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
  journal={arXiv preprint arXiv:2404.16821},
  year={2024}
}