InternVL2-1B
[GitHub] [Blog] [InternVL 1.0 Paper] [InternVL 1.5 Report]
[Chat Demo] [HF Demo] [Quick Start] [Chinese Explainer] [ModelScope | Tutorials]
Introduction
We are excited to announce the release of InternVL 2.0, the latest addition to the InternVL series of multimodal large language models. InternVL 2.0 features a variety of instruction-tuned models, ranging from 1 billion to 108 billion parameters. This repository contains the instruction-tuned InternVL2-1B model.
InternVL 2.0 surpasses most open-source multimodal large language models and demonstrates performance competitive with proprietary commercial models across a range of capabilities, including document and chart comprehension, infographics QA, scene-text understanding and OCR, scientific and mathematical problem solving, cultural understanding, and integrated multimodal tasks.
InternVL 2.0 is trained with an 8k context window on training data that includes long texts, multiple images, and videos, significantly improving its ability to handle these input types compared to InternVL 1.5. For more details, please refer to our blog and GitHub.
Model Details
InternVL 2.0 is a multimodal large language model series, featuring models of various sizes. For each size, we release instruction-tuned models optimized for multimodal tasks. InternVL2-1B consists of InternViT-300M-448px, an MLP projector, and Qwen2-0.5B-Instruct.
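For a quick look at how these three components are wired together, you can load the released checkpoint and list its top-level modules. This is a minimal sketch that reuses only the loading call from the Quick Start below; the module names it prints come from the repository's remote code and may differ between releases:

import torch
from transformers import AutoModel

# Load the checkpoint (the remote code defines the InternVL chat model class) and
# print its top-level submodules: vision encoder, MLP projector, and language model.
model = AutoModel.from_pretrained(
    'OpenGVLab/InternVL2-1B',
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).eval()
for name, module in model.named_children():
    print(name, '->', type(module).__name__)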
Performance
Image Benchmarks
| Benchmark | PaliGemma-3B | Mini-InternVL-2B-1-5 | InternVL2-2B | InternVL2-1B |
| :-- | :-- | :-- | :-- | :-- |
| Model Size | 2.9B | 2.2B | 2.2B | 0.9B |
| DocVQA (test) | - | 85.0 | 86.9 | 81.7 |
| ChartQA (test) | - | 74.8 | 76.2 | 72.9 |
| InfoVQA (test) | - | 55.4 | 58.9 | 50.9 |
| TextVQA (val) | 68.1 | 70.5 | 73.4 | 70.5 |
| OCRBench | 614 | 654 | 784 | 754 |
| MME (sum) | 1686.1 | 1901.5 | 1876.8 | 1794.4 |
| RealWorldQA | 55.2 | 57.9 | 57.3 | 50.3 |
| AI2D (test) | 68.3 | 69.8 | 74.1 | 64.1 |
| MMMU (val) | 34.9 | 34.6 / 37.4 | 34.3 / 36.3 | 35.4 / 36.7 |
| MMBench-EN (test) | 71.0 | 70.9 | 73.2 | 65.4 |
| MMBench-CN (test) | 63.6 | 66.2 | 70.9 | 60.7 |
| CCBench (dev) | 29.6 | 63.5 | 74.7 | 75.7 |
| MMVet (GPT-4-0613) | - | 39.3 | 44.6 | 37.8 |
| MMVet (GPT-4-Turbo) | 33.1 | 35.5 | 39.5 | 33.3 |
| SEED-Image | 69.6 | 69.8 | 71.6 | 65.6 |
| HallBench (avg) | 32.2 | 37.5 | 37.9 | 33.4 |
| MathVista (testmini) | 28.7 | 41.1 | 46.3 | 37.7 |
| OpenCompass (avg) | 46.6 | 49.8 | 54.0 | 48.3 |
We use both the InternVL and VLMEvalKit repositories for model evaluation. Specifically, the results reported for DocVQA, ChartQA, InfoVQA, TextVQA, MME, AI2D, MMBench, CCBench, MMVet, and SEED-Image were tested using the InternVL repository. OCRBench, RealWorldQA, HallBench, and MathVista were evaluated using VLMEvalKit.
For MMMU, we report both the original scores (left side: evaluated using the InternVL codebase for InternVL series models, and sourced from technical reports or webpages for other models) and the VLMEvalKit scores (right side: collected from the OpenCompass leaderboard).
Please note that evaluating the same model with different testing toolkits, such as InternVL and VLMEvalKit, can result in slight differences, which is normal. Updates to code versions and variations in environment and hardware can also cause minor discrepancies in results.
Video Benchmarks
| Benchmark | VideoChat2-Phi3 | Mini-InternVL-2B-1-5 | InternVL2-2B | InternVL2-1B |
| :-- | :-- | :-- | :-- | :-- |
| Model Size | 4B | 2.2B | 2.2B | 0.9B |
| MVBench | 55.1 | 37.0 | 60.2 | 57.9 |
| MMBench-Video (8f) | - | 0.99 | 0.97 | 0.95 |
| MMBench-Video (16f) | - | 1.04 | 1.03 | 0.98 |
| Video-MME (w/o subs) | - | 42.9 | 45.0 | 42.6 |
| Video-MME (w subs) | - | 44.7 | 47.3 | 44.7 |
- We evaluate our models on MVBench and Video-MME by extracting 16 frames from each video, with each frame resized to a 448x448 image (see the sketch below).
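For reference, this 16-frame protocol corresponds to calling the load_video helper from the Quick Start section below with num_segments=16 and max_num=1; a minimal sketch, where the video path is a placeholder:

# Minimal sketch of the 16-frame evaluation setting: sample 16 frames, one 448x448 tile each.
# 'video.mp4' is a placeholder path; load_video is defined in the Quick Start section below.
pixel_values, num_patches_list = load_video('video.mp4', num_segments=16, max_num=1)
pixel_values = pixel_values.to(torch.bfloat16)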
Limitations: Although we have made efforts to ensure the safety of the model during the training process and to encourage the model to generate text that complies with ethical and legal requirements, the model may still produce unexpected outputs due to its size and probabilistic generation paradigm. For example, the generated responses may contain biases, discrimination, or other harmful content. Please do not propagate such content. We are not responsible for any consequences resulting from the dissemination of harmful information.
Quick Start
We provide example code to run InternVL2-1B using transformers.
We also welcome you to experience the InternVL2 series models in our online demo. Currently, due to limited GPU resources with public IP addresses, we can only deploy models up to a maximum of 26B. We will expand soon and deploy larger models to the online demo.
Please use transformers==4.37.2 to ensure the model works normally.
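For example, the dependency can be pinned at install time (this assumes a Python environment where torch, torchvision, and decord, used by the video example below, are installed separately):

pip install transformers==4.37.2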
import numpy as np
import torch
import torchvision.transforms as T
from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio
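# dynamic_preprocess implements the dynamic-resolution tiling used by InternVL: it chooses the
# tile grid (up to max_num tiles) whose aspect ratio best matches the input image, resizes the
# image to that grid, splits it into image_size x image_size tiles, and optionally appends a
# downscaled thumbnail of the full image as a global view.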
def dynamic_preprocess(image, min_num=1, max_num=6, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=6):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values
path = 'OpenGVLab/InternVL2-1B'
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

# set the max number of tiles in `max_num`
pixel_values = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()

generation_config = dict(
    num_beams=1,
    max_new_tokens=1024,
    do_sample=False,
)
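# Note: generation_config holds the decoding options used by model.chat(): num_beams=1 with
# do_sample=False means deterministic greedy decoding, and max_new_tokens caps the reply length.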
# pure-text conversation
question = 'Hello, who are you?'
response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')

question = 'Can you tell me a story?'
response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')

# single-image single-round conversation
question = '<image>\nPlease describe the image shortly.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f'User: {question}')
print(f'Assistant: {response}')

# single-image multi-round conversation
question = '<image>\nPlease describe the image in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')

question = 'Please write a poem according to the image.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')
# multi-image multi-round conversation, combined images
pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

question = '<image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=None, return_history=True)

question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=history, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')

# multi-image multi-round conversation, separate images
pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]

question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=None, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')

question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=history, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')
# batch inference, single image per sample
pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
responses = model.batch_chat(tokenizer, pixel_values,
                             num_patches_list=num_patches_list,
                             questions=questions,
                             generation_config=generation_config)
for question, response in zip(questions, responses):
    print(f'User: {question}')
    print(f'Assistant: {response}')
# video multi-round conversation
def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
    if bound:
        start, end = bound[0], bound[1]
    else:
        start, end = -100000, 100000
    start_idx = max(first_idx, round(start * fps))
    end_idx = min(round(end * fps), max_frame)
    seg_size = float(end_idx - start_idx) / num_segments
    frame_indices = np.array([
        int(start_idx + (seg_size / 2) + np.round(seg_size * idx))
        for idx in range(num_segments)
    ])
    return frame_indices
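# load_video samples num_segments frames evenly from the clip (optionally restricted to a
# [start, end] bound in seconds), tiles each frame with dynamic_preprocess, and returns the
# concatenated pixel values plus the number of tiles contributed by each frame.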
def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=32):
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    max_frame = len(vr) - 1
    fps = float(vr.get_avg_fps())

    pixel_values_list, num_patches_list = [], []
    transform = build_transform(input_size=input_size)
    frame_indices = get_index(bound, fps, max_frame, first_idx=0, num_segments=num_segments)
    for frame_index in frame_indices:
        img = Image.fromarray(vr[frame_index].asnumpy()).convert('RGB')
        img = dynamic_preprocess(img, image_size=input_size, use_thumbnail=True, max_num=max_num)
        pixel_values = [transform(tile) for tile in img]
        pixel_values = torch.stack(pixel_values)
        num_patches_list.append(pixel_values.shape[0])
        pixel_values_list.append(pixel_values)
    pixel_values = torch.cat(pixel_values_list)
    return pixel_values, num_patches_list
video_path = './examples/red-panda.mp4'
# pixel_values, num_patches_list = load_video(video_path, num_segments=32, max_num=1)
pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
pixel_values = pixel_values.to(torch.bfloat16).cuda()
video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
question = video_prefix + 'What is the red panda doing?'
# Frame1: <image>\nFrame2: <image>\n...\nFrame8: <image>\n{question}
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=None, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')

question = 'Describe this video in detail. Don\'t repeat.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=history, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')
Streaming output
Besides this method, you can also use the following code to get streamed output.
from transformers import TextIteratorStreamer
from threading import Thread

# Initialize the streamer
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True, timeout=10)
# Define the generation configuration
generation_config = dict(num_beams=1, max_new_tokens=1024, do_sample=False, streamer=streamer)
# Start the model chat in a separate thread
thread = Thread(target=model.chat, kwargs=dict(
    tokenizer=tokenizer, pixel_values=pixel_values, question=question,
    history=None, return_history=False, generation_config=generation_config,
))
thread.start()

# Initialize an empty string to store the generated text
generated_text = ''
# Loop through the streamer to get the new text as it is generated
for new_text in streamer:
    if new_text == model.conv_template.sep:
        break
    generated_text += new_text
    print(new_text, end='', flush=True)  # Print each new chunk of generated text on the same line
Finetune
SWIFT from the ModelScope community supports fine-tuning (image/video) of InternVL. Please check this link for more details.
Deployment
LMDeploy
Warning: This model is not yet supported by LMDeploy.
vLLM
TODO
Ollama
TODO
License
This project is released under the MIT license, while Qwen2 is licensed under the Tongyi Qianwen LICENSE.
Citation
If you find this project useful in your research, please consider citing:
@article{chen2023internvl,
  title={InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks},
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2312.14238},
  year={2023}
}
@article{chen2024far,
  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
  journal={arXiv preprint arXiv:2404.16821},
  year={2024}
}