OpenFlamingo: a framework for training large multimodal models (LMMs)

Source repository: https://github.com/mlfoundations/open_flamingo
License: Unknown

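At its core, OpenFlamingo is a framework for training and evaluating large multimodal models (LMMs). It is an open-source reproduction of DeepMind's Flamingo model.

It includes the following:

- A Python framework for training Flamingo-style LMMs (based on Lucidrains' flamingo implementation and David Hansmair's flamingo-mini repository).
- A large-scale multimodal dataset with interleaved image and text sequences.
- An in-context learning evaluation benchmark for vision-language tasks.
- A first version of the OpenFlamingo-9B model (based on LLaMA).

OpenFlamingo uses cross-attention layers to fuse a pretrained vision encoder with a language model.

[Figure: OpenFlamingo architecture]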

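Installation

To install the package in an existing environment, run

    pip install open-flamingo

or, to create a conda environment for running OpenFlamingo, run

    conda env create -f environment.yml

Usage

We provide an initial OpenFlamingo 9B model that uses a CLIP ViT-Large vision encoder and a LLaMA-7B language model. In general, any CLIP vision encoder is supported. For the language model, LLaMA, OPT, GPT-Neo, GPT-J, and Pythia models are supported.

Note: to use LLaMA models, you need to install the latest version of transformers with

    pip install git+https://github.com/huggingface/transformers

The LLaMA weights must then be converted to the Hugging Face format with the conversion script that ships with transformers.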

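Initializing an OpenFlamingo model

    from open_flamingo import create_model_and_transforms

    model, image_processor, tokenizer = create_model_and_transforms(
        clip_vision_encoder_path="ViT-L-14",
        clip_vision_encoder_pretrained="openai",
        lang_encoder_path="",  # set to the path of the LLaMA weights in Hugging Face format
        tokenizer_path="",     # set to the path of the LLaMA tokenizer in Hugging Face format
        cross_attn_every_n_layers=4,
    )

    # Grab the model checkpoint from the Hugging Face Hub
    from huggingface_hub import hf_hub_download
    import torch

    checkpoint_path = hf_hub_download("openflamingo/OpenFlamingo-9B", "checkpoint.pt")
    # strict=False tolerates keys that are missing from, or extra in, the checkpoint
    model.load_state_dict(torch.load(checkpoint_path), strict=False)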
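The checkpoint is loaded with strict=False, which tells PyTorch to tolerate mismatches between the parameter names in the checkpoint and those in the model. As a rough illustration of what is being tolerated, the sketch below compares two sets of key names in plain Python. The function compare_keys and all key names are hypothetical and chosen only for illustration; they are not taken from the OpenFlamingo checkpoint.

```python
def compare_keys(model_keys, checkpoint_keys):
    """Return (missing, unexpected) key lists, similar in spirit to what a
    non-strict state-dict load would skip over rather than raise on."""
    model_keys = set(model_keys)
    checkpoint_keys = set(checkpoint_keys)
    missing = sorted(model_keys - checkpoint_keys)      # in model, absent from checkpoint
    unexpected = sorted(checkpoint_keys - model_keys)   # in checkpoint, unknown to model
    return missing, unexpected


# Hypothetical key names, for illustration only.
model_keys = [
    "vision_encoder.proj",
    "lang_encoder.layer0.weight",
    "lang_encoder.gated_cross_attn.0.weight",
]
checkpoint_keys = [
    "lang_encoder.gated_cross_attn.0.weight",
    "optimizer_step",
]

missing, unexpected = compare_keys(model_keys, checkpoint_keys)
print("missing:", missing)
print("unexpected:", unexpected)
```

With a strict load, either list being non-empty would be an error; with a non-strict load, the matching keys are loaded and the rest are left as they were.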

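Generating text

Here is an example of generating text conditioned on interleaved images and text. In this case, the task is few-shot image captioning.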

    from PIL import Image
    import requests
    import torch

    """
    Step 1: Load images
    """
    demo_image_one = Image.open(
        requests.get(
            "https://images.cocodataset.org/val2017/000000039769.jpg", stream=True
        ).raw
    )

    demo_image_two = Image.open(
        requests.get(
            "https://images.cocodataset.org/test-stuff2017/000000028137.jpg", stream=True
        ).raw
    )

    query_image = Image.open(
        requests.get(
            "https://images.cocodataset.org/test-stuff2017/000000028352.jpg", stream=True
        ).raw
    )

    """
    Step 2: Preprocessing images
    Details: For OpenFlamingo, we expect the image to be a torch tensor of shape
    batch_size x num_media x num_frames x channels x height x width.
    In this case batch_size = 1, num_media = 3, num_frames = 1
    (this will always be one except for video, which we don't support yet),
    channels = 3, height = 224, width = 224.
    """
    vision_x = [
        image_processor(demo_image_one).unsqueeze(0),
        image_processor(demo_image_two).unsqueeze(0),
        image_processor(query_image).unsqueeze(0),
    ]
    vision_x = torch.cat(vision_x, dim=0)
    vision_x = vision_x.unsqueeze(1).unsqueeze(0)

    """
    Step 3: Preprocessing text
    Details: In the text we expect an <image> special token to indicate where an image is.
    We also expect an <|endofchunk|> special token to indicate the end of the text
    portion associated with an image.
    """
    tokenizer.padding_side = "left"  # For generation, padding tokens should be on the left
    lang_x = tokenizer(
        ["<image>An image of two cats.<|endofchunk|><image>An image of a bathroom sink.<|endofchunk|><image>An image of"],
        return_tensors="pt",
    )

    """
    Step 4: Generate text
    """
    generated_text = model.generate(
        vision_x=vision_x,
        lang_x=lang_x["input_ids"],
        attention_mask=lang_x["attention_mask"],
        max_new_tokens=20,
        num_beams=3,
    )

    print("Generated text: ", tokenizer.decode(generated_text[0]))
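The shape bookkeeping in Step 2 can be traced without loading a model or any images. The sketch below tracks only tensor shapes as plain tuples. The helpers unsqueeze, cat, and vision_x_shape are hypothetical stand-ins that mimic the shape effect of the corresponding torch operations, and the per-image shape of (3, 224, 224) is an assumption taken from the values quoted in the example.

```python
def unsqueeze(shape, dim):
    """Shape after inserting a size-1 axis at position `dim`."""
    return shape[:dim] + (1,) + shape[dim:]


def cat(shapes, dim):
    """Shape after concatenating along `dim`; all other axes must match."""
    first = shapes[0]
    for s in shapes[1:]:
        if s[:dim] + s[dim + 1:] != first[:dim] + first[dim + 1:]:
            raise ValueError("shapes differ outside the concatenation axis")
    total = sum(s[dim] for s in shapes)
    return first[:dim] + (total,) + first[dim + 1:]


def vision_x_shape(num_images, channels=3, height=224, width=224):
    """Follow the Step 2 operations on shapes only."""
    per_image = (channels, height, width)               # assumed image_processor output
    batched = [unsqueeze(per_image, 0)] * num_images    # each (1, C, H, W)
    stacked = cat(batched, 0)                           # (num_images, C, H, W)
    with_frames = unsqueeze(stacked, 1)                 # (num_images, 1, C, H, W)
    return unsqueeze(with_frames, 0)                    # (1, num_images, 1, C, H, W)


print(vision_x_shape(3))  # (1, 3, 1, 3, 224, 224)
```

Reading the result left to right gives batch_size = 1, num_media = 3, num_frames = 1, channels = 3, height = 224, width = 224, which matches the layout described in Step 2.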

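Approach

OpenFlamingo is a multimodal language model that can be used for a variety of tasks. It is trained on a large multimodal dataset (e.g. Multimodal C4) and can generate text conditioned on interleaved images and text. For example, OpenFlamingo can generate a caption for an image, or generate a question given an image and a text passage. The benefit of this approach is that the model can be adapted to new tasks quickly through in-context learning, that is, by placing a few worked examples in the prompt.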
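To make the in-context idea concrete, a few-shot prompt can be assembled from demonstration captions plus an unfinished query. This is a minimal sketch, not part of the OpenFlamingo package: build_few_shot_prompt is a hypothetical helper, the <|endofchunk|> marker comes from the captioning example, and the <image> marker is assumed from the upstream project.

```python
IMAGE_TOKEN = "<image>"           # assumed marker for an image position
END_OF_CHUNK = "<|endofchunk|>"   # marks the end of the text tied to one image


def build_few_shot_prompt(demo_captions, query_prefix):
    """Interleave one image marker per demonstration caption, then append
    the query image marker with an unfinished caption for the model to complete."""
    parts = [f"{IMAGE_TOKEN}{caption}{END_OF_CHUNK}" for caption in demo_captions]
    parts.append(f"{IMAGE_TOKEN}{query_prefix}")
    return "".join(parts)


prompt = build_few_shot_prompt(
    ["An image of two cats.", "An image of a bathroom sink."],
    "An image of",
)
print(prompt)
```

The number of image markers in the prompt should equal the number of images passed to the model (three in the captioning example), so changing the task amounts to changing the demonstrations rather than the weights.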

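Model architecture

OpenFlamingo seeks to fuse a pretrained vision encoder and a language model using cross-attention layers.

[Figure: model architecture]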
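The cross_attn_every_n_layers argument controls how densely the cross-attention layers are interleaved with the language model's decoder layers. The sketch below lists which decoder layers would receive one under an assumed placement rule (after every n-th layer, zero-based index i with (i + 1) divisible by n). Both the rule and the helper name are assumptions for illustration and may not match the library's exact placement; the layer count of 32 is that of LLaMA-7B.

```python
def cross_attention_layer_indices(num_decoder_layers, every_n):
    """Zero-based decoder layer indices that get a cross-attention block,
    assuming one block after every `every_n`-th layer."""
    if every_n <= 0:
        raise ValueError("every_n must be a positive integer")
    return [i for i in range(num_decoder_layers) if (i + 1) % every_n == 0]


layers = cross_attention_layer_indices(num_decoder_layers=32, every_n=4)
print(layers)       # [3, 7, 11, 15, 19, 23, 27, 31]
print(len(layers))  # 8
```

Under this assumption, a setting of 4 adds 8 cross-attention blocks to a 32-layer decoder; a smaller value adds more blocks, and with them more new parameters to train.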
