blip-itm-base-coco


Technical Information

Repository
https://modelscope.cn/models/thomas/blip-itm-base-coco
License
bsd-3-clause

Model Details

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

Model card for BLIP trained on image-text matching - base architecture (with ViT base backbone) trained on the COCO dataset.

[Figure: BLIP.gif, pulled from the official BLIP repo]

TL;DR

Authors from the paper write in the abstract:

Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner. Code, models, and datasets are released.

Usage

You can use this model for image-text matching, i.e. scoring how well a caption matches an image.

Using the PyTorch model

Running the model on CPU


import requests
from PIL import Image
from transformers import BlipProcessor, BlipForImageTextRetrieval

processor = BlipProcessor.from_pretrained("Salesforce/blip-itm-base-coco")
model = BlipForImageTextRetrieval.from_pretrained("Salesforce/blip-itm-base-coco")

# load the demo image
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# caption to score against the image
question = "A woman and a dog sitting together in a beach."
inputs = processor(raw_image, question, return_tensors="pt")

itm_scores = model(**inputs)[0]                         # ITM head logits
cosine_score = model(**inputs, use_itm_head=False)[0]   # image-text cosine similarity
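
The first forward call returns the ITM head's two logits per image-text pair (no-match vs. match), so a softmax over them gives an interpretable match probability; passing use_itm_head=False skips the head and returns the raw image-text cosine similarity instead. A minimal sketch of reading both outputs, continuing from the snippet above (the output shapes are our reading of the transformers implementation, not something the card states):

import torch

with torch.no_grad():
    itm_logits = model(**inputs)[0]                       # shape (1, 2): [no-match, match] logits
    match_prob = torch.softmax(itm_logits, dim=1)[:, 1]   # probability that the caption matches
    cosine = model(**inputs, use_itm_head=False)[0]       # raw similarity score
print(f"match probability: {match_prob.item():.3f}")
print(f"cosine similarity: {cosine.item():.3f}")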

Running the model on GPU

In full precision


import requests
from PIL import Image
from transformers import BlipProcessor, BlipForImageTextRetrieval

processor = BlipProcessor.from_pretrained("Salesforce/blip-itm-base-coco")
model = BlipForImageTextRetrieval.from_pretrained("Salesforce/blip-itm-base-coco").to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "A woman and a dog sitting together in a beach."
inputs = processor(raw_image, question, return_tensors="pt").to("cuda")

itm_scores = model(**inputs)[0]
cosine_score = model(**inputs, use_itm_head=False)[0]

In half precision (float16)


import torch
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForImageTextRetrieval

processor = BlipProcessor.from_pretrained("Salesforce/blip-itm-base-coco")
model = BlipForImageTextRetrieval.from_pretrained("Salesforce/blip-itm-base-coco", torch_dtype=torch.float16).to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "A woman and a dog sitting together in a beach."
inputs = processor(raw_image, question, return_tensors="pt").to("cuda", torch.float16)

itm_scores = model(**inputs)[0]
cosine_score = model(**inputs, use_itm_head=False)[0]
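
For retrieval-style ranking, several candidate captions can be scored against one image in a single batch. The sketch below (reusing processor, model, and raw_image from the CPU example) repeats the image once per caption so the text and image batch sizes line up; the candidate captions are made up for illustration:

import torch

texts = [
    "A woman and a dog sitting together in a beach.",
    "Two cats lying on a sofa.",
]
# repeat the image so pixel_values and input_ids have matching batch sizes
inputs = processor(images=[raw_image] * len(texts), text=texts,
                   return_tensors="pt", padding=True)

with torch.no_grad():
    itm_logits = model(**inputs)[0]                        # shape (len(texts), 2)
    match_probs = torch.softmax(itm_logits, dim=1)[:, 1]   # match probability per caption
for text, p in zip(texts, match_probs.tolist()):
    print(f"{p:.3f}  {text}")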

BibTex and citation info

@misc{https://doi.org/10.48550/arxiv.2201.12086,
  doi = {10.48550/ARXIV.2201.12086},
  url = {https://arxiv.org/abs/2201.12086},
  author = {Li, Junnan and Li, Dongxu and Xiong, Caiming and Hoi, Steven},
  keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences, FOS: Computer and information sciences},
  title = {BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}

