# BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

Model card for BLIP trained on image-text matching - base architecture (with ViT base backbone) trained on the COCO dataset.

[Pull figure from BLIP official repo]

## TL;DR

Authors from the paper write in the abstract:

*Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner. Code, models, and datasets are released.*

## Usage

You can use this model for image-text matching: given an image and a caption, the model scores how well the caption describes the image.
### Using the PyTorch model
#### Running the model on CPU
```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForImageTextRetrieval

# Load the processor and the image-text matching model from the Hub
processor = BlipProcessor.from_pretrained("Salesforce/blip-itm-base-coco")
model = BlipForImageTextRetrieval.from_pretrained("Salesforce/blip-itm-base-coco")

# Fetch a demo image
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Caption to score against the image
question = "A woman and a dog sitting together in a beach."
inputs = processor(raw_image, question, return_tensors="pt")

# ITM head: logits over (no-match, match)
itm_scores = model(**inputs)[0]
# Skip the ITM head and return the raw image-text cosine similarity instead
cosine_score = model(**inputs, use_itm_head=False)[0]
```
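The ITM head returns a pair of logits per image-text pair; a softmax over them yields a match probability, while `use_itm_head=False` returns the image-text cosine similarity directly. A minimal sketch of interpreting the scores above, assuming the variables from the previous snippet and the `[no-match, match]` logit ordering used by the `transformers` implementation:

```python
import torch

# Softmax over the two ITM logits; index 1 is the "match" class
itm_probability = torch.softmax(itm_scores, dim=1)[:, 1].item()
print(f"Match probability: {itm_probability:.4f}")

# The cosine score is already a similarity in [-1, 1]
print(f"Cosine similarity: {cosine_score.item():.4f}")
```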
#### Running the model on GPU
##### In full precision
```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForImageTextRetrieval

processor = BlipProcessor.from_pretrained("Salesforce/blip-itm-base-coco")
# Move the model to the GPU
model = BlipForImageTextRetrieval.from_pretrained("Salesforce/blip-itm-base-coco").to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "A woman and a dog sitting together in a beach."
# Move the inputs to the same device as the model
inputs = processor(raw_image, question, return_tensors="pt").to("cuda")

itm_scores = model(**inputs)[0]
cosine_score = model(**inputs, use_itm_head=False)[0]
```
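Because the ITM head scores an (image, text) pair, you can rank several candidate captions against one image in a single batched forward pass. A hedged sketch building on the GPU setup above; the candidate captions and the batching pattern are illustrative assumptions, not part of the original card:

```python
import torch

# Hypothetical candidate captions to rank against the demo image
candidates = [
    "A woman and a dog sitting together in a beach.",
    "Two cats sleeping on a sofa.",
]

# Pair the same image with every candidate caption
inputs = processor([raw_image] * len(candidates), candidates,
                   return_tensors="pt", padding=True).to("cuda")
logits = model(**inputs)[0]                       # shape: (len(candidates), 2)
match_probs = torch.softmax(logits, dim=1)[:, 1]  # per-caption match probability
best = candidates[int(match_probs.argmax())]
print(f"Best matching caption: {best}")
```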
##### In half precision (`float16`)
```python
import torch
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForImageTextRetrieval

processor = BlipProcessor.from_pretrained("Salesforce/blip-itm-base-coco")
# Load the weights in float16 and move the model to the GPU
model = BlipForImageTextRetrieval.from_pretrained("Salesforce/blip-itm-base-coco",
                                                  torch_dtype=torch.float16).to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "A woman and a dog sitting together in a beach."
# Cast the inputs to float16 as well
inputs = processor(raw_image, question, return_tensors="pt").to("cuda", torch.float16)

itm_scores = model(**inputs)[0]
cosine_score = model(**inputs, use_itm_head=False)[0]
```
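Half precision roughly halves the parameter memory relative to float32, usually with little effect on the matching scores. A small sketch, assuming the model loaded above, to verify the dtype and estimate the parameter footprint (the printed figure is approximate and excludes activations and buffers):

```python
# Confirm the weights are in half precision
print(model.dtype)  # torch.float16

# Approximate parameter memory
param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"~{param_bytes / 1024**2:.0f} MiB of parameters")
```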
## BibTex and citation info
```
@misc{https://doi.org/10.48550/arxiv.2201.12086,
  doi = {10.48550/ARXIV.2201.12086},
  url = {https://arxiv.org/abs/2201.12086},
  author = {Li, Junnan and Li, Dongxu and Xiong, Caiming and Hoi, Steven},
  keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences, FOS: Computer and information sciences},
  title = {BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}
```