SigLIP model pre-trained on WebLI at resolution 384x384. It was introduced in the paper Sigmoid Loss for Language Image Pre-Training by Zhai et al. and first released in this repository. This model has the SoViT-400m architecture, which is the shape-optimized version as presented in Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design by Alabdulmohsin et al.

Disclaimer: The team releasing SigLIP did not write a model card for this model, so this model card has been written by the Hugging Face team.

Model description

SigLIP is CLIP, a multimodal model, with a better loss function. The sigmoid loss operates solely on image-text pairs and does not require a global view of the pairwise similarities for normalization. This allows further scaling up the batch size, while also performing better at smaller batch sizes. A TLDR of SigLIP by one of the authors can be found here.
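To make this concrete, below is a minimal sketch of the pairwise sigmoid loss from the paper, assuming L2-normalized embeddings and learned scalar temperature and bias (the function name and arguments here are illustrative, not the actual training code):

import torch
import torch.nn.functional as F

def sigmoid_loss(image_embeds, text_embeds, temperature, bias):
    # image_embeds, text_embeds: (batch, dim), L2-normalized
    logits = image_embeds @ text_embeds.t() * temperature + bias
    # labels are +1 on the diagonal (matched pairs) and -1 elsewhere
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1
    # each pair is an independent binary classification, so no batch-wide
    # softmax normalization over the similarity matrix is needed
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)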
Intended uses & limitations

You can use the raw model for tasks like zero-shot image classification and image-text retrieval. See the model hub to look for other versions on a task that interests you.

How to use

Here is how to use this model to perform zero-shot image classification:
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch

# load model and processor
model = AutoModel.from_pretrained("google/siglip-so400m-patch14-384")
processor = AutoProcessor.from_pretrained("google/siglip-so400m-patch14-384")

# download an example image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# important: pad to "max_length", as that is how the model was trained
texts = ["a photo of 2 cats", "a photo of 2 dogs"]
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probs = torch.sigmoid(logits_per_image)  # these are the probabilities
print(f"{probs[0][0]:.1%} that image 0 is '{texts[0]}'")
Alternatively, one can leverage the pipeline API which abstracts away the complexity for the user:

from transformers import pipeline
from PIL import Image
import requests

# load pipe
image_classifier = pipeline(task="zero-shot-image-classification", model="google/siglip-so400m-patch14-384")

# load image
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# inference
outputs = image_classifier(image, candidate_labels=["2 cats", "a plane", "a remote"])
outputs = [{"score": round(output["score"], 4), "label": output["label"]} for output in outputs]
print(outputs)
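The same model can also be used for image-text retrieval by comparing embeddings directly. Below is a minimal sketch, assuming a small pool of candidate images and using the model's get_image_features and get_text_features helpers (the query text and candidate list are illustrative):

import torch
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel

model = AutoModel.from_pretrained("google/siglip-so400m-patch14-384")
processor = AutoProcessor.from_pretrained("google/siglip-so400m-patch14-384")

# illustrative candidate pool (here a single COCO image)
urls = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
images = [Image.open(requests.get(u, stream=True).raw) for u in urls]

with torch.no_grad():
    image_inputs = processor(images=images, return_tensors="pt")
    image_embeds = model.get_image_features(**image_inputs)
    text_inputs = processor(text=["two cats lying on a couch"], padding="max_length", return_tensors="pt")
    text_embeds = model.get_text_features(**text_inputs)

# rank candidates by cosine similarity to the query text
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
similarity = text_embeds @ image_embeds.t()
print("best match:", urls[similarity.argmax().item()])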
For more code examples, we refer to the documentation.

Training procedure

Training data

SigLIP is pre-trained on the WebLI dataset (Chen et al., 2023).

Preprocessing

Images are resized/rescaled to the same resolution (384x384) and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5). Texts are tokenized and padded to the same length (64 tokens).
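The processor applies this preprocessing for you, but as a sketch, the image transform corresponds roughly to the following (assuming torchvision; exact resampling details may differ):

from torchvision import transforms

# resize to 384x384, scale pixel values to [0, 1], then normalize
# each RGB channel with mean 0.5 and std 0.5
preprocess = transforms.Compose([
    transforms.Resize((384, 384)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])

The 64-token text padding is why the usage example above passes padding="max_length" to the processor.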
Compute

The model was trained on 16 TPU-v4 chips for three days.

Evaluation results

Evaluation of SigLIP compared to CLIP is shown below (taken from the paper).

[Figure: evaluation of SigLIP (shape-optimized model) compared to CLIP]
BibTeX entry and citation info
@misc{zhai2023sigmoid,
      title={Sigmoid Loss for Language Image Pre-Training},
      author={Xiaohua Zhai and Basil Mustafa and Alexander Kolesnikov and Lucas Beyer},
      year={2023},
      eprint={2303.15343},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}