Japanese Stable CLIP ViT-L/16
Model Details
Japanese Stable CLIP is a Japanese CLIP (Contrastive Language-Image Pre-Training) model that maps both Japanese text and images into the same embedding space. On its own, the model is capable of tasks such as zero-shot image classification and text-to-image retrieval. Furthermore, when combined with other components, it can be used as part of generative models, such as image-to-text and text-to-image generation.
Example code
from typing import Union, List
import ftfy, html, re, io
import requests
from PIL import Image
import torch
from modelscope import AutoModel, AutoTokenizer, AutoImageProcessor, BatchFeature
def basic_clean(text):
    # Fix mojibake and unescape HTML entities before tokenization
    text = ftfy.fix_text(text)
    text = html.unescape(html.unescape(text))
    return text.strip()

def whitespace_clean(text):
    # Collapse runs of whitespace into single spaces
    text = re.sub(r"\s+", " ", text)
    text = text.strip()
    return text
def tokenize(
    tokenizer,
    texts: Union[str, List[str]],
    max_seq_len: int = 77,
):
    if isinstance(texts, str):
        texts = [texts]
    texts = [whitespace_clean(basic_clean(text)) for text in texts]
    inputs = tokenizer(
        texts,
        max_length=max_seq_len - 1,  # reserve one position for the BOS token
        padding="max_length",
        truncation=True,
        add_special_tokens=False,
    )
    # add bos token at first place
    input_ids = [[tokenizer.bos_token_id] + ids for ids in inputs["input_ids"]]
    attention_mask = [[1] + am for am in inputs["attention_mask"]]
    position_ids = [list(range(0, len(input_ids[0])))] * len(texts)
    return BatchFeature(
        {
            "input_ids": torch.tensor(input_ids, dtype=torch.long),
            "attention_mask": torch.tensor(attention_mask, dtype=torch.long),
            "position_ids": torch.tensor(position_ids, dtype=torch.long),
        }
    )
device = "cuda" if torch.cuda.is_available() else "cpu"
model_path = "AI-ModelScope/japaese-stable-clip-vit-l-16"
model = AutoModel.from_pretraied(model_path, trust_remote_code=True).to(device)
tokeizer = AutoTokeizer.from_pretraied(model_path)
processor = AutoImageProcessor.from_pretraied(model_path)
# Ru!
image = Image.ope(io.BytesIO(requests.get('https://images.pexels.com/photos/2253275/pexels-photo-2253275.jpeg?auto=compress&cs=tiysrgb&dpr=3&h=750&w=1260').cotet))
image = processor(images=image, retur_tesors="pt").to(device)
text = tokeize(
tokeizer=tokeizer,
texts=["犬", "猫", "象"],
).to(device)
with torch.o_grad():
image_features = model.get_image_features(**image)
text_features = model.get_text_features(**text)
text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
prit("Label probs:", text_probs)
Usage
pip install ftfy pillow requests transformers torch sentencepiece protobuf
Model Details

| Model | ImageNet top-1 accuracy* |
|---|---|
| Japanese Stable CLIP ViT-L/16 | 62.06 |
| rinna/japanese-cloob-vit-b-16 | 54.64 |
| laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k | 53 |
| rinna/japanese-clip-vit-b-16 | 50.69 |

* Computed scores based on https://github.com/rinnakk/japanese-clip.
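The scores above come from zero-shot classification: each ImageNet class name is embedded as text, and an image is assigned to the class whose embedding is most similar. Below is a rough sketch of that procedure, assuming a hypothetical `japanese_class_names` list and an `imagenet_loader` yielding (pixel, label) batches; the exact prompts and evaluation code live in the linked rinnakk/japanese-clip repository.

# Zero-shot ImageNet accuracy sketch (assumes `model`, `tokenizer`, `tokenize`,
# `device` from above, plus hypothetical `japanese_class_names` and `imagenet_loader`).
import torch.nn.functional as F

with torch.no_grad():
    class_text = tokenize(tokenizer=tokenizer, texts=japanese_class_names).to(device)
    class_emb = F.normalize(model.get_text_features(**class_text), dim=-1)  # (num_classes, D)

    correct = total = 0
    for pixel_values, labels in imagenet_loader:
        img_emb = F.normalize(model.get_image_features(pixel_values=pixel_values.to(device)), dim=-1)
        preds = (img_emb @ class_emb.T).argmax(dim=-1)  # nearest class embedding
        correct += (preds.cpu() == labels).sum().item()
        total += labels.numel()

print("top-1 accuracy:", correct / total)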
Training
The model uses a ViT-L/16 Transformer architecture as an image encoder and a 12-layer BERT as a text encoder with the Japanese tokenizer from rinna/japanese-roberta-base. During training, the image encoder was initialized from the AugReg [vit-large-patch16-224](https://huggingface.co/timm/vit_large_patch16_224.augreg_in21k_ft_in1k) model and we applied SigLIP (Sigmoid loss for Language-Image Pre-training).
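SigLIP replaces the usual softmax contrastive loss with an independent sigmoid loss over every image-text pair in the batch. The following is a minimal sketch of that objective, assuming batched, L2-normalized `image_features` and `text_features` and hypothetical learnable scalars `t` (temperature) and `b` (bias); it illustrates the loss, and is not the training code used for this model.

# SigLIP-style pairwise sigmoid loss sketch (illustrative, not the actual training code).
import torch
import torch.nn.functional as F

def siglip_loss(image_features, text_features, t, b):
    # image_features, text_features: (N, D), assumed L2-normalized; t, b: learnable scalars
    logits = image_features @ text_features.T * t + b                  # (N, N) pairwise logits
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1   # +1 on the diagonal, -1 elsewhere
    # -log sigmoid(label * logit) over all pairs, normalized by batch size as in the SigLIP paper
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)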
Training Dataset
The training dataset includes the following public datasets:
Use and Limitations
Intended Use
This model is intended to be used by the open-source community in vision-language applications.
Limitations and bias
The training dataset may have contained offensive or inappropriate content even though we applied data filters. We recommend users exercise reasonable caution when using these models in production systems. Do not use the model for any applications that may cause harm or distress to individuals or groups.
How to cite
@misc{JapaneseStableCLIP,
    url = {[https://huggingface.co/stabilityai/japanese-stable-clip-vit-l-16](https://huggingface.co/stabilityai/japanese-stable-clip-vit-l-16)},
    title = {Japanese Stable CLIP ViT-L/16},
    author = {Shing, Makoto and Akiba, Takuya}
}
Contact