japanese-stable-clip-vit-l-16

Anonymous user, July 31, 2024
Categories: ai, japanese_stable_clip, pytorch, japanese-stable-clip, clip
Source: https://modelscope.cn/models/AI-ModelScope/japanese-stable-clip-vit-l-16
License: other

Project details

Japanese Stable CLIP ViT-L/16

Model Details

Japanese Stable CLIP is a Japanese CLIP (Contrastive Language-Image Pre-Training) model that maps both Japanese text and images into the same embedding space. On its own, the model can perform tasks such as zero-shot image classification and text-to-image retrieval. Combined with other components, it can also serve as part of generative models, for example for image-to-text and text-to-image generation.
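
As a rough illustration of text-to-image retrieval in a shared embedding space, the sketch below ranks a few images against one text query by cosine similarity. The vectors are made-up 3-dimensional stand-ins rather than outputs of this model, and `cosine_similarity` and `rank_images` are hypothetical helper names; in practice the embeddings would come from the model's text and image encoders.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_images(text_embedding, image_embeddings):
    """Return (image_name, similarity) pairs sorted from best to worst match."""
    scored = [
        (name, cosine_similarity(text_embedding, emb))
        for name, emb in image_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy embeddings (assumption: invented values, for illustration only)
image_embeddings = {
    "dog.jpg": [0.9, 0.1, 0.0],
    "cat.jpg": [0.1, 0.9, 0.0],
    "elephant.jpg": [0.0, 0.1, 0.9],
}
query_embedding = [0.8, 0.2, 0.1]  # stand-in for the embedding of the text "犬"

ranking = rank_images(query_embedding, image_embeddings)
print(ranking[0][0])  # name of the best-matching image
```

Zero-shot classification is the same comparison in the other direction: one image embedding is scored against several text embeddings, one per candidate label.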

Example code

from typing import Union, List
import ftfy, html, re, io
import requests
from PIL import Image
import torch
from modelscope import AutoModel, AutoTokenizer, AutoImageProcessor, BatchFeature

def basic_clean(text):
    text = ftfy.fix_text(text)
    text = html.unescape(html.unescape(text))
    return text.strip()

def whitespace_clean(text):
    text = re.sub(r"\s+", " ", text)
    text = text.strip()
    return text

def tokenize(
    tokenizer,
    texts: Union[str, List[str]],
    max_seq_len: int = 77,
):
    if isinstance(texts, str):
        texts = [texts]
    texts = [whitespace_clean(basic_clean(text)) for text in texts]

    inputs = tokenizer(
        texts,
        max_length=max_seq_len - 1,
        padding="max_length",
        truncation=True,
        add_special_tokens=False,
    )
    # Prepend the BOS token; with max_length=max_seq_len - 1 above,
    # every sequence ends up exactly max_seq_len tokens long
    input_ids = [[tokenizer.bos_token_id] + ids for ids in inputs["input_ids"]]
    attention_mask = [[1] + am for am in inputs["attention_mask"]]
    position_ids = [list(range(0, len(input_ids[0])))] * len(texts)

    return BatchFeature(
        {
            "input_ids": torch.tensor(input_ids, dtype=torch.long),
            "attention_mask": torch.tensor(attention_mask, dtype=torch.long),
            "position_ids": torch.tensor(position_ids, dtype=torch.long),
        }
    )

device = "cuda" if torch.cuda.is_available() else "cpu"
model_path = "AI-ModelScope/japanese-stable-clip-vit-l-16"
model = AutoModel.from_pretrained(model_path, trust_remote_code=True).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = AutoImageProcessor.from_pretrained(model_path)

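# Run! Download a sample image (failing early on HTTP errors) and preprocess it
url = 'https://images.pexels.com/photos/2253275/pexels-photo-2253275.jpeg?auto=compress&cs=tinysrgb&dpr=3&h=750&w=1260'
response = requests.get(url, timeout=30)
response.raise_for_status()
image = Image.open(io.BytesIO(response.content))
image = processor(images=image, return_tensors="pt").to(device)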
text = tokenize(
    tokenizer=tokenizer,
    texts=["犬", "猫", "象"],
).to(device)

with torch.no_grad():
    image_features = model.get_image_features(**image)
    text_features = model.get_text_features(**text)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs) 
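
To make the last step of the example easier to read, the sketch below reproduces the scaled softmax on plain Python lists and pairs each probability with its label. The similarity values are invented for illustration and `label_probabilities` is a hypothetical helper; only the factor of 100 and the softmax are taken from the example code.

```python
import math

def label_probabilities(similarities, labels, scale=100.0):
    """Apply softmax to scaled similarities and pair each probability with its label."""
    logits = [scale * s for s in similarities]
    max_logit = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(logit - max_logit) for logit in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}

labels = ["犬", "猫", "象"]
similarities = [0.25, 0.20, 0.10]  # assumption: made-up image-text similarities

probs = label_probabilities(similarities, labels)
best_label = max(probs, key=probs.get)
print(best_label, probs[best_label])
```

Because the similarities are multiplied by 100 before the softmax, even small gaps between them produce sharply peaked probabilities.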

Usage

  1. Install packages
  pip install ftfy pillow requests transformers torch sentencepiece protobuf modelscope

Model Performance

Model                                                            ImageNet top-1 accuracy*
Japanese Stable CLIP ViT-L/16                                    62.06
rinna/japanese-cloob-vit-b-16                                    54.64
laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k   53
rinna/japanese-clip-vit-b-16                                     50.69

* Computed scores based on https://github.com/rinnakk/japanese-clip.
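
For readers unfamiliar with the metric, top-1 accuracy is the percentage of images whose highest-scoring label equals the ground-truth label. The sketch below shows that calculation on invented scores; `top1_accuracy` is a hypothetical helper and not the evaluation script from the linked repository.

```python
def top1_accuracy(score_rows, true_indices):
    """Percentage of rows whose highest-scoring class index equals the true index."""
    correct = 0
    for scores, true_idx in zip(score_rows, true_indices):
        predicted = max(range(len(scores)), key=scores.__getitem__)
        if predicted == true_idx:
            correct += 1
    return 100.0 * correct / len(true_indices)

# Assumption: made-up per-class scores for four images and three classes
scores = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.5, 0.4, 0.1],
    [0.2, 0.7, 0.1],
]
true_labels = [0, 2, 1, 1]

accuracy = top1_accuracy(scores, true_labels)
print(f"{accuracy:.2f}")
```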

Training

The model uses a ViT-L/16 Transformer architecture as an image encoder and a 12-layer BERT as a text encoder with the Japanese tokenizer from rinna/japanese-roberta-base. During training, the image encoder was initialized from the AugReg [vit-large-patch16-224](https://huggingface.co/timm/vit_large_patch16_224.augreg_in21k_ft_in1k) model, and we applied SigLIP (Sigmoid loss for Language-Image Pre-training).
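
The sigmoid loss named above treats every image-text combination in a batch as an independent binary classification: matching pairs are positives and all other combinations are negatives. The following is a minimal sketch of that idea on toy embeddings, following the general formulation of the SigLIP loss. The `temperature` and `bias` values are illustrative assumptions, not the settings used to train this model, and `sigmoid_loss` is a hypothetical function name.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def sigmoid_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss; pair (i, i) is positive, every (i, j != i) is negative."""
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)
    total = 0.0
    for i in range(n):
        for j in range(n):
            similarity = sum(a * b for a, b in zip(images[i], texts[j]))
            logit = temperature * similarity + bias
            z = 1.0 if i == j else -1.0
            # -log(sigmoid(z * logit)) == softplus(-z * logit), computed stably
            x = -z * logit
            total += max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    return total / n

# Toy batch (assumption: invented 2-dimensional embeddings)
image_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]   # text i matches image i
shuffled_texts = [[0.0, 1.0], [1.0, 0.0]]  # text i matches the other image

aligned_loss = sigmoid_loss(image_batch, aligned_texts)
shuffled_loss = sigmoid_loss(image_batch, shuffled_texts)
print(aligned_loss < shuffled_loss)
```

Unlike the softmax-based contrastive loss of the original CLIP, each pair contributes its own term, so no normalization across the whole batch is needed.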

Training Dataset

The training dataset includes the following public datasets:

Use and Limitations

Intended Use

This model is intended to be used by the open-source community in vision-language applications.

Limitations and bias

The training dataset may have contained offensive or inappropriate content even though we applied data filters. We recommend users exercise reasonable caution when using these models in production systems. Do not use the model for any applications that may cause harm or distress to individuals or groups.

How to cite

@misc{JapaneseStableCLIP,
    url    = {https://huggingface.co/stabilityai/japanese-stable-clip-vit-l-16},
    title  = {Japanese Stable CLIP ViT-L/16},
    author = {Shing, Makoto and Akiba, Takuya}
}

Contact

  • For questions and comments about the model, please join Stable Community Japan.
  • For future announcements / information about Stability AI models, research, and events, please follow https://twitter.com/StabilityAI_JP.
  • For business and partnership inquiries, please contact partners-jp@stability.ai. ビジネスや協業に関するお問い合わせはpartners-jp@stability.aiにご連絡ください。