QA-CLIP-ViT-L-14

Technical Information

Open-source URL
https://modelscope.cn/models/AI-ModelScope/QA-CLIP-ViT-L-14
License
apache-2.0

Details

中文说明 | English

Introduction

This project aims to provide a better Chinese CLIP model. The training data used in this project consists of publicly accessible image URLs and the related Chinese text descriptions, totaling 400 million pairs. After screening, we ultimately used 100 million of these pairs for training. This project is produced by the QQ-ARC Joint Lab, Tencent PCG. For more detailed information, please refer to the main page of the QA-CLIP project. We have also open-sourced our code on GitHub (QA-CLIP), and you are welcome to give it a star!

Results

We conducted zero-shot tests on the MUGE Retrieval, Flickr30K-CN, and COCO-CN datasets for image-text retrieval. For zero-shot image classification, we tested on the ImageNet dataset. The results are shown in the tables below:

Flickr30K-CN Zero-shot Retrieval (Official Test Set):

| Model | Text-to-Image R@1 | R@5 | R@10 | Image-to-Text R@1 | R@5 | R@10 |
|--------------------|------:|------:|------:|------:|------:|------:|
| CN-CLIP (RN50)     | 48.8 | 76.0 | 84.6 | 60.0 | 85.9 | 92.0 |
| QA-CLIP (RN50)     | 50.5 | 77.4 | 86.1 | 67.1 | 87.9 | 93.2 |
| CN-CLIP (ViT-B/16) | 62.7 | 86.9 | 92.8 | 74.6 | 93.5 | 97.1 |
| QA-CLIP (ViT-B/16) | 63.8 | 88.0 | 93.2 | 78.4 | 96.1 | 98.5 |
| CN-CLIP (ViT-L/14) | 68.0 | 89.7 | 94.4 | 80.2 | 96.6 | 98.2 |
| AltClip (ViT-L/14) | 69.7 | 90.1 | 94.8 | 84.8 | 97.7 | 99.1 |
| QA-CLIP (ViT-L/14) | 69.3 | 90.3 | 94.7 | 85.3 | 97.9 | 99.2 |


MUGE Zero-shot Retrieval (Official Validation Set):

| Model | Text-to-Image R@1 | R@5 | R@10 | Image-to-Text R@1 | R@5 | R@10 |
|--------------------|------:|------:|------:|------:|------:|------:|
| CN-CLIP (RN50)     | 42.6 | 68.5 | 78.0 | 30.0 | 56.2 | 66.9 |
| QA-CLIP (RN50)     | 44.0 | 69.9 | 79.5 | 32.4 | 59.5 | 70.3 |
| CN-CLIP (ViT-B/16) | 52.1 | 76.7 | 84.4 | 38.7 | 65.6 | 75.1 |
| QA-CLIP (ViT-B/16) | 53.2 | 77.7 | 85.1 | 40.7 | 68.2 | 77.2 |
| CN-CLIP (ViT-L/14) | 56.4 | 79.8 | 86.2 | 42.6 | 69.8 | 78.6 |
| AltClip (ViT-L/14) | 29.6 | 49.9 | 58.8 | 21.4 | 42.0 | 51.9 |
| QA-CLIP (ViT-L/14) | 57.4 | 81.0 | 87.7 | 45.5 | 73.0 | 81.4 |


COCO-CN Zero-shot Retrieval (Official Test Set):

| Model | Text-to-Image R@1 | R@5 | R@10 | Image-to-Text R@1 | R@5 | R@10 |
|--------------------|------:|------:|------:|------:|------:|------:|
| CN-CLIP (RN50)     | 48.1 | 81.3 | 90.5 | 50.9 | 81.1 | 90.5 |
| QA-CLIP (RN50)     | 50.1 | 82.5 | 91.7 | 56.7 | 85.2 | 92.9 |
| CN-CLIP (ViT-B/16) | 62.2 | 87.1 | 94.9 | 56.3 | 84.0 | 93.3 |
| QA-CLIP (ViT-B/16) | 62.9 | 87.7 | 94.7 | 61.5 | 87.6 | 94.8 |
| CN-CLIP (ViT-L/14) | 64.9 | 88.8 | 94.2 | 60.6 | 84.4 | 93.1 |
| AltClip (ViT-L/14) | 63.5 | 87.6 | 93.5 | 62.6 | 88.5 | 95.9 |
| QA-CLIP (ViT-L/14) | 65.7 | 90.2 | 95.0 | 64.5 | 88.3 | 95.1 |
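
The R@1/R@5/R@10 columns above are Recall@K: for each text query, candidate images are ranked by cosine similarity between the normalized embeddings, and a query counts as a hit if its ground-truth image appears in the top K (and symmetrically for image-to-text). The snippet below is a minimal sketch of this metric, not the official evaluation script; the feature dimension, dataset sizes, and ground-truth indices are illustrative placeholders.

# Minimal sketch of Recall@K for text-to-image retrieval (not the official
# evaluation script). Shapes, feature dimension, and ground-truth indices
# below are illustrative placeholders.
import torch

def recall_at_k(text_features, image_features, gt_image_idx, k=1):
    # cosine similarity matrix: one row per text query, one column per image
    # (features are assumed to be L2-normalized, as in the inference example below)
    sims = text_features @ image_features.T              # (num_texts, num_images)
    topk_idx = sims.topk(k, dim=-1).indices              # top-K image indices per query
    hits = (topk_idx == gt_image_idx.unsqueeze(-1)).any(dim=-1)
    return hits.float().mean().item()

# illustrative usage with random placeholder embeddings
text_features = torch.nn.functional.normalize(torch.randn(100, 768), dim=-1)
image_features = torch.nn.functional.normalize(torch.randn(500, 768), dim=-1)
gt_image_idx = torch.randint(0, 500, (100,))
print(recall_at_k(text_features, image_features, gt_image_idx, k=5))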


Zero-shot Image Classification on ImageNet:

| Model | ImageNet |
|--------------------|------:|
| CN-CLIP (RN50)     | 33.5 |
| QA-CLIP (RN50)     | 35.5 |
| CN-CLIP (ViT-B/16) | 48.4 |
| QA-CLIP (ViT-B/16) | 49.7 |
| CN-CLIP (ViT-L/14) | 54.7 |
| QA-CLIP (ViT-L/14) | 55.8 |
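
Zero-shot classification with a CLIP-style model is typically done by encoding a text prompt for every class name and assigning each image to the most similar prompt. Below is a minimal sketch of that recipe using the same Hugging Face classes as the inference example further down; the prompt template and the class names are made-up placeholders, not necessarily the setup used to produce the ImageNet numbers above.

# Illustrative zero-shot classification sketch. The prompt template and the
# class names are placeholders, not the exact setup behind the reported numbers.
from transformers import ChineseCLIPProcessor, ChineseCLIPModel

model = ChineseCLIPModel.from_pretrained("TencentARC/QA-CLIP-ViT-L-14")
processor = ChineseCLIPProcessor.from_pretrained("TencentARC/QA-CLIP-ViT-L-14")

class_names = ["金鱼", "猫", "狗"]                        # hypothetical class names
prompts = [f"一张{name}的照片" for name in class_names]   # "a photo of a <class>"

# encode one prompt per class and L2-normalize the text embeddings
text_inputs = processor(text=prompts, padding=True, return_tensors="pt")
text_features = model.get_text_features(**text_inputs)
text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True)

def classify(image):
    """Return the index of the class prompt most similar to a PIL image."""
    image_inputs = processor(images=image, return_tensors="pt")
    image_features = model.get_image_features(**image_inputs)
    image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True)
    return int((image_features @ text_features.T).argmax(dim=-1))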




Getting Started

Inference Code

Inference code example:

from PIL import Image
import requests
from transformers import ChineseCLIPProcessor, ChineseCLIPModel

model = ChineseCLIPModel.from_pretrained("TencentARC/QA-CLIP-ViT-L-14")
processor = ChineseCLIPProcessor.from_pretrained("TencentARC/QA-CLIP-ViT-L-14")

url = "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
# Squirtle, Bulbasaur, Charmander, Pikachu in English
texts = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]

# compute image features
inputs = processor(images=image, return_tensors="pt")
image_features = model.get_image_features(**inputs)
image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True)  # normalize

# compute text features
inputs = processor(text=texts, padding=True, return_tensors="pt")
text_features = model.get_text_features(**inputs)
text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True)  # normalize

# compute image-text similarity scores
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)
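
Continuing from the snippet above, the resulting probabilities can be turned into a predicted label for a quick sanity check (the exact values depend on the downloaded image):

# pick the candidate text with the highest image-text similarity
predicted_label = texts[probs.argmax(dim=-1).item()]
print("predicted label:", predicted_label)
print("probabilities:", probs.tolist())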



Acknowledgments

The project code is based on the implementation of Chinese-CLIP, and we are very grateful for their outstanding open-source contributions.
