jina-clip-v1


Technical Information

Open-source URL
https://modelscope.cn/models/jinaai/jina-clip-v1
License
Apache License 2.0

Project Details



Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance tuning for neural search applications.

The embedding set trained by Jina AI.

Jina CLIP: your CLIP model is also your text retriever!

Intended Usage & Model Info

jina-clip-v1 is a state-of-the-art English multimodal (text-image) embedding model.

Traditional text embedding models, such as jina-embeddings-v2-base-en, excel in text-to-text retrieval but are incapable of cross-modal tasks. Models like openai/clip-vit-base-patch32 effectively align image and text embeddings but are not optimized for text-to-text retrieval due to their training methodologies and context limitations.

jina-clip-v1 bridges this gap by offering robust performance in both domains. Its text component matches the retrieval efficiency of jina-embeddings-v2-base-en, while its overall architecture sets a new benchmark for cross-modal retrieval. This dual capability makes it an excellent tool for multimodal retrieval-augmented generation (MuRAG) applications, enabling seamless text-to-text and text-to-image searches within a single model.
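To make this dual capability concrete, here is a minimal sketch that indexes text documents and images together and answers a single text query against both. It relies on the encode_text / encode_image methods shown in the Usage section below; the corpus entries and image file names are invented for illustration.

# Minimal sketch of single-model multimodal retrieval. The corpus texts and
# image file names are invented for illustration; encode_text / encode_image
# are the methods shown in the Usage section below.
import numpy as np
from transformers import AutoModel

model = AutoModel.from_pretrained('jinaai/jina-clip-v1', trust_remote_code=True)

docs = ['A cat sleeping on a sofa', 'A dog playing in a park']  # text documents
imgs = ['cat.jpg', 'dog.jpg']                                   # hypothetical local image files

# One index mixing text and image embeddings (assuming numpy outputs).
index = np.vstack([model.encode_text(docs), model.encode_image(imgs)])

# A single text query is scored against both modalities at once.
query = model.encode_text(['where is the cat'])[0]
scores = index @ query        # dot product, as in the similarity examples below
print(np.argsort(-scores))    # ranking over [docs..., imgs...]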

Data & Parameters

Check out our paper

Usage

  1. The easiest way to start using jina-clip-v1 is via Jina AI's Embeddings API (a request sketch follows the code examples below).
  2. Alternatively, you can use Jina CLIP directly via the transformers package.
!pip install transformers einops timm pillow
from transformers import AutoModel

# Initialize the model
model = AutoModel.from_pretrained('jinaai/jina-clip-v1', trust_remote_code=True)

# New meaningful sentences
sentences = ['A blue cat', 'A red cat']

# Public image URLs
image_urls = [
    'https://i.pinimg.com/600x315/21/48/7e/21487e8e0970dd366dafaed6ab25d8d8.jpg',
    'https://i.pinimg.com/736x/c9/f2/3e/c9f23e212529f13f19bad5602d84b78b.jpg'
]

# Encode text and images
text_embeddings = model.encode_text(sentences)
image_embeddings = model.encode_image(image_urls)  # also accepts PIL.Image, local filenames, dataURI

# Compute similarities
print(text_embeddings[0] @ text_embeddings[1].T)   # text embedding similarity
print(text_embeddings[0] @ image_embeddings[0].T)  # text-image cross-modal similarity
print(text_embeddings[0] @ image_embeddings[1].T)  # text-image cross-modal similarity
print(text_embeddings[1] @ image_embeddings[0].T)  # text-image cross-modal similarity
print(text_embeddings[1] @ image_embeddings[1].T)  # text-image cross-modal similarity
  3. JavaScript developers can use Jina CLIP via the Transformers.js library. Note that to use this model, you need to install Transformers.js v3 from source using npm install xenova/transformers.js#v3.
import { AutoTokenizer, CLIPTextModelWithProjection, AutoProcessor, CLIPVisionModelWithProjection, RawImage, cos_sim } from '@xenova/transformers';

// Load tokenizer and text model
const tokenizer = await AutoTokenizer.from_pretrained('jinaai/jina-clip-v1');
const text_model = await CLIPTextModelWithProjection.from_pretrained('jinaai/jina-clip-v1');

// Load processor and vision model
const processor = await AutoProcessor.from_pretrained('Xenova/clip-vit-base-patch32');
const vision_model = await CLIPVisionModelWithProjection.from_pretrained('jinaai/jina-clip-v1');

// Run tokenization
const texts = ['A blue cat', 'A red cat'];
const text_inputs = tokenizer(texts, { padding: true, truncation: true });

// Compute text embeddings
const { text_embeds } = await text_model(text_inputs);

// Read images and run processor
const urls = [
    'https://i.pinimg.com/600x315/21/48/7e/21487e8e0970dd366dafaed6ab25d8d8.jpg',
    'https://i.pinimg.com/736x/c9/f2/3e/c9f23e212529f13f19bad5602d84b78b.jpg'
];
const image = await Promise.all(urls.map(url => RawImage.read(url)));
const image_inputs = await processor(image);

// Compute vision embeddings
const { image_embeds } = await vision_model(image_inputs);

// Compute similarities
console.log(cos_sim(text_embeds[0].data, text_embeds[1].data))   // text embedding similarity
console.log(cos_sim(text_embeds[0].data, image_embeds[0].data))  // text-image cross-modal similarity
console.log(cos_sim(text_embeds[0].data, image_embeds[1].data))  // text-image cross-modal similarity
console.log(cos_sim(text_embeds[1].data, image_embeds[0].data))  // text-image cross-modal similarity
console.log(cos_sim(text_embeds[1].data, image_embeds[1].data))  // text-image cross-modal similarity
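
For option 1 above, a request against the hosted Embeddings API might look like the sketch below. The endpoint URL, payload shape, and mixed text/image input format are assumptions based on the public API documentation, so check the current API reference before relying on them.

# Hedged sketch of calling the hosted Embeddings API with jina-clip-v1.
# Endpoint and payload shape are assumptions -- verify against the current
# Jina AI API reference.
import requests

response = requests.post(
    'https://api.jina.ai/v1/embeddings',
    headers={
        'Authorization': 'Bearer <YOUR_JINA_API_KEY>',  # placeholder key
        'Content-Type': 'application/json',
    },
    json={
        'model': 'jina-clip-v1',
        # Text entries and image URLs can presumably be mixed in one request.
        'input': [
            {'text': 'A blue cat'},
            {'image': 'https://i.pinimg.com/600x315/21/48/7e/21487e8e0970dd366dafaed6ab25d8d8.jpg'},
        ],
    },
)
embeddings = [item['embedding'] for item in response.json()['data']]
print(len(embeddings), len(embeddings[0]))  # number of inputs, embedding dimension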

Performance

Text-Image Retrieval

Name        Flickr Image Retr. R@1   Flickr Image Retr. R@5   Flickr Text Retr. R@1   Flickr Text Retr. R@5
ViT-B-32    0.597                    0.8398                   0.781                   0.938
ViT-B-16    0.6216                   0.8572                   0.822                   0.966
jina-clip   0.6748                   0.8902                   0.811                   0.965

Name        MSCOCO Image Retr. R@1   MSCOCO Image Retr. R@5   MSCOCO Text Retr. R@1   MSCOCO Text Retr. R@5
ViT-B-32    0.342                    0.6001                   0.5234                  0.7634
ViT-B-16    0.3309                   0.5842                   0.5242                  0.767
jina-clip   0.4111                   0.6644                   0.5544                  0.7904

Text-Text Retrieval

Name                 STS12   STS15   STS17   STS13   STS14   STS16   STS22   STSBenchmark   SummEval
jina-embeddings-v2   0.7427  0.8755  0.8888  0.833   0.7917  0.836   0.6346  0.8404         0.3056
jina-clip            0.7352  0.8746  0.8976  0.8323  0.7868  0.8377  0.6583  0.8493         0.3048

Name                 ArguAna   FiQA2018   NFCorpus   Quora    SCIDOCS   SciFact   TRECCOVID
jina-embeddings-v2   0.4418    0.4158     0.3245     0.882    0.1986    0.6668    0.6591
jina-clip            0.4933    0.3827     0.3352     0.8789   0.2024    0.6734    0.7161
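
R@1 and R@5 in the tables above are recall-at-k: the fraction of queries whose single ground-truth match appears among the top k retrieved candidates. A small sketch of the computation, assuming a square similarity matrix where the ground truth for query i is candidate i:

# Sketch of recall@k as used in the tables above.
import numpy as np

def recall_at_k(sims: np.ndarray, k: int) -> float:
    """Fraction of queries whose ground-truth candidate (index i for
    query i) is among the k highest-scoring candidates."""
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = (topk == np.arange(len(sims))[:, None]).any(axis=1)
    return float(hits.mean())

sims = np.random.rand(100, 100)  # toy query-by-candidate similarity matrix
print(recall_at_k(sims, 1), recall_at_k(sims, 5))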

Contact

Join our Discord community and chat with other community members about ideas.

Citation

If you find jina-clip-v1 useful in your research, please cite the following paper:

@misc{2405.20204,
    Author = {Andreas Koukounas and Georgios Mastrapas and Michael Günther and Bo Wang and Scott Martens and Isabelle Mohr and Saba Sturua and Mohammad Kalim Akram and Joan Fontanals Martínez and Saahil Ognawala and Susana Guzman and Maximilian Werk and Nan Wang and Han Xiao},
    Title = {Jina CLIP: Your CLIP Model Is Also Your Text Retriever},
    Year = {2024},
    Eprint = {arXiv:2405.20204},
}

FAQ

I encountered this problem, what should I do?

ValueError: The model class you are passing has a `config_class` attribute that is not consistent with the config class you passed (model has <class 'transformers_modules.jinaai.jina-clip-implementation.7f069e2d54d609ef1ad2eb578c7bf07b5a51de41.configuration_clip.JinaCLIPConfig'> and you passed <class 'transformers_modules.jinaai.jina-clip-implementation.7f069e2d54d609ef1ad2eb578c7bf07b5a51de41.configuration_cli.JinaCLIPConfig'>. Fix one of those so they match!

There was a bug in the Transformers library between versions 4.40.x and 4.41.1. You can upgrade transformers to >=4.41.2 or downgrade to <=4.40.0.
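
For example, in the same notebook style as the Usage section (the version pins mirror the ranges above):

!pip install -U "transformers>=4.41.2"
# or, to stay on the older line: !pip install "transformers<=4.40.0"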

Given one query, how can I merge its text-text and text-image cosine similarities?

Our empirical study shows that text-text cosine similarity is normally larger than text-image cosine similarity! If you want to merge the two scores, we recommend two ways (a runnable sketch combining both follows the pseudo code below):

  1. weighted average of text-text sim and text-image sim:
combined_scores = sim(text, text) + lambda * sim(text, image)  # optimal lambda depends on your dataset, but in general lambda=2 can be a good choice.
  2. apply z-score normalization before merging scores:
# pseudo code
query_document_mean = np.mean(cos_sim_query_documents)
query_document_std = np.std(cos_sim_query_documents)
text_image_mean = np.mean(cos_sim_text_images)
text_image_std = np.std(cos_sim_text_images)

query_document_sim_normalized = (cos_sim_query_documents - query_document_mean) / query_document_std
text_image_sim_normalized = (cos_sim_text_images - text_image_mean) / text_image_std
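
Putting the two options together, a runnable version might look like this; the score arrays and the lambda value are toy values for illustration:

# Runnable version of the two merging strategies above.
import numpy as np

# Toy similarity scores for one query against documents and images.
cos_sim_query_documents = np.array([0.81, 0.74, 0.68])
cos_sim_text_images = np.array([0.29, 0.35, 0.22])

# Option 1: weighted average (lambda tuned per dataset; 2 suggested above).
lam = 2.0
combined = cos_sim_query_documents + lam * cos_sim_text_images

# Option 2: z-score normalize each score distribution before merging.
def zscore(x):
    return (x - x.mean()) / x.std()

combined_z = zscore(cos_sim_query_documents) + zscore(cos_sim_text_images)
print(combined, combined_z)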
