The embedding set trained by Jina AI.
Jina CLIP: your CLIP model is also your text retriever!
Intended Usage & Model Info
jina-clip-v1 is a state-of-the-art English multimodal (text-image) embedding model.
Traditional text embedding models, such as jina-embeddings-v2-base-en, excel in text-to-text retrieval but are incapable of cross-modal tasks. Models like openai/clip-vit-base-patch32 effectively align image and text embeddings but are not optimized for text-to-text retrieval due to their training methodologies and context limitations.
jina-clip-v1 bridges this gap by offering robust performance in both domains. Its text component matches the retrieval efficiency of jina-embeddings-v2-base-en, while its overall architecture sets a new benchmark for cross-modal retrieval. This dual capability makes it an excellent tool for multimodal retrieval-augmented generation (MuRAG) applications, enabling seamless text-to-text and text-to-image searches within a single model.
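To make that concrete, here is a minimal sketch of unified retrieval over a mixed text/image corpus with a single model. It relies on the encode_text/encode_image methods shown under Usage below; the image file names are hypothetical placeholders.

```python
import numpy as np
from transformers import AutoModel

model = AutoModel.from_pretrained('jinaai/jina-clip-v1', trust_remote_code=True)

# Mixed corpus: text passages and images embedded into the same vector space.
text_corpus = model.encode_text(['A blog post about cats', 'A guide to dog training'])
image_corpus = model.encode_image(['cat_photo.jpg', 'dog_photo.jpg'])  # hypothetical local files
corpus = np.vstack([text_corpus, image_corpus])

# A single text query searches both modalities at once.
query = model.encode_text(['feline behavior'])[0]
scores = corpus @ query
print(scores.argsort()[::-1])  # corpus indices, best match first
```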
Data & Parameters
Check out our paper
Usage
- The easiest way to start using jina-clip-v1 is via Jina AI's Embeddings API, as sketched below.
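A minimal Python sketch of calling the Embeddings API over HTTP. The endpoint and payload layout here are assumptions to verify against the current API documentation, and YOUR_JINA_API_KEY is a placeholder:

```python
import requests

# Assumed endpoint and payload shape; check the Embeddings API docs.
response = requests.post(
    'https://api.jina.ai/v1/embeddings',
    headers={'Authorization': 'Bearer YOUR_JINA_API_KEY'},  # placeholder key
    json={
        'model': 'jina-clip-v1',
        'input': [
            {'text': 'A blue cat'},  # text entry
            {'image': 'https://i.pinimg.com/600x315/21/48/7e/21487e8e0970dd366dafaed6ab25d8d8.jpg'},  # image entry
        ],
    },
)
embeddings = [item['embedding'] for item in response.json()['data']]
```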
- Alternatively, you can use Jina CLIP directly via the transformers package.
!pip install transformers einops timm pillow
from transformers import AutoModel

# Initialize the model
model = AutoModel.from_pretrained('jinaai/jina-clip-v1', trust_remote_code=True)

# New meaningful sentences
sentences = ['A blue cat', 'A red cat']

# Public image URLs
image_urls = [
    'https://i.pinimg.com/600x315/21/48/7e/21487e8e0970dd366dafaed6ab25d8d8.jpg',
    'https://i.pinimg.com/736x/c9/f2/3e/c9f23e212529f13f19bad5602d84b78b.jpg'
]

# Encode text and images
text_embeddings = model.encode_text(sentences)
image_embeddings = model.encode_image(image_urls)  # also accepts PIL.Image, local filenames, dataURI

# Compute similarities
print(text_embeddings[0] @ text_embeddings[1].T)   # text embedding similarity
print(text_embeddings[0] @ image_embeddings[0].T)  # text-image cross-modal similarity
print(text_embeddings[0] @ image_embeddings[1].T)  # text-image cross-modal similarity
print(text_embeddings[1] @ image_embeddings[0].T)  # text-image cross-modal similarity
print(text_embeddings[1] @ image_embeddings[1].T)  # text-image cross-modal similarity
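Note that the dot products above equal cosine similarities only if the returned embeddings are unit-normalized; if in doubt, normalize explicitly. A small helper, assuming text_embeddings and image_embeddings from the example above are numpy arrays:

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity that does not assume unit-norm inputs."""
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos_sim(text_embeddings[0], image_embeddings[0]))
```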
- JavaScript developers can use Jina CLIP via the Transformers.js library. Note that to use this model, you need to install Transformers.js v3 from source using `npm install xenova/transformers.js#v3`.
import { AutoTokenizer, CLIPTextModelWithProjection, AutoProcessor, CLIPVisionModelWithProjection, RawImage, cos_sim } from '@xenova/transformers';

// Load tokenizer and text model
const tokenizer = await AutoTokenizer.from_pretrained('jinaai/jina-clip-v1');
const text_model = await CLIPTextModelWithProjection.from_pretrained('jinaai/jina-clip-v1');

// Load processor and vision model
const processor = await AutoProcessor.from_pretrained('Xenova/clip-vit-base-patch32');
const vision_model = await CLIPVisionModelWithProjection.from_pretrained('jinaai/jina-clip-v1');

// Run tokenization
const texts = ['A blue cat', 'A red cat'];
const text_inputs = tokenizer(texts, { padding: true, truncation: true });

// Compute text embeddings
const { text_embeds } = await text_model(text_inputs);

// Read images and run processor
const urls = [
    'https://i.pinimg.com/600x315/21/48/7e/21487e8e0970dd366dafaed6ab25d8d8.jpg',
    'https://i.pinimg.com/736x/c9/f2/3e/c9f23e212529f13f19bad5602d84b78b.jpg'
];
const images = await Promise.all(urls.map(url => RawImage.read(url)));
const image_inputs = await processor(images);

// Compute vision embeddings
const { image_embeds } = await vision_model(image_inputs);

// Compute similarities
console.log(cos_sim(text_embeds[0].data, text_embeds[1].data)); // text embedding similarity
console.log(cos_sim(text_embeds[0].data, image_embeds[0].data)); // text-image cross-modal similarity
console.log(cos_sim(text_embeds[0].data, image_embeds[1].data)); // text-image cross-modal similarity
console.log(cos_sim(text_embeds[1].data, image_embeds[0].data)); // text-image cross-modal similarity
console.log(cos_sim(text_embeds[1].data, image_embeds[1].data)); // text-image cross-modal similarity
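In either runtime, retrieval then reduces to ranking candidates by similarity. A minimal sketch in Python, reusing text_embeddings, image_embeddings, and image_urls from the Python example above:

```python
import numpy as np

# Rank the candidate images for the first text query, best match first.
scores = np.asarray(image_embeddings) @ np.asarray(text_embeddings[0])
for i in scores.argsort()[::-1]:
    print(f'{scores[i]:.4f}  {image_urls[i]}')
```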
Performance
Text-Image Retrieval
| Name | Flickr Image Retr. R@1 | Flickr Image Retr. R@5 | Flickr Text Retr. R@1 | Flickr Text Retr. R@5 |
|---|---|---|---|---|
| ViT-B-32 | 0.597 | 0.8398 | 0.781 | 0.938 |
| ViT-B-16 | 0.6216 | 0.8572 | 0.822 | 0.966 |
| jina-clip | 0.6748 | 0.8902 | 0.811 | 0.965 |
| Name | MSCOCO Image Retr. R@1 | MSCOCO Image Retr. R@5 | MSCOCO Text Retr. R@1 | MSCOCO Text Retr. R@5 |
|---|---|---|---|---|
| ViT-B-32 | 0.342 | 0.6001 | 0.5234 | 0.7634 |
| ViT-B-16 | 0.3309 | 0.5842 | 0.5242 | 0.767 |
| jina-clip | 0.4111 | 0.6644 | 0.5544 | 0.7904 |
Text-Text Retrieval
| Name | STS12 | STS15 | STS17 | STS13 | STS14 | STS16 | STS22 | STSBenchmark | SummEval |
|---|---|---|---|---|---|---|---|---|---|
| jina-embeddings-v2 | 0.7427 | 0.8755 | 0.8888 | 0.833 | 0.7917 | 0.836 | 0.6346 | 0.8404 | 0.3056 |
| jina-clip | 0.7352 | 0.8746 | 0.8976 | 0.8323 | 0.7868 | 0.8377 | 0.6583 | 0.8493 | 0.3048 |
| Name | ArguAna | FiQA2018 | NFCorpus | Quora | SCIDOCS | SciFact | TRECCOVID |
|---|---|---|---|---|---|---|---|
| jina-embeddings-v2 | 0.4418 | 0.4158 | 0.3245 | 0.882 | 0.1986 | 0.6668 | 0.6591 |
| jina-clip | 0.4933 | 0.3827 | 0.3352 | 0.8789 | 0.2024 | 0.6734 | 0.7161 |
Contact
Join our Discord community and chat with other community members about ideas.
Citation
If you find jina-clip-v1 useful in your research, please cite the following paper:
@misc{2405.20204,
    Author = {Andreas Koukounas and Georgios Mastrapas and Michael Günther and Bo Wang and Scott Martens and Isabelle Mohr and Saba Sturua and Mohammad Kalim Akram and Joan Fontanals Martínez and Saahil Ognawala and Susana Guzman and Maximilian Werk and Nan Wang and Han Xiao},
    Title = {Jina CLIP: Your CLIP Model Is Also Your Text Retriever},
    Year = {2024},
    Eprint = {arXiv:2405.20204},
}
FAQ
I encounter this problem, what should I do?
ValueError: The model class you are passing has a `config_class` attribute that is not consistent with the config class you passed (model has <class 'transformers_modules.jinaai.jina-clip-implementation.7f069e2d54d609ef1ad2eb578c7bf07b5a51de41.configuration_clip.JinaCLIPConfig'> and you passed <class 'transformers_modules.jinaai.jina-clip-implementation.7f069e2d54d609ef1ad2eb578c7bf07b5a51de41.configuration_cli.JinaCLIPConfig'>. Fix one of those so they match!
There was a bug in the Transformers library between versions 4.40.x and 4.41.1. You can fix it by upgrading transformers to >=4.41.2 or downgrading to <=4.40.0.
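For example, to move past the affected range:

```bash
!pip install -U "transformers>=4.41.2"
```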
Given one query, how can I merge its text-text and text-image cosine similarities?
Our empirical study shows that text-text cosine similarity is normally larger than text-image cosine similarity!
If you want to merge the two scores, we recommend two approaches:
- weighted average of text-text sim and text-image sim:
combined_scores = sim(text, text) + lam * sim(text, image)  # lam is the mixing weight; the optimal value depends on your dataset, but in general lam=2 can be a good choice.
- apply z-score normalization to each set of scores before merging (a runnable version follows below):
# pseudo code
query_document_mean = np.mean(cos_sim_query_documents)
query_document_std = np.std(cos_sim_query_documents)
text_image_mean = np.mean(cos_sim_text_images)
text_image_std = np.std(cos_sim_text_images)
query_document_sim_normalized = (cos_sim_query_documents - query_document_mean) / query_document_std
text_image_sim_normalized = (cos_sim_text_images - text_image_mean) / text_image_std
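Once normalized, the two score distributions are on a comparable scale and can be summed. A runnable version of the sketch above, with synthetic scores standing in for real cosine similarities:

```python
import numpy as np

# Synthetic per-document scores standing in for real cosine similarities.
cos_sim_query_documents = np.array([0.82, 0.75, 0.69])  # text-to-text scores
cos_sim_text_images = np.array([0.31, 0.28, 0.35])      # text-to-image scores

def zscore(x):
    """Shift and scale scores to zero mean and unit variance."""
    return (x - x.mean()) / x.std()

combined_scores = zscore(cos_sim_query_documents) + zscore(cos_sim_text_images)
print(combined_scores.argsort()[::-1])  # document indices, best match first
```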