The text embedding set trained by Jina AI.
The embedding model was trained with a sequence length of 512, but extrapolates to an 8k sequence length (or even longer) thanks to ALiBi.
This makes our model useful for a range of use cases, especially when processing long documents is needed, including long document retrieval, semantic textual similarity, text reranking, recommendation, RAG and LLM-based generative search, etc. With a standard size of 137 million parameters, the model enables fast inference while delivering better performance than our small model. It is recommended to use a single GPU for inference.
### Why mean pooling?
`mean pooling` takes all token embeddings from the model output and averages them at the sentence/paragraph level.
It has been proven to be the most effective way to produce high-quality sentence embeddings.
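In symbols, for token embeddings $h_1, \dots, h_n$ and attention-mask values $m_1, \dots, m_n$, the pooled sentence embedding is (a small formalization of the description above, matching the `mean_pooling` helper in the Usage section):

$$
e = \frac{\sum_{i=1}^{n} m_i \, h_i}{\sum_{i=1}^{n} m_i}
$$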
We offer an `encode` function to deal with this.
However, if you would like to do it without using the default `encode` function, see the `transformers` example in the Usage section below.
## Intended Usage & Model Info
`jina-embeddings-v2-base-en` is an English, monolingual embedding model. The backbone `jina-bert-v2-base-en` is pretrained on the C4 dataset.
The model is further trained on Jina AI's collection of more than 400 million sentence pairs and hard negatives.
These pairs were obtained from various domains and were carefully selected through a thorough cleaning process.
Additionally, we provide the following embedding models:

- `jina-embeddings-v2-small-en`: 33 million parameters.
- `jina-embeddings-v2-base-en`: 137 million parameters.
- `jina-embeddings-v2-base-zh`: Chinese-English Bilingual embeddings (soon).
- `jina-embeddings-v2-base-de`: German-English Bilingual embeddings (soon).
- `jina-embeddings-v2-base-es`: Spanish-English Bilingual embeddings (soon).

## Data & Parameters
For details, see the Jina Embeddings V2 technical report.
## Usage
You can use Jina Embedding models directly from the `transformers` package:
```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def mean_pooling(model_output, attention_mask):
    # Average the token embeddings, weighting each token by its attention-mask value
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

sentences = ['How is the weather today?', 'What is the current weather like today?']

tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-small-en')
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-small-en', trust_remote_code=True)

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded_input)

embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
# L2-normalize so that cosine similarity reduces to a dot product
embeddings = F.normalize(embeddings, p=2, dim=1)
```
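For example, the two sentence embeddings above can then be compared directly; a minimal sketch (since the embeddings are already L2-normalized, the dot product equals cosine similarity):

```python
# Cosine similarity between the two sentence embeddings
similarity = (embeddings[0] @ embeddings[1]).item()
print(similarity)
```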
Alternatively, you can use the model through the `modelscope` package together with the model's `encode` method:

```python
!pip install modelscope

from modelscope import AutoModel
from numpy.linalg import norm
from modelscope.hub.api import HubApi

# Log in to the ModelScope Hub with an access token
api = HubApi()
api.login('3a1d14c4-ebc9-4e13-996d-5b5bc152f287')

cos_sim = lambda a, b: (a @ b.T) / (norm(a) * norm(b))

model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)  # trust_remote_code is needed to use the encode method
embeddings = model.encode(['How is the weather today?', 'What is the current weather like today?'])
print(cos_sim(embeddings[0], embeddings[1]))
```
If you only want to handle shorter sequences, such as 2k, pass the `max_length` parameter to the `encode` function:

```python
embeddings = model.encode(
    ['Very long ... document'],
    max_length=2048
)
```
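Conversely, since the model extrapolates to an 8k sequence length (see above), a longer `max_length` can be passed in the same way; a minimal sketch under that assumption:

```python
# Embed a long document using the full 8k context window
# (assumes encode() accepts max_length up to 8192, per the model description above)
embeddings = model.encode(
    ['Very long ... document'],
    max_length=8192
)
```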
## Fully-managed Embeddings Service
Alternatively, you can use Jina AI's Embedding platform for fully-managed access to Jina Embeddings models.
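For illustration only, here is a minimal sketch of calling such a hosted embeddings service over HTTP; the endpoint URL, payload fields, and authentication header are assumptions, so check Jina AI's documentation for the actual interface:

```python
import requests

# Hypothetical request to a hosted embeddings endpoint (URL and payload are assumptions)
response = requests.post(
    "https://api.jina.ai/v1/embeddings",           # assumed endpoint
    headers={"Authorization": "Bearer <YOUR_API_KEY>"},
    json={
        "model": "jina-embeddings-v2-base-en",     # assumed model identifier
        "input": ["How is the weather today?"],
    },
)
print(response.json())
```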
## Use Jina Embeddings for RAG
According to the latest blog post from LlamaIndex: "In summary, to achieve the peak performance in both hit rate and MRR, the combination of OpenAI or JinaAI-Base embeddings with the CohereRerank/bge-reranker-large reranker stands out."
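As a rough illustration of the retrieval step in such a RAG pipeline, the sketch below ranks a few made-up passages against a query using the `encode` function shown in the Usage section; the passages, query, and variable names are illustrative assumptions:

```python
import numpy as np
from transformers import AutoModel

# Load the embedding model (trust_remote_code enables the encode method)
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)

# Made-up corpus and query for illustration
passages = [
    "Jina Embeddings V2 extrapolates to 8k sequence lengths thanks to ALiBi.",
    "Mean pooling averages token embeddings at the sentence level.",
    "The base model has 137 million parameters.",
]
query = "How long can the input sequence be?"

# Embed passages and query, then rank passages by cosine similarity
passage_vecs = np.asarray(model.encode(passages))
query_vec = np.asarray(model.encode([query]))[0]
scores = passage_vecs @ query_vec / (
    np.linalg.norm(passage_vecs, axis=1) * np.linalg.norm(query_vec)
)
for idx in np.argsort(-scores):
    print(f"{scores[idx]:.3f}  {passages[idx]}")
```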
## Plans

## Contact
Join our Discord community and chat with other community members about ideas.

## Citation
If you find Jina Embeddings useful in your research, please cite the following paper:
```bibtex
@misc{günther2023jina,
      title={Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents},
      author={Michael Günther and Jackmin Ong and Isabelle Mohr and Alaeddine Abdessalem and Tanguy Abel and Mohammad Kalim Akram and Susana Guzman and Georgios Mastrapas and Saba Sturua and Bo Wang and Maximilian Werk and Nan Wang and Han Xiao},
      year={2023},
      eprint={2310.19923},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```