The text embedding set trained by Jina AI.
## Quick Start

The easiest way to start using `jina-embeddings-v2-base-zh` is to use Jina AI's Embedding API.
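For illustration, a minimal sketch of calling that API with this model. The endpoint and the OpenAI-style response shape are assumptions based on Jina AI's public API, not part of this card; replace `<YOUR_JINA_API_KEY>` with your own key:

```python
import requests

# Sketch of a quick-start call; endpoint and payload shape are assumptions.
resp = requests.post(
    'https://api.jina.ai/v1/embeddings',
    headers={'Authorization': 'Bearer <YOUR_JINA_API_KEY>'},
    json={'model': 'jina-embeddings-v2-base-zh',
          'input': ['How is the weather today?', '今天天气怎么样?']},
)
embeddings = [item['embedding'] for item in resp.json()['data']]
```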
## Intended Usage & Model Info

`jina-embeddings-v2-base-zh` is a Chinese/English bilingual text embedding model. We will publish a report with technical details about the training of the bilingual models soon; the training of the English model is described in this technical report.

- `jina-embeddings-v2-small-en`: 33 million parameters.
- `jina-embeddings-v2-base-en`: 137 million parameters.
- `jina-embeddings-v2-base-zh`: 161 million parameters, Chinese-English bilingual embeddings.
- `jina-embeddings-v2-base-de`: 161 million parameters, German-English bilingual embeddings.
- `jina-embeddings-v2-base-es`: Spanish-English bilingual embeddings (coming soon).
## Data & Parameters

## Usage

### Why mean pooling?

`mean pooling` takes all token embeddings from the model output and averages them at the sentence/paragraph level. It has proven to be the most effective way to produce high-quality sentence embeddings. We offer an `encode` function that handles this for you. However, if you would like to do it without using the default `encode` function:
```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

# Mean pooling: average the token embeddings, masking out padding tokens.
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

sentences = ['How is the weather today?', '今天天气怎么样?']
tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-zh')
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True)
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    model_output = model(**encoded_input)
embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)
```
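Because the embeddings above are L2-normalized, cosine similarity reduces to a plain dot product. A small follow-up sketch, reusing the `embeddings` tensor from the snippet above:

```python
# Cosine similarity between the two sentence embeddings (both unit-length).
similarity = embeddings[0] @ embeddings[1]
print(similarity.item())
```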
You can use Jina Embedding models directly from the modelscope package:

```python
!pip install modelscope

from modelscope import AutoModel
from numpy.linalg import norm

cos_sim = lambda a, b: (a @ b.T) / (norm(a) * norm(b))
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True)  # trust_remote_code is needed to use the encode method
embeddings = model.encode(['How is the weather today?', '今天天气怎么样?'])
print(cos_sim(embeddings[0], embeddings[1]))
```
If you only want to handle shorter sequences, such as 2k, pass the `max_length` parameter to the `encode` function:

```python
embeddings = model.encode(
    ['Very long ... document'],
    max_length=2048
)
```
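If you use the plain transformers path from earlier instead of `encode`, the equivalent control is the tokenizer's truncation arguments. A sketch, reusing the `tokenizer` from the transformers snippet above (the 2048 value just mirrors the example):

```python
# Truncate long inputs at tokenization time rather than relying on encode().
encoded_input = tokenizer(
    ['Very long ... document'],
    padding=True,
    truncation=True,
    max_length=2048,
    return_tensors='pt',
)
```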
## Alternatives to Using Transformers Package

If you want to use the model together with the sentence-transformers package, make sure that you have installed the latest release and set `trust_remote_code=True` as well.
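A minimal sketch of that setup (the exact snippet is missing from this card, so treat it as an assumption; recent sentence-transformers releases accept the `trust_remote_code` argument):

```python
!pip install -U sentence-transformers

from sentence_transformers import SentenceTransformer

# trust_remote_code=True lets sentence-transformers load the model's custom encode logic.
model = SentenceTransformer('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True)
embeddings = model.encode(['How is the weather today?', '今天天气怎么样?'])
```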
## Use Jina Embeddings for RAG

According to the latest blog post from LlamaIndex:

> In summary, to achieve the peak performance in both hit rate and MRR, the combination of OpenAI or JinaAI-Base embeddings with the CohereRerank/bge-reranker-large reranker stands out.
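To make that pipeline concrete, here is a schematic retrieve-then-rerank sketch in plain Python. It is illustrative only, not LlamaIndex code; `model` is the embedding model loaded above, the documents are made up, and the reranker stage is left as a placeholder:

```python
import numpy as np
from numpy.linalg import norm

docs = ['Jina AI is based in Berlin.', 'The weather is sunny today.']
doc_emb = np.asarray(model.encode(docs))                  # embed the corpus once
query_emb = np.asarray(model.encode(['Where is Jina AI located?']))[0]

# Stage 1: embedding retrieval by cosine similarity.
scores = doc_emb @ query_emb / (norm(doc_emb, axis=1) * norm(query_emb))
candidates = [docs[i] for i in np.argsort(-scores)[:2]]

# Stage 2: pass the candidates to a reranker (e.g. bge-reranker-large) for final ordering.
print(candidates)
```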
## Contact

Join our Discord community and chat with other community members about ideas.
## Citation

If you find Jina Embeddings useful in your research, please cite the following paper:
```bibtex
@misc{günther2023jina,
      title={Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents},
      author={Michael Günther and Jackmin Ong and Isabelle Mohr and Alaeddine Abdessalem and Tanguy Abel and Mohammad Kalim Akram and Susana Guzman and Georgios Mastrapas and Saba Sturua and Bo Wang and Maximilian Werk and Nan Wang and Han Xiao},
      year={2023},
      eprint={2310.19923},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```