Jina AI Text Embedding Model v2-base-en

Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance-tuning for neural search applications.

The text embedding set trained by Jina AI.

Intended Usage & Model Info

jina-embeddings-v2-base-en is an English, monolingual embedding model supporting a sequence length of 8192. It is based on a BERT architecture (JinaBERT) that supports the symmetric bidirectional variant of ALiBi to allow longer sequence lengths. The backbone jina-bert-v2-base-en is pretrained on the C4 dataset. The model is further trained on Jina AI's collection of more than 400 million sentence pairs and hard negatives. These pairs were obtained from various domains and were carefully selected through a thorough cleaning process.

The embedding model was trained with a sequence length of 512, but extrapolates to a sequence length of 8k (or even longer) thanks to ALiBi. This makes our model useful for a range of use cases, especially when long documents need to be processed, including long document retrieval, semantic textual similarity, text reranking, recommendation, RAG and LLM-based generative search.
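
The length extrapolation mentioned above comes from how ALiBi encodes position: instead of learned position embeddings, it adds a distance-dependent penalty to the attention logits. The following is a minimal sketch of the symmetric (bidirectional) form, assuming the geometric slope schedule from the original ALiBi paper and a power-of-two head count. The function names `alibi_slopes` and `symmetric_alibi_bias` are illustrative, and the actual JinaBERT implementation may differ in its details.

```python
def alibi_slopes(num_heads):
    """Per-head slopes: a geometric sequence starting at 2^(-8/num_heads).

    Assumes num_heads is a power of two, as in the original ALiBi schedule.
    """
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]


def symmetric_alibi_bias(seq_len, slope):
    """Bias added to the attention logits: -slope * |i - j|.

    Using the absolute distance makes the penalty identical in both
    directions, which suits a bidirectional encoder.
    """
    return [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]


slopes = alibi_slopes(8)                    # 0.5, 0.25, ..., 2^-8
bias = symmetric_alibi_bias(4, slopes[0])   # 4x4 matrix for the first head
```

Because the bias depends only on the distance between two tokens, the same rule applies unchanged to sequences longer than those seen during training.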

With a standard size of 137 million parameters, the model enables fast inference while delivering better performance than our small model. It is recommended to use a single GPU for inference. Additionally, we provide the following embedding models:

jina-embeddings-v2-small-en: 33 million parameters.
jina-embeddings-v2-base-en: 137 million parameters (you are here).
jina-embeddings-v2-base-zh: Chinese-English bilingual embeddings (soon).
jina-embeddings-v2-base-de: German-English bilingual embeddings (soon).
jina-embeddings-v2-base-es: Spanish-English bilingual embeddings (soon).

Data & Parameters

Jina Embeddings V2 technical report

Usage

Please apply mean pooling when integrating the model.

### Why mean pooling?

`mean pooling` takes all token embeddings from the model output and averages them at the sentence/paragraph level. It has been proven to be the most effective way to produce high-quality sentence embeddings. We offer an `encode` function that handles this for you. However, if you would like to do it without using the default `encode` function:

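import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def mean_pooling(model_output, attention_mask):
    # The first element of the model output holds the per-token embeddings
    token_embeddings = model_output[0]
    # Broadcast the attention mask over the embedding dimension so padding tokens contribute zero
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    # Sum the unmasked token embeddings and divide by the number of real tokens
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

sentences = ['How is the weather today?', 'What is the current weather like today?']

tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-small-en')
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-small-en', trust_remote_code=True)

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Inference only, so gradient tracking is disabled
with torch.no_grad():
    model_output = model(**encoded_input)

embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
# L2-normalize so that a dot product between embeddings equals their cosine similarity
embeddings = F.normalize(embeddings, p=2, dim=1)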
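
To make the effect of the attention mask concrete, here is a small dependency-free sketch of the same masked average on toy data. The helper `masked_mean` and the toy vectors are illustrative only and are not part of the model's API.

```python
def masked_mean(token_vectors, mask):
    """Average the token vectors whose mask entry is 1; padding (0) is ignored."""
    dim = len(token_vectors[0])
    kept = max(sum(mask), 1e-9)  # guard against division by zero, like the clamp above
    return [
        sum(vec[d] * m for vec, m in zip(token_vectors, mask)) / kept
        for d in range(dim)
    ]


# Two real tokens followed by one padding token
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
pooled = masked_mean(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

The padding vector `[9.0, 9.0]` has no influence on the result, which is why the mask matters when sentences in a batch have different lengths.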

You can use Jina Embedding models directly from the modelscope package:

!pip install modelscope
from modelscope import AutoModel
from numpy.linalg import norm
from modelscope.hub.api import HubApi

# Log in to the ModelScope hub with an access token
api = HubApi()
api.login('3a1d14c4-ebc9-4e13-996d-5b5bc152f287')

# Cosine similarity between two 1-D embedding vectors
cos_sim = lambda a, b: (a @ b.T) / (norm(a) * norm(b))
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)  # trust_remote_code is needed to use the encode method
embeddings = model.encode(['How is the weather today?', 'What is the current weather like today?'])
print(cos_sim(embeddings[0], embeddings[1]))
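
The same cosine similarity can be used to rank candidate documents against a query, which is the core step in retrieval. The sketch below uses hand-written toy vectors in place of real model embeddings, and the helper names `cosine_similarity` and `rank_by_similarity` are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Return document indices ordered from most to least similar to the query."""
    scores = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    return [i for _, i in sorted(scores, reverse=True)]


query = [1.0, 0.0]
docs = [[0.0, 1.0], [2.0, 0.1], [1.0, 1.0]]
order = rank_by_similarity(query, docs)
print(order)  # [1, 2, 0]
```

With real embeddings, `query` and `docs` would come from the model's `encode` output instead of being written by hand.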

If you only want to handle shorter sequences, such as 2k, pass the max_length parameter to the encode function:

embeddings = model.encode(
    ['Very long ... document'],
    max_length=2048
)

Fully-managed Embeddings Service

Alternatively, you can use Jina AI's Embedding platform for fully-managed access to Jina Embeddings models.

Use Jina Embeddings for RAG

According to the latest blog post from LlamaIndex,

In summary, to achieve the peak performance in both hit rate and MRR, the combination of OpenAI or JinaAI-Base embeddings with the CohereRerank/bge-reranker-large reranker stands out.
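
For readers unfamiliar with the two metrics in the quote, the following is a minimal sketch of how hit rate and MRR (mean reciprocal rank) are commonly computed. It assumes exactly one relevant document per query, and the document IDs are made up for illustration. The evaluation setup used in the blog post itself is not reproduced here.

```python
def hit_rate(ranked_lists, relevant_ids, k):
    """Fraction of queries whose relevant document appears in the top k results."""
    hits = sum(1 for ranked, rel in zip(ranked_lists, relevant_ids) if rel in ranked[:k])
    return hits / len(ranked_lists)


def mean_reciprocal_rank(ranked_lists, relevant_ids):
    """Average of 1/rank of the relevant document; a miss contributes 0."""
    total = 0.0
    for ranked, rel in zip(ranked_lists, relevant_ids):
        if rel in ranked:
            total += 1.0 / (ranked.index(rel) + 1)
    return total / len(ranked_lists)


# Three queries: relevant document at rank 1, at rank 2, and not retrieved
ranked = [["d1", "d2", "d3"], ["d4", "d5", "d6"], ["d7", "d8", "d9"]]
relevant = ["d1", "d5", "dX"]

print(hit_rate(ranked, relevant, k=2))         # 2 of 3 queries hit
print(mean_reciprocal_rank(ranked, relevant))  # (1 + 0.5 + 0) / 3 = 0.5
```

Hit rate only checks whether the relevant document was retrieved, while MRR also rewards placing it near the top, which is why a reranker can improve MRR even when hit rate stays the same.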

Plans

Bilingual embedding models supporting more European & Asian languages, including Spanish, French, Italian and Japanese.
Multimodal embedding models to enable Multimodal RAG applications.
High-performance rerankers.

Contact

Join our Discord community and chat with other community members about ideas.

Citation

If you find Jina Embeddings useful in your research, please cite the following paper:

@misc{günther2023jina,
      title={Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents},
      author={Michael Günther and Jackmin Ong and Isabelle Mohr and Alaeddine Abdessalem and Tanguy Abel and Mohammad Kalim Akram and Susana Guzman and Georgios Mastrapas and Saba Sturua and Bo Wang and Maximilian Werk and Nan Wang and Han Xiao},
      year={2023},
      eprint={2310.19923},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
