Jina AI Text Embedding Model v2-base-en

Anonymous user, July 31, 2024

Technical Information

Open-source repository
https://modelscope.cn/models/jinaai/jina-embeddings-v2-base-en
License
Apache License 2.0

Model Details



Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance tuning for neural search applications.

The text embedding set trained by Jina AI.

Intended Usage & Model Info

jina-embeddings-v2-base-en is an English, monolingual embedding model supporting an 8192-token sequence length. It is based on a BERT architecture (JinaBERT) that supports the symmetric bidirectional variant of ALiBi to allow longer sequence lengths. The backbone jina-bert-v2-base-en is pretrained on the C4 dataset. The model is further trained on Jina AI's collection of more than 400 million sentence pairs and hard negatives. These pairs were obtained from various domains and were carefully selected through a thorough cleaning process.

The embedding model was trained with a sequence length of 512, but extrapolates to an 8k sequence length (or even longer) thanks to ALiBi. This makes our model useful for a range of use cases, especially when processing long documents is needed, including long document retrieval, semantic textual similarity, text reranking, recommendation, RAG and LLM-based generative search, etc.
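The symmetric bidirectional ALiBi variant mentioned above penalizes attention scores by the distance between token positions, which is why the bias formula works unchanged at any sequence length. Below is a minimal NumPy sketch; `symmetric_alibi_bias` and the geometric head-slope schedule (taken from the original ALiBi paper) are illustrative assumptions, not the model's actual implementation.

```python
import numpy as np

def symmetric_alibi_bias(seq_len: int, num_heads: int) -> np.ndarray:
    """Illustrative symmetric (bidirectional) ALiBi attention bias.

    Returns an array of shape (num_heads, seq_len, seq_len) that would be
    added to the attention logits before softmax.
    """
    # Geometric per-head slopes, as in the ALiBi paper: 2^(-8(h+1)/num_heads).
    slopes = np.array([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    # Symmetric distance penalty -slope * |i - j|; there is no causal mask,
    # so every token attends in both directions.
    positions = np.arange(seq_len)
    distance = np.abs(positions[:, None] - positions[None, :])
    return -slopes[:, None, None] * distance[None, :, :]

# Because the bias depends only on relative distance, the same formula
# applies for any seq_len, which is what enables length extrapolation.
bias = symmetric_alibi_bias(seq_len=8, num_heads=4)
```

Note the matrix is symmetric in its last two axes and zero on the diagonal: attention to nearby tokens is penalized less than attention to distant ones.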

With a standard size of 137 million parameters, the model enables fast inference while delivering better performance than our small model. It is recommended to use a single GPU for inference. Additionally, we provide the following embedding models:

  • jina-embeddings-v2-small-en: 33 million parameters.
  • jina-embeddings-v2-base-en: 137 million parameters (you are here).
  • [jina-embeddings-v2-base-zh](): Chinese-English bilingual embeddings (soon).
  • [jina-embeddings-v2-base-de](): German-English bilingual embeddings (soon).
  • [jina-embeddings-v2-base-es](): Spanish-English bilingual embeddings (soon).

Data & Parameters

Jina Embeddings V2 technical report

Usage

Please apply mean pooling when integrating the model.

### Why mean pooling?

Mean pooling takes all token embeddings from the model output and averages them at the sentence/paragraph level. It has proven to be the most effective way to produce high-quality sentence embeddings. We offer an `encode` function to handle this. However, if you would like to do it without using the default `encode` function:

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def mean_pooling(model_output, attention_mask):
    # Average the token embeddings, ignoring padding positions.
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

sentences = ['How is the weather today?', 'What is the current weather like today?']

tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-small-en')
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-small-en', trust_remote_code=True)

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded_input)

embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)
```
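Since the embeddings above are L2-normalized, cosine similarity between two sentences reduces to a plain dot product. A toy stand-alone check (with made-up vectors, not real model outputs):

```python
import numpy as np

# Toy stand-ins for two sentence embeddings; real ones come from the model.
a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

# L2-normalize, mirroring the F.normalize(embeddings, p=2, dim=1) step.
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)

# On unit vectors, the dot product *is* the cosine similarity.
cos = float(a_n @ b_n)  # 0.96
```

This is why normalized embeddings are convenient for vector databases: similarity search can use fast inner-product indexes directly.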

You can use Jina Embedding models directly from the modelscope package:

```python
!pip install modelscope
from modelscope import AutoModel
from numpy.linalg import norm
from modelscope.hub.api import HubApi

api = HubApi()
api.login('<your-modelscope-access-token>')  # replace with your own ModelScope token

cos_sim = lambda a, b: (a @ b.T) / (norm(a) * norm(b))
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)  # trust_remote_code is needed to use the encode method
embeddings = model.encode(['How is the weather today?', 'What is the current weather like today?'])
print(cos_sim(embeddings[0], embeddings[1]))
```

If you only want to handle shorter sequences, such as 2k, pass the max_length parameter to the encode function:

```python
embeddings = model.encode(
    ['Very long ... document'],
    max_length=2048
)
```
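For documents that exceed even the extended context, a common pattern is to split the text into chunks, embed each chunk, and average the chunk vectors. This is not part of the model card's API; `chunk_text`, `embed_document`, and the stub embedder below are hypothetical illustrations.

```python
import numpy as np

def chunk_text(text: str, max_words: int) -> list:
    # Naive whitespace chunking; a real pipeline would chunk by tokens.
    words = text.split()
    return [' '.join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def embed_document(text: str, embed_fn, max_words: int = 512) -> np.ndarray:
    # Average the chunk embeddings, then re-normalize so that any
    # document length still maps to a single unit vector.
    vecs = np.stack([embed_fn(chunk) for chunk in chunk_text(text, max_words)])
    mean = vecs.mean(axis=0)
    return mean / np.linalg.norm(mean)

# Stub embedder for illustration only; in practice pass model.encode.
stub = lambda chunk: np.array([float(len(chunk.split())), 1.0])
doc = ' '.join(['word'] * 10)
vec = embed_document(doc, stub, max_words=4)
```

Averaging loses within-document locality, so retrieval pipelines often index chunk embeddings individually instead; the sketch only shows the simplest whole-document aggregation.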

Fully-managed Embeddings Service

Alternatively, you can use Jina AI's Embedding platform for fully-managed access to Jina Embeddings models.

Use Jina Embeddings for RAG

According to the latest blog post from LlamaIndex,

"In summary, to achieve the peak performance in both hit rate and MRR, the combination of OpenAI or JinaAI-Base embeddings with the CohereRerank/bge-reranker-large reranker stands out."

Plans

  1. Bilingual embedding models supporting more European & Asian languages, including Spanish, French, Italian and Japanese.
  2. Multimodal embedding models enabling multimodal RAG applications.
  3. High-performance rerankers.

Contact

Join our Discord community and chat with other community members about ideas.

Citation

If you find Jina Embeddings useful in your research, please cite the following paper:

@misc{günther2023jina,
      title={Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents},
      author={Michael Günther and Jackmin Ong and Isabelle Mohr and Alaeddine Abdessalem and Tanguy Abel and Mohammad Kalim Akram and Susana Guzman and Georgios Mastrapas and Saba Sturua and Bo Wang and Maximilian Werk and Nan Wang and Han Xiao},
      year={2023},
      eprint={2310.19923},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

