mxbai-embed-large-v1

This is our base sentence embedding model. It was trained using AnglE loss on our high-quality, large-scale data. It achieves SOTA performance at BERT-large scale. Find out more in our blog post.

Quickstart

Here, we provide several ways to produce sentence embeddings. Please note that if you want to use the model for retrieval, you have to prepend the prompt "Represent this sentence for searching relevant passages: " to the query. Apart from that, you don't need any prompt.

sentence-transformers

python -m pip install -U sentence-transformers

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# 1. load model
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

# For retrieval you need to pass this prompt.
query = 'Represent this sentence for searching relevant passages: A man is eating a piece of bread'

docs = [
    query,
    "A man is eating food.",
    "A man is eating pasta.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
]

# 2. Encode
embeddings = model.encode(docs)

# Compare the query (first embedding) against every document
similarities = cos_sim(embeddings[0], embeddings[1:])
print('similarities:', similarities)
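
For intuition about the last step: cosine similarity is the dot product of two vectors divided by the product of their norms. The sketch below reimplements it in plain Python on toy 3-dimensional vectors and ranks the documents by score. The helper names cosine_similarity and rank_documents and the toy vectors are illustrative assumptions, not part of sentence-transformers; with real embeddings the library's cos_sim does the same computation on tensors.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(
    query_vec: Sequence[float], doc_vecs: Sequence[Sequence[float]]
) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real embeddings (hypothetical values).
query_vec = [1.0, 0.0, 1.0]
doc_vecs = [
    [1.0, 0.0, 1.0],  # same direction as the query
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [1.0, 1.0, 0.0],  # partially aligned with the query
]
ranking = rank_documents(query_vec, doc_vecs)
print(ranking)
```

The document pointing in the same direction as the query ranks first, the partially aligned one second, and the orthogonal one last.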

Transformers

from typing import Dict

import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from sentence_transformers.util import cos_sim

# For retrieval you need to pass this prompt. Please find out more in our blog post.
def transform_query(query: str) -> str:
    """ For retrieval, add the prompt for query (not for documents).
    """
    return f'Represent this sentence for searching relevant passages: {query}'

# The model works really well with cls pooling (default) but also with mean pooling.
def pooling(outputs: torch.Tensor, inputs: Dict, strategy: str = 'cls') -> np.ndarray:
    if strategy == 'cls':
        # Use the hidden state of the first ([CLS]) token
        outputs = outputs[:, 0]
    elif strategy == 'mean':
        # Average over real tokens only: divide each sequence's masked sum
        # by its own number of non-padding tokens.
        mask = inputs["attention_mask"][:, :, None]
        outputs = torch.sum(outputs * mask, dim=1) / torch.sum(mask, dim=1)
    else:
        raise NotImplementedError
    return outputs.detach().cpu().numpy()

# 1. load model (.cuda() requires a CUDA-capable GPU)
model_id = 'mixedbread-ai/mxbai-embed-large-v1'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).cuda()


docs = [
    transform_query('A man is eating a piece of bread'),
    "A man is eating food.",
    "A man is eating pasta.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
]

# 2. encode
inputs = tokenizer(docs, padding=True, return_tensors='pt')
for k, v in inputs.items():
    inputs[k] = v.cuda()
outputs = model(**inputs).last_hidden_state
embeddings = pooling(outputs, inputs, 'cls')

similarities = cos_sim(embeddings[0], embeddings[1:])
print('similarities:', similarities)
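
Mean pooling, the alternative to CLS pooling mentioned above, averages the token vectors of each sequence while ignoring padding positions. The following is a minimal sketch on nested Python lists rather than tensors; the helper name masked_mean_pool and the toy values are assumptions made for illustration. It shows that each sequence's masked sum is divided by that sequence's own count of real tokens.

```python
from typing import List


def masked_mean_pool(
    token_vectors: List[List[List[float]]],
    attention_mask: List[List[int]],
) -> List[List[float]]:
    """Average token vectors per sequence, skipping positions where mask == 0."""
    pooled = []
    for vectors, mask in zip(token_vectors, attention_mask):
        dim = len(vectors[0])
        n_real = sum(mask)  # number of non-padding tokens in this sequence
        summed = [0.0] * dim
        for vec, keep in zip(vectors, mask):
            if keep:
                for j in range(dim):
                    summed[j] += vec[j]
        pooled.append([s / n_real for s in summed])
    return pooled


# Two sequences of length 3 with 2-dimensional token vectors (toy values).
token_vectors = [
    [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],  # last position is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],  # no padding
]
attention_mask = [
    [1, 1, 0],
    [1, 1, 1],
]
pooled = masked_mean_pool(token_vectors, attention_mask)
print(pooled)  # [[2.0, 3.0], [4.0, 2.0]]
```

The padded position in the first sequence contributes nothing to the result, and the two sequences are divided by 2 and 3 respectively.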

Transformers.js

If you haven't already, you can install the Transformers.js JavaScript library from NPM using:

npm i @xenova/transformers

You can then use the model to compute embeddings like this:

import { pipeline, cos_sim } from '@xenova/transformers';

// Create a feature extraction pipeline
const extractor = await pipeline('feature-extraction', 'mixedbread-ai/mxbai-embed-large-v1', {
    quantized: false, // Comment out this line to use the quantized version
});

// Generate sentence embeddings
const docs = [
    'Represent this sentence for searching relevant passages: A man is eating a piece of bread',
    'A man is eating food.',
    'A man is eating pasta.',
    'The girl is carrying a baby.',
    'A man is riding a horse.',
]
const output = await extractor(docs, { pooling: 'cls' });

// Compute similarity scores
const [source_embeddings, ...document_embeddings ] = output.tolist();
const similarities = document_embeddings.map(x => cos_sim(source_embeddings, x));
console.log(similarities); // [0.7919578577247139, 0.6369278664248345, 0.16512018371357193, 0.3620778366720027]

Using API

You’ll be able to use the models through our API as well. The API is coming soon and will have some exciting features. Stay tuned!

Evaluation

As of March 2024, our model achieves SOTA performance for BERT-large sized models on the MTEB. It outperforms commercial models like OpenAI's text-embedding-3-large and matches the performance of models 20x its size, like the echo-mistral-7b. Our model was trained with no overlap of the MTEB data, which indicates that our model generalizes well across several domains, tasks and text lengths. We know there are some limitations with this model, which will be fixed in v2.

Model	Avg (56 datasets)	Classification (12 datasets)	Clustering (11 datasets)	PairClassification (3 datasets)	Reranking (4 datasets)	Retrieval (15 datasets)	STS (10 datasets)	Summarization (1 dataset)
mxbai-embed-large-v1	64.68	75.64	46.71	87.2	60.11	54.39	85.00	32.71
bge-large-en-v1.5	64.23	75.97	46.08	87.12	60.03	54.29	83.11	31.61
mxbai-embed-2d-large-v1	63.25	74.14	46.07	85.89	58.94	51.42	84.9	31.55
nomic-embed-text-v1	62.39	74.12	43.91	85.15	55.69	52.81	82.06	30.08
jina-embeddings-v2-base-en	60.38	73.45	41.73	85.38	56.98	47.87	80.7	31.6
Proprietary Models
OpenAI text-embedding-3-large	64.58	75.45	49.01	85.72	59.16	55.44	81.73	29.92
Cohere embed-english-v3.0	64.47	76.49	47.43	85.84	58.01	55.00	82.62	30.18
OpenAI text-embedding-ada-002	60.99	70.93	45.90	84.89	56.32	49.25	80.97	30.80
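
The Avg column is consistent with a mean over all 56 datasets, which is the same as weighting each category score by its dataset count. The short check below reproduces the mxbai-embed-large-v1 average from its row. The variable names are illustrative, and reading Avg as a dataset-weighted mean is an inference from the numbers in the table, not something stated in the text.

```python
# (dataset count, score) per task category for mxbai-embed-large-v1,
# copied from the table above.
category_scores = {
    "Classification": (12, 75.64),
    "Clustering": (11, 46.71),
    "PairClassification": (3, 87.2),
    "Reranking": (4, 60.11),
    "Retrieval": (15, 54.39),
    "STS": (10, 85.00),
    "Summarization": (1, 32.71),
}

# Weight each category score by how many datasets it covers.
total_datasets = sum(count for count, _ in category_scores.values())
weighted_avg = (
    sum(count * score for count, score in category_scores.values()) / total_datasets
)
print(total_datasets, round(weighted_avg, 2))  # 56 64.68
```

A plain mean of the seven category scores would give a different number, because categories such as Summarization (1 dataset) would count as much as Retrieval (15 datasets).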

Please find more information in our blog post.

Community

Please join our Discord Community and share your feedback and thoughts! We are here to help and also always happy to chat.

License

Apache 2.0

Citation

@online{emb2024mxbai,
  title={Open Source Strikes Bread - New Fluffy Embeddings Model},
  author={Sean Lee and Aamir Shakir and Darius Koenig and Julius Lipp},
  year={2024},
  url={https://www.mixedbread.ai/blog/mxbai-embed-large-v1},
}

@article{li2023angle,
  title={AnglE-optimized Text Embeddings},
  author={Li, Xianming and Li, Jing},
  journal={arXiv preprint arXiv:2309.12871},
  year={2023}
}
