mxbai-embed-large-v1
The crispy sentence embedding family from mixedbread ai.

This is our base sentence embedding model. It was trained using AnglE loss on our high-quality, large-scale data. It achieves SOTA performance at BERT-large scale. Find out more in our blog post.
Quickstart
Here, we provide several ways to produce sentence embeddings. Please note that you have to provide the prompt "Represent this sentence for searching relevant passages: " for a query if you want to use it for retrieval. Besides that, you don't need any prompt.

sentence-transformers
python -m pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# 1. load model
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

# For retrieval you need to pass this prompt.
query = 'Represent this sentence for searching relevant passages: A man is eating a piece of bread'

docs = [
    query,
    "A man is eating food.",
    "A man is eating pasta.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
]

# 2. Encode
embeddings = model.encode(docs)

similarities = cos_sim(embeddings[0], embeddings[1:])
print('similarities:', similarities)
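For an end-to-end retrieval flow, the same pieces combine as follows. This is a minimal sketch under our own assumptions (a small in-memory corpus), using the standard sentence-transformers semantic_search utility: documents are encoded without a prompt, only the query gets the retrieval prompt, and the corpus is ranked by cosine similarity.

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import semantic_search

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

# Documents are encoded as-is, without any prompt.
corpus = [
    "A man is eating food.",
    "A man is eating pasta.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# Only the query gets the retrieval prompt.
prompt = 'Represent this sentence for searching relevant passages: '
query_embedding = model.encode(prompt + 'A man is eating a piece of bread', convert_to_tensor=True)

# Rank the corpus by cosine similarity and keep the two best hits.
hits = semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(corpus[hit['corpus_id']], hit['score'])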
Transformers
from typing import Dict

import torch
import numpy as np
from transformers import AutoModel, AutoTokenizer
from sentence_transformers.util import cos_sim

# For retrieval you need to pass this prompt. Please find out more in our blog post.
def transform_query(query: str) -> str:
    """ For retrieval, add the prompt for queries (not for documents).
    """
    return f'Represent this sentence for searching relevant passages: {query}'

# The model works really well with cls pooling (default) but also with mean pooling.
def pooling(outputs: torch.Tensor, inputs: Dict, strategy: str = 'cls') -> np.ndarray:
    if strategy == 'cls':
        outputs = outputs[:, 0]
    elif strategy == 'mean':
        # Average token embeddings per sequence, masking out padding tokens.
        outputs = torch.sum(
            outputs * inputs["attention_mask"][:, :, None], dim=1
        ) / torch.sum(inputs["attention_mask"], dim=1, keepdim=True)
    else:
        raise NotImplementedError
    return outputs.detach().cpu().numpy()

# 1. load model
model_id = 'mixedbread-ai/mxbai-embed-large-v1'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).cuda()

docs = [
    transform_query('A man is eating a piece of bread'),
    "A man is eating food.",
    "A man is eating pasta.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
]

# 2. encode
inputs = tokenizer(docs, padding=True, return_tensors='pt')
for k, v in inputs.items():
    inputs[k] = v.cuda()
outputs = model(**inputs).last_hidden_state
embeddings = pooling(outputs, inputs, 'cls')

similarities = cos_sim(embeddings[0], embeddings[1:])
print('similarities:', similarities)
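To switch to mean pooling, only the pooling step changes. A minimal sketch, reusing outputs, inputs, and the pooling function from the snippet above:

# Mean pooling averages the token embeddings, weighted by the attention
# mask so that padding tokens do not contribute to the sentence embedding.
embeddings = pooling(outputs, inputs, 'mean')

similarities = cos_sim(embeddings[0], embeddings[1:])
print('similarities (mean pooling):', similarities)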
Transformers.js
If you haven't already, you can install the Transformers.js JavaScript library from NPM using:

npm i @xenova/transformers

You can then use the model to compute embeddings like this:
import { pipeline, cos_sim } from '@xenova/transformers';

// Create a feature extraction pipeline
const extractor = await pipeline('feature-extraction', 'mixedbread-ai/mxbai-embed-large-v1', {
    quantized: false, // Comment out this line to use the quantized version
});

// Generate sentence embeddings
const docs = [
    'Represent this sentence for searching relevant passages: A man is eating a piece of bread',
    'A man is eating food.',
    'A man is eating pasta.',
    'The girl is carrying a baby.',
    'A man is riding a horse.',
]
const output = await extractor(docs, { pooling: 'cls' });

// Compute similarity scores
const [source_embeddings, ...document_embeddings] = output.tolist();
const similarities = document_embeddings.map(x => cos_sim(source_embeddings, x));
console.log(similarities); // [0.7919578577247139, 0.6369278664248345, 0.16512018371357193, 0.3620778366720027]
Using API
You'll be able to use the models through our API as well. The API is coming soon and will have some exciting features. Stay tuned!
Evaluation
As of March 2024, our model achieves SOTA performance for BERT-large sized models on the MTEB. It outperforms commercial models like OpenAI's text-embedding-3-large and matches the performance of models 20x its size, like echo-mistral-7b. Our model was trained with no overlap of the MTEB data, which indicates that it generalizes well across several domains, tasks, and text lengths. We know there are some limitations with this model, which will be fixed in v2. Please find more information in our blog post.
| Model | Avg (56 datasets) | Classification (12 datasets) | Clustering (11 datasets) | PairClassification (3 datasets) | Reranking (4 datasets) | Retrieval (15 datasets) | STS (10 datasets) | Summarization (1 dataset) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mxbai-embed-large-v1 | 64.68 | 75.64 | 46.71 | 87.2 | 60.11 | 54.39 | 85.00 | 32.71 |
| bge-large-en-v1.5 | 64.23 | 75.97 | 46.08 | 87.12 | 60.03 | 54.29 | 83.11 | 31.61 |
| mxbai-embed-2d-large-v1 | 63.25 | 74.14 | 46.07 | 85.89 | 58.94 | 51.42 | 84.9 | 31.55 |
| nomic-embed-text-v1 | 62.39 | 74.12 | 43.91 | 85.15 | 55.69 | 52.81 | 82.06 | 30.08 |
| jina-embeddings-v2-base-en | 60.38 | 73.45 | 41.73 | 85.38 | 56.98 | 47.87 | 80.7 | 31.6 |
| Proprietary Models | | | | | | | | |
| OpenAI text-embedding-3-large | 64.58 | 75.45 | 49.01 | 85.72 | 59.16 | 55.44 | 81.73 | 29.92 |
| Cohere embed-english-v3.0 | 64.47 | 76.49 | 47.43 | 85.84 | 58.01 | 55.00 | 82.62 | 30.18 |
| OpenAI text-embedding-ada-002 | 60.99 | 70.93 | 45.90 | 84.89 | 56.32 | 49.25 | 80.97 | 30.80 |
Community
Please join our Discord Community and share your feedback and thoughts! We are here to help and also always happy to chat.
License
Apache 2.0
Citation
@online{emb2024mxbai,
  title={Open Source Strikes Bread - New Fluffy Embeddings Model},
  author={Sean Lee and Aamir Shakir and Darius Koenig and Julius Lipp},
  year={2024},
  url={https://www.mixedbread.ai/blog/mxbai-embed-large-v1},
}

@article{li2023angle,
  title={AnglE-optimized Text Embeddings},
  author={Li, Xianming and Li, Jing},
  journal={arXiv preprint arXiv:2309.12871},
  year={2023}
}