mxbai-embed-large-v1

我要开发同款
匿名用户2024年07月31日
51阅读
所属分类ai、bert、Pytorch、transformers、transformers.js、mteb
开源地址https://modelscope.cn/models/mirror013/mxbai-embed-large-v1
授权协议apache-2.0

作品详情



The crispy sentence embedding family from mixedbread ai.

mxbai-embed-large-v1

This is our base sentence embedding model. It was trained using AnglE loss on our high-quality large scale data. It achieves SOTA performance on BERT-large scale. Find out more in our blog post.

Quickstart

Here, we provide several ways to produce sentence embeddings. Please note that you have to provide the prompt Represent this sentence for searching relevant passages: for query if you want to use it for retrieval. Besides that you don't need any prompt.

sentence-transformers

python -m pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# 1. load model
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

# For retrieval you need to pass this prompt.
query = 'Represent this sentence for searching relevant passages: A man is eating a piece of bread'

docs = [
    query,
    "A man is eating food.",
    "A man is eating pasta.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
]

# 2. Encode
embeddings = model.encode(docs)

similarities = cos_sim(embeddings[0], embeddings[1:])
print('similarities:', similarities)

Transformers

from typing import Dict

import torch
import numpy as np
from transformers import AutoModel, AutoTokenizer
from sentence_transformers.util import cos_sim

# For retrieval you need to pass this prompt. Please find our more in our blog post.
def transform_query(query: str) -> str:
    """ For retrieval, add the prompt for query (not for documents).
    """
    return f'Represent this sentence for searching relevant passages: {query}'

# The model works really well with cls pooling (default) but also with mean poolin.
def pooling(outputs: torch.Tensor, inputs: Dict,  strategy: str = 'cls') -> np.ndarray:
    if strategy == 'cls':
        outputs = outputs[:, 0]
    elif strategy == 'mean':
        outputs = torch.sum(
            outputs * inputs["attention_mask"][:, :, None], dim=1) / torch.sum(inputs["attention_mask"])
    else:
        raise NotImplementedError
    return outputs.detach().cpu().numpy()

# 1. load model
model_id = 'mixedbread-ai/mxbai-embed-large-v1'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).cuda()


docs = [
    transform_query('A man is eating a piece of bread'),
    "A man is eating food.",
    "A man is eating pasta.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
]

# 2. encode
inputs = tokenizer(docs, padding=True, return_tensors='pt')
for k, v in inputs.items():
    inputs[k] = v.cuda()
outputs = model(**inputs).last_hidden_state
embeddings = pooling(outputs, inputs, 'cls')

similarities = cos_sim(embeddings[0], embeddings[1:])
print('similarities:', similarities)

Transformers.js

If you haven't already, you can install the Transformers.js JavaScript library from NPM using:

npm i @xenova/transformers

You can then use the model to compute embeddings like this:

import { pipeline, cos_sim } from '@xenova/transformers';

// Create a feature extraction pipeline
const extractor = await pipeline('feature-extraction', 'mixedbread-ai/mxbai-embed-large-v1', {
    quantized: false, // Comment out this line to use the quantized version
});

// Generate sentence embeddings
const docs = [
    'Represent this sentence for searching relevant passages: A man is eating a piece of bread',
    'A man is eating food.',
    'A man is eating pasta.',
    'The girl is carrying a baby.',
    'A man is riding a horse.',
]
const output = await extractor(docs, { pooling: 'cls' });

// Compute similarity scores
const [source_embeddings, ...document_embeddings ] = output.tolist();
const similarities = document_embeddings.map(x => cos_sim(source_embeddings, x));
console.log(similarities); // [0.7919578577247139, 0.6369278664248345, 0.16512018371357193, 0.3620778366720027]

Using API

You’ll be able to use the models through our API as well. The API is coming soon and will have some exciting features. Stay tuned!

Evaluation

As of March 2024, our model archives SOTA performance for Bert-large sized models on the MTEB. It ourperforms commercial models like OpenAIs text-embedding-3-large and matches the performance of model 20x it's size like the echo-mistral-7b. Our model was trained with no overlap of the MTEB data, which indicates that our model generalizes well across several domains, tasks and text length. We know there are some limitations with this model, which will be fixed in v2.

Model Avg (56 datasets) Classification (12 datasets) Clustering (11 datasets) PairClassification (3 datasets) Reranking (4 datasets) Retrieval (15 datasets) STS (10 datasets) Summarization (1 dataset)
mxbai-embed-large-v1 64.68 75.64 46.71 87.2 60.11 54.39 85.00 32.71
bge-large-en-v1.5 64.23 75.97 46.08 87.12 60.03 54.29 83.11 31.61
mxbai-embed-2d-large-v1 63.25 74.14 46.07 85.89 58.94 51.42 84.9 31.55
nomic-embed-text-v1 62.39 74.12 43.91 85.15 55.69 52.81 82.06 30.08
jina-embeddings-v2-base-en 60.38 73.45 41.73 85.38 56.98 47.87 80.7 31.6
Proprietary Models
OpenAI text-embedding-3-large 64.58 75.45 49.01 85.72 59.16 55.44 81.73 29.92
Cohere embed-english-v3.0 64.47 76.49 47.43 85.84 58.01 55.00 82.62 30.18
OpenAI text-embedding-ada-002 60.99 70.93 45.90 84.89 56.32 49.25 80.97 30.80

Please find more information in our blog post.

Community

Please join our Discord Community and share your feedback and thoughts! We are here to help and also always happy to chat.

License

Apache 2.0

Citation

@online{emb2024mxbai,
  title={Open Source Strikes Bread - New Fluffy Embeddings Model},
  author={Sean Lee, Aamir Shakir, Darius Koenig, Julius Lipp},
  year={2024},
  url={https://www.mixedbread.ai/blog/mxbai-embed-large-v1},
}

@article{li2023angle,
  title={AnglE-optimized Text Embeddings},
  author={Li, Xianming and Li, Jing},
  journal={arXiv preprint arXiv:2309.12871},
  year={2023}
}
声明:本文仅代表作者观点,不代表本站立场。如果侵犯到您的合法权益,请联系我们删除侵权资源!如果遇到资源链接失效,请您通过评论或工单的方式通知管理员。未经允许,不得转载,本站所有资源文章禁止商业使用运营!
下载安装【程序员客栈】APP
实时对接需求、及时收发消息、丰富的开放项目需求、随时随地查看项目状态

评论