Posted by an anonymous user on 2024-07-31
Categories: ai, bert, Pytorch, Sentence Transformer, sentence-transformer, sentence-similarity, mteb
Source: https://modelscope.cn/models/AI-ModelScope/gte-base-zh
License: MIT


gte-base-zh

General Text Embeddings (GTE) model, introduced in the paper "Towards General Text Embeddings with Multi-stage Contrastive Learning".

The GTE models are trained by Alibaba DAMO Academy. They are mainly based on the BERT framework and are currently available in several sizes for both Chinese and English. The models are trained on a large-scale corpus of relevant text pairs covering a wide range of domains and scenarios, so they can be applied to various downstream text embedding tasks, including information retrieval, semantic textual similarity, and text reranking.

Model List

| Models | Language | Max Sequence Length | Dimension | Model Size |
|---|---|---|---|---|
| GTE-large-zh | Chinese | 512 | 1024 | 0.67GB |
| GTE-base-zh | Chinese | 512 | 768 | 0.20GB |
| GTE-small-zh | Chinese | 512 | 512 | 0.10GB |
| GTE-large | English | 512 | 1024 | 0.67GB |
| GTE-base | English | 512 | 768 | 0.22GB |
| GTE-small | English | 512 | 384 | 0.07GB |

Metrics

We compared the GTE models with other popular text embedding models on the MTEB benchmark (CMTEB for Chinese). For more detailed comparison results, see the MTEB leaderboard.

  • Evaluation results on CMTEB
| Model | Model Size (GB) | Embedding Dimensions | Sequence Length | Average (35 datasets) | Classification (9 datasets) | Clustering (4 datasets) | Pair Classification (2 datasets) | Reranking (4 datasets) | Retrieval (8 datasets) | STS (8 datasets) |
|---|---|---|---|---|---|---|---|---|---|---|
| gte-large-zh | 0.65 | 1024 | 512 | 66.72 | 71.34 | 53.07 | 81.14 | 67.42 | 72.49 | 57.82 |
| gte-base-zh | 0.20 | 768 | 512 | 65.92 | 71.26 | 53.86 | 80.44 | 67.00 | 71.71 | 55.96 |
| stella-large-zh-v2 | 0.65 | 1024 | 1024 | 65.13 | 69.05 | 49.16 | 82.68 | 66.41 | 70.14 | 58.66 |
| stella-large-zh | 0.65 | 1024 | 1024 | 64.54 | 67.62 | 48.65 | 78.72 | 65.98 | 71.02 | 58.3 |
| bge-large-zh-v1.5 | 1.3 | 1024 | 512 | 64.53 | 69.13 | 48.99 | 81.6 | 65.84 | 70.46 | 56.25 |
| stella-base-zh-v2 | 0.21 | 768 | 1024 | 64.36 | 68.29 | 49.4 | 79.96 | 66.1 | 70.08 | 56.92 |
| stella-base-zh | 0.21 | 768 | 1024 | 64.16 | 67.77 | 48.7 | 76.09 | 66.95 | 71.07 | 56.54 |
| piccolo-large-zh | 0.65 | 1024 | 512 | 64.11 | 67.03 | 47.04 | 78.38 | 65.98 | 70.93 | 58.02 |
| piccolo-base-zh | 0.2 | 768 | 512 | 63.66 | 66.98 | 47.12 | 76.61 | 66.68 | 71.2 | 55.9 |
| gte-small-zh | 0.1 | 512 | 512 | 60.08 | 64.49 | 48.95 | 69.99 | 66.21 | 65.50 | 49.72 |
| bge-small-zh-v1.5 | 0.1 | 512 | 512 | 57.82 | 63.96 | 44.18 | 70.4 | 60.92 | 61.77 | 49.1 |
| m3e-base | 0.41 | 768 | 512 | 57.79 | 67.52 | 47.68 | 63.99 | 59.54 | 56.91 | 50.47 |
| text-embedding-ada-002(openai) | - | 1536 | 8192 | 53.02 | 64.31 | 45.68 | 69.56 | 54.28 | 52.0 | 43.35 |
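
The Average column appears to be a mean over all 35 datasets rather than a plain mean of the six category scores. The card does not state this, so the sketch below is only a sanity check under that assumption: it weights each category score for gte-base-zh by its dataset count. The helper name `dataset_weighted_average` is made up for illustration, and the check may not reproduce every row exactly, since the category scores are already rounded.

```python
# Category scores and dataset counts for gte-base-zh, copied from the CMTEB table.
category_scores = {
    "Classification": (71.26, 9),
    "Clustering": (53.86, 4),
    "Pair Classification": (80.44, 2),
    "Reranking": (67.00, 4),
    "Retrieval": (71.71, 8),
    "STS": (55.96, 8),
}


def dataset_weighted_average(scores):
    """Weight each category score by the number of datasets in that category."""
    total_datasets = sum(count for _, count in scores.values())
    weighted_sum = sum(score * count for score, count in scores.values())
    return weighted_sum / total_datasets


average = dataset_weighted_average(category_scores)
print(f"{average:.2f}")
```

For this row the result lands within about 0.01 of the reported 65.92, which is consistent with rounding in the per-category scores.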

Usage

Code example

import torch
import torch.nn.functional as F
from modelscope import AutoTokenizer, AutoModel

input_texts = [
    "中国的首都是哪里",
    "你喜欢去哪里旅游",
    "北京",
    "今天中午吃什么"
]

tokenizer = AutoTokenizer.from_pretrained("AI-ModelScope/gte-base-zh")
model = AutoModel.from_pretrained("AI-ModelScope/gte-base-zh")

# Tokenize the input texts; anything beyond 512 tokens is truncated
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

# Inference only, so skip gradient tracking
with torch.no_grad():
    outputs = model(**batch_dict)

# Use the hidden state of the first ([CLS]) token as the sentence embedding
embeddings = outputs.last_hidden_state[:, 0]

# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
# Similarity of the first text to each of the others, scaled by 100
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
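
To make the last three steps of the example concrete, here is a small sketch in plain Python that repeats them on toy data: take the first token's vector for each text, L2-normalize it, and score the first text against the rest. The nested list stands in for `outputs.last_hidden_state`; its numbers and the helper names (`cls_pool`, `l2_normalize`, `dot`) are made up for illustration, and real gte-base-zh vectors have 768 dimensions.

```python
import math

# Toy stand-in for outputs.last_hidden_state: 3 texts, 2 tokens each, 2-d vectors.
last_hidden_state = [
    [[3.0, 4.0], [0.5, 0.5]],   # text 0 (the query)
    [[6.0, 8.0], [0.1, 0.9]],   # text 1: first token points the same way as the query
    [[4.0, -3.0], [0.7, 0.2]],  # text 2: first token is orthogonal to the query
]


def cls_pool(hidden):
    """Keep only the first token's vector of each text (index 0 on the sequence axis)."""
    return [tokens[0] for tokens in hidden]


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


embeddings = [l2_normalize(v) for v in cls_pool(last_hidden_state)]
query, candidates = embeddings[0], embeddings[1:]

# After normalization the dot product equals the cosine similarity.
scores = [dot(query, c) * 100 for c in candidates]
print([round(s, 2) for s in scores])  # [100.0, 0.0]
```

Because the vectors are unit length, the scores fall between -100 and 100, and a higher score means the candidate points in a direction closer to the query.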

Use with sentence-transformers:

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

sentences = ['中国的首都是哪里', '中国的首都是北京']

model = SentenceTransformer('thenlper/gte-base-zh')
embeddings = model.encode(sentences)
print(cos_sim(embeddings[0], embeddings[1]))
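
The pairwise comparison above extends naturally to ranking several candidates against one query, which is the retrieval and reranking use the model description mentions. The sketch below assumes the embeddings have already been computed (for example with `model.encode`) and uses toy 2-d vectors in their place; `cosine` and `rank_candidates` are hypothetical helpers, not part of sentence-transformers.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_candidates(query_vec, candidate_vecs, top_k=3):
    """Return (index, score) pairs for the top_k candidates, best first."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(candidate_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for real sentence embeddings.
query_vec = [1.0, 0.0]
candidate_vecs = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]]

ranking = rank_candidates(query_vec, candidate_vecs, top_k=2)
print([index for index, _ in ranking])  # [1, 2]
```

For large candidate sets a vector index is usually a better fit than a full sort; this version only shows the scoring logic.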

Limitation

This model supports Chinese text only, and any input longer than 512 tokens is truncated to 512 tokens.
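
One common workaround for the 512-token limit, not described by the model authors, is to split a long document into overlapping windows, embed each window separately, and average the resulting vectors. The sketch below shows only the windowing and averaging logic on plain lists. `split_into_windows` and `mean_pool` are hypothetical helpers, and the default window of 510 ids assumes the tokenizer adds two special tokens ([CLS] and [SEP]) to every input.

```python
def split_into_windows(token_ids, max_len=510, stride=255):
    """Split a list of token ids into overlapping windows of at most max_len ids."""
    if len(token_ids) <= max_len:
        return [token_ids]
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the text
        start += stride
    return windows


def mean_pool(vectors):
    """Average a list of equal-length vectors element by element."""
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


# A 1200-token document represented by dummy ids.
token_ids = list(range(1200))
windows = split_into_windows(token_ids)
print(len(windows), [len(w) for w in windows])

# Each window would be embedded by the model; toy vectors stand in here.
doc_embedding = mean_pool([[1.0, 3.0], [3.0, 5.0]])
print(doc_embedding)
```

Averaging loses some detail compared with embedding each window on its own, so for retrieval it can be preferable to index the windows separately and keep the best-scoring one per document.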

Citation

If you find our paper or models helpful, please consider citing them as follows:

@misc{li2023general,
      title={Towards General Text Embeddings with Multi-stage Contrastive Learning},
      author={Zehan Li and Xin Zhang and Yanzhao Zhang and Dingkun Long and Pengjun Xie and Meishan Zhang},
      year={2023},
      eprint={2308.03281},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}