GTE Text Embedding - Qwen2-1.5B


Technical Information

Open-source URL
https://modelscope.cn/models/iic/gte_Qwen2-1.5B-instruct
License
Apache License 2.0

Details

gte-Qwen2-1.5B-instruct

gte-Qwen2-1.5B-instruct is the latest addition to the gte embedding family. This model has been engineered starting from the Qwen2-1.5B LLM, drawing on the robust natural language processing capabilities of the Qwen2-1.5B model. Enhanced through our sophisticated embedding training techniques, the model incorporates several key advancements:

  • Integration of bidirectional attention mechanisms, enriching its contextual understanding.
  • Instruction tuning, applied solely on the query side for streamlined efficiency (see the sketch after this list).
  • Comprehensive training across a vast, multilingual text corpus spanning diverse domains and scenarios. This training leverages both weakly supervised and supervised data, ensuring the model's applicability across numerous languages and a wide array of downstream tasks.
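
Concretely, instruction tuning on the query side only means a one-sentence task description is prepended to each query, while documents are embedded unchanged; the Usage section below follows this convention. A minimal sketch of the format:

# Query-side instruction format used by this model (documents get no prefix).
task = 'Given a web search query, retrieve relevant passages that answer the query'
query = 'how much protein should a female eat'

prompted_query = f'Instruct: {task}\nQuery: {query}'  # embed this
document = "As a general guideline, the CDC's average requirement of protein ..."  # embed as-is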

Model Information

  • Model Size: 1.5B
  • Embedding Dimension: 1536
  • Max Input Tokens: 32k
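
As a quick sanity check of the numbers above (a minimal sketch; it assumes the model has already been downloaded as in the SDK Download section below), the embedding dimension can be verified directly:

from sentence_transformers import SentenceTransformer
from modelscope import snapshot_download

model_dir = snapshot_download('iic/gte_Qwen2-1.5B-instruct')
model = SentenceTransformer(model_dir, trust_remote_code=True)
print(model.encode(['hello world']).shape)  # (1, 1536), matching the embedding dimension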

Model Download

SDK Download

# Install ModelScope
pip install modelscope
pip install sentence_transformers
# SDK model download
from modelscope import snapshot_download
model_dir = snapshot_download('iic/gte_Qwen2-1.5B-instruct')

Git Download

# Git model download
git clone https://www.modelscope.cn/iic/gte_Qwen2-1.5B-instruct.git
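
Weight files in ModelScope Git repositories are typically tracked with Git LFS; if the clone above only yields small pointer files, installing Git LFS first usually resolves it (this step is an assumption about the local environment, not part of the original instructions):

# Assumes git-lfs is installed on the system
git lfs install
git clone https://www.modelscope.cn/iic/gte_Qwen2-1.5B-instruct.git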

Requirements

transformers>=4.39.2
flash_attn>=2.5.6
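
Both can be installed with pip; note that flash_attn compiles CUDA kernels at install time, so a working CUDA toolchain is assumed to be available:

pip install "transformers>=4.39.2"
pip install "flash_attn>=2.5.6"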

Usage

Sentence Transformers

from sentence_transformers import SentenceTransformer
from modelscope import snapshot_download
model_dir = snapshot_download("iic/gte_Qwen2-1.5B-instruct")

model = SentenceTransformer(model_dir, trust_remote_code=True)
# In case you want to reduce the maximum length:
model.max_seq_length = 8192

queries = [
    "how much protein should a female eat",
    "summit define",
]
documents = [
    "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "Definition of summit for English Language Learners. : 1  the highest point of a mountain : the top of a mountain. : 2  the highest level. : 3  a meeting or series of meetings between the leaders of two or more governments.",
]

query_embeddings = model.encode(queries, prompt_name="query")
document_embeddings = model.encode(documents)

scores = (query_embeddings @ document_embeddings.T) * 100
print(scores.tolist())
# [[70.00668334960938, 8.184843063354492], [14.62419319152832, 77.71407318115234]]

See config_sentence_transformers.json for all pre-built prompt names. Otherwise, you can use model.encode(queries, prompt="Instruct: ...\nQuery: ") to use a custom prompt of your choice.
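
For example, reusing the retrieval instruction from the Transformers section below as a custom prompt (a sketch; any one-sentence task description works here):

# Custom prompt: the instruction string is prepended to every query.
task = 'Given a web search query, retrieve relevant passages that answer the query'
query_embeddings = model.encode(
    queries,
    prompt=f'Instruct: {task}\nQuery: ',
)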

Transformers

import torch
import torch.nn.functional as F

from torch import Tensor
from modelscope import AutoTokenizer, AutoModel


def last_token_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    # If the last position is attended for every sequence, the batch is
    # left-padded (or unpadded) and the final position holds the last token.
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        # Right padding: pick each sequence's last non-padding position.
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery: {query}'


# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
    get_detailed_instruct(task, 'how much protein should a female eat'),
    get_detailed_instruct(task, 'summit define')
]
# No need to add instruction for retrieval documents
documents = [
    "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "Definition of summit for English Language Learners. : 1  the highest point of a mountain : the top of a mountain. : 2  the highest level. : 3  a meeting or series of meetings between the leaders of two or more governments."
]
input_texts = queries + documents

tokenizer = AutoTokenizer.from_pretrained('iic/gte_Qwen2-1.5B-instruct', trust_remote_code=True)
model = AutoModel.from_pretrained('iic/gte_Qwen2-1.5B-instruct', trust_remote_code=True)

max_length = 8192

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=max_length, padding=True, truncation=True, return_tensors='pt')
outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())
# [[70.00666809082031, 8.184867858886719], [14.62420654296875, 77.71405792236328]]
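
Since the embeddings were just L2-normalized, the matrix product above is exactly cosine similarity scaled by 100. As a quick equivalence check (not part of the original snippet), torch's built-in helper gives the same value:

# Cosine similarity between the first query and the first document.
cos = F.cosine_similarity(embeddings[0:1], embeddings[2:3])
print(cos.item() * 100)  # ~70.0, matching scores[0][0]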

Evaluation

MTEB & C-MTEB

You can use scripts/eval_mteb.py to reproduce the following results of gte-Qwen2-1.5B-instruct on MTEB (English) / C-MTEB (Chinese):

Model Name                       MTEB(56)   C-MTEB(35)
bge-base-en-1.5                  64.23      -
bge-large-en-1.5                 63.55      -
gte-large-en-v1.5                65.39      -
gte-base-en-v1.5                 64.11      -
mxbai-embed-large-v1             64.68      -
acge_text_embedding              -          69.07
stella-mrl-large-zh-v3.5-1792d   -          68.55
gte-large-zh                     -          66.72
multilingual-e5-base             59.45      56.21
multilingual-e5-large            61.50      58.81
e5-mistral-7b-instruct           66.63      60.81
gte-Qwen1.5-7B-instruct          67.34      69.52
gte-Qwen2-7B-instruct            70.04      71.98
gte-Qwen2-1.5B-instruct          67.16      67.65

Citation

If you find our paper or models helpful, please consider citing:

@article{li2023towards,
  title={Towards general text embeddings with multi-stage contrastive learning},
  author={Li, Zehan and Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Pengjun and Zhang, Meishan},
  journal={arXiv preprint arXiv:2308.03281},
  year={2023}
}
