GTE Text Embedding - Qwen2-1.5B


Technical Information

Open-source URL
https://modelscope.cn/models/iic/gte_Qwen2-1.5B-instruct
License
Apache License 2.0

Details

gte-Qwen2-1.5B-instruct

gte-Qwen2-1.5B-instruct is the latest addition to the gte embedding family. This model has been engineered starting from the Qwen2-1.5B LLM, drawing on the robust natural language processing capabilities of the Qwen2-1.5B model. Enhanced through our sophisticated embedding training techniques, the model incorporates several key advancements:

  • Integration of bidirectional attention mechanisms, enriching its contextual understanding.
  • Instruction tuning, applied solely on the query side for streamlined efficiency (see the sketch after this list).
  • Comprehensive training across a vast, multilingual text corpus spanning diverse domains and scenarios. This training leverages both weakly supervised and supervised data, ensuring the model's applicability across numerous languages and a wide array of downstream tasks.
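
Concretely, instruction tuning on the query side only means a one-sentence task description is prepended to each query, while documents are embedded unchanged; the Usage section below follows this convention. A minimal sketch of the format:

# Query-side instruction format used by this model (documents get no prefix).
task = 'Given a web search query, retrieve relevant passages that answer the query'
query = 'how much protein should a female eat'

prompted_query = f'Instruct: {task}\nQuery: {query}'  # embed this
document = "As a general guideline, the CDC's average requirement of protein ..."  # embed as-is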

Model Information

  • Model Size: 1.5B
  • Embedding Dimension: 1536
  • Max Input Tokens: 32k
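
As a quick sanity check of the numbers above (a minimal sketch; it assumes the model has already been downloaded as in the SDK Download section below), the embedding dimension can be verified directly:

from sentence_transformers import SentenceTransformer
from modelscope import snapshot_download

model_dir = snapshot_download('iic/gte_Qwen2-1.5B-instruct')
model = SentenceTransformer(model_dir, trust_remote_code=True)
print(model.encode(['hello world']).shape)  # (1, 1536), matching the embedding dimension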

Model Download

SDK Download

# Install ModelScope
pip install modelscope
pip install sentence_transformers
# SDK model download
from modelscope import snapshot_download
model_dir = snapshot_download('iic/gte_Qwen2-1.5B-instruct')

Git Download

# Git model download
git clone https://www.modelscope.cn/iic/gte_Qwen2-1.5B-instruct.git
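
Weight files in ModelScope Git repositories are typically tracked with Git LFS; if the clone above only yields small pointer files, installing Git LFS first usually resolves it (this step is an assumption about the local environment, not part of the original instructions):

# Assumes git-lfs is installed on the system
git lfs install
git clone https://www.modelscope.cn/iic/gte_Qwen2-1.5B-instruct.git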

Requirements

transformers>=4.39.2
flash_attn>=2.5.6
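
Both can be installed with pip; note that flash_attn compiles CUDA kernels at install time, so a working CUDA toolchain is assumed to be available:

pip install "transformers>=4.39.2"
pip install "flash_attn>=2.5.6"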

Usage

Sentence Transformers

from sentence_transformers import SentenceTransformer
from modelscope import snapshot_download
model_dir = snapshot_download("iic/gte_Qwen2-1.5B-instruct")

model = SentenceTransformer(model_dir, trust_remote_code=True)
# In case you want to reduce the maximum length:
model.max_seq_length = 8192

queries = [
    "how much protein should a female eat",
    "summit define",
]
documents = [
    "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "Definition of summit for English Language Learners. : 1  the highest point of a mountain : the top of a mountain. : 2  the highest level. : 3  a meeting or series of meetings between the leaders of two or more governments.",
]

query_embeddings = model.encode(queries, prompt_name="query")
document_embeddings = model.encode(documents)

scores = (query_embeddings @ document_embeddings.T) * 100
print(scores.tolist())
# [[70.00668334960938, 8.184843063354492], [14.62419319152832, 77.71407318115234]]

See config_sentence_transformers.json for all pre-built prompt names. Otherwise, you can use model.encode(queries, prompt="Instruct: ...\nQuery: ") to use a custom prompt of your choice.
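
For example, reusing the retrieval instruction from the Transformers section below as a custom prompt (a sketch; any one-sentence task description works here):

# Custom prompt: the instruction string is prepended to every query.
task = 'Given a web search query, retrieve relevant passages that answer the query'
query_embeddings = model.encode(
    queries,
    prompt=f'Instruct: {task}\nQuery: ',
)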

Transformers

import torch
import torch.nn.functional as F

from torch import Tensor
from modelscope import AutoTokenizer, AutoModel


def last_token_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    # If the last position is attended for every sequence, the batch is
    # left-padded (or unpadded) and the final position holds the last token.
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        # Right padding: pick each sequence's last non-padding position.
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery: {query}'


# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
    get_detailed_instruct(task, 'how much protein should a female eat'),
    get_detailed_instruct(task, 'summit define')
]
# No need to add instruction for retrieval documents
documents = [
    "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "Definition of summit for English Language Learners. : 1  the highest point of a mountain : the top of a mountain. : 2  the highest level. : 3  a meeting or series of meetings between the leaders of two or more governments."
]
input_texts = queries + documents

tokenizer = AutoTokenizer.from_pretrained('iic/gte_Qwen2-1.5B-instruct', trust_remote_code=True)
model = AutoModel.from_pretrained('iic/gte_Qwen2-1.5B-instruct', trust_remote_code=True)

max_length = 8192

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=max_length, padding=True, truncation=True, return_tensors='pt')
outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())
# [[70.00666809082031, 8.184867858886719], [14.62420654296875, 77.71405792236328]]
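
Since the embeddings were just L2-normalized, the matrix product above is exactly cosine similarity scaled by 100. As a quick equivalence check (not part of the original snippet), torch's built-in helper gives the same value:

# Cosine similarity between the first query and the first document.
cos = F.cosine_similarity(embeddings[0:1], embeddings[2:3])
print(cos.item() * 100)  # ~70.0, matching scores[0][0]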

Evaluation

MTEB & C-MTEB

You can use scripts/eval_mteb.py to reproduce the following results of gte-Qwen2-1.5B-instruct on MTEB (English) / C-MTEB (Chinese):

Model Name                       MTEB(56)   C-MTEB(35)
bge-base-en-1.5                  64.23      -
bge-large-en-1.5                 63.55      -
gte-large-en-v1.5                65.39      -
gte-base-en-v1.5                 64.11      -
mxbai-embed-large-v1             64.68      -
acge_text_embedding              -          69.07
stella-mrl-large-zh-v3.5-1792d   -          68.55
gte-large-zh                     -          66.72
multilingual-e5-base             59.45      56.21
multilingual-e5-large            61.50      58.81
e5-mistral-7b-instruct           66.63      60.81
gte-Qwen1.5-7B-instruct          67.34      69.52
gte-Qwen2-7B-instruct            70.04      71.98
gte-Qwen2-1.5B-instruct          67.16      67.65

Citation

If you find our paper or models helpful, please consider citing:

@article{li2023towards,
  title={Towards general text embeddings with multi-stage contrastive learning},
  author={Li, Zehan and Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Pengjun and Zhang, Meishan},
  journal={arXiv preprint arXiv:2308.03281},
  year={2023}
}
