all-mpnet-base-v2
This is a sentence-transformers model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.
Usage (Sentence-Transformers)
Using this model becomes easy when you have sentence-transformers installed:
pip install -U sentence-transformers
Then you can use the model like this:
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
embeddings = model.encode(sentences)
print(embeddings)
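The embeddings can then be compared directly, for example with cosine similarity. The snippet below is an illustrative addition to this card (not part of the original example), using the util helpers that ship with sentence-transformers:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
# Encode two example sentences and compare them with cosine similarity
embeddings = model.encode(["This is an example sentence", "Each sentence is converted"], convert_to_tensor=True)
print(util.cos_sim(embeddings[0], embeddings[1]))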
Usage (HuggingFace Transformers)
Without sentence-transformers, you can use the model like this: First, you pass your input through the transformer model, then you have to apply the right pooling-operation on-top of the contextualized word embeddings.
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-mpnet-base-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-mpnet-base-v2')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

# Normalize embeddings
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

print("Sentence embeddings:")
print(sentence_embeddings)
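Because the embeddings are L2-normalized at the end of the snippet above, cosine similarities reduce to plain dot products. As a small illustrative continuation of that snippet (an addition to this card):

# Pairwise cosine similarities between the normalized sentence embeddings computed above
cosine_scores = sentence_embeddings @ sentence_embeddings.T
print(cosine_scores)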
Evaluation Results
For an automated evaluation of this model, see the Sentence Embeddings Benchmark: https://seb.sbert.net
Background
The project aims to train sentence embedding models on very large sentence level datasets using a self-supervised contrastive learning objective. We used the pretrained microsoft/mpnet-base model and fine-tuned it on a 1B sentence pairs dataset. We use a contrastive learning objective: given a sentence from the pair, the model should predict which out of a set of randomly sampled other sentences was actually paired with it in our dataset.
We developed this model during the Community week using JAX/Flax for NLP & CV, organized by Hugging Face. We developed this model as part of the project: Train the Best Sentence Embedding Model Ever with 1B Training Pairs. We benefited from efficient hardware infrastructure to run the project: 7 TPUs v3-8, as well as intervention from Google's Flax, JAX, and Cloud team members about efficient deep learning frameworks.
Intended uses
Our model is intended to be used as a sentence and short paragraph encoder. Given an input text, it outputs a vector which captures the semantic information. The sentence vector may be used for information retrieval, clustering or sentence similarity tasks.
By default, input text longer than 384 word pieces is truncated.
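As one hedged illustration of the clustering use case (the corpus, the cluster count, and the use of scikit-learn are assumptions made for this example, not part of the original card):

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
print(model.max_seq_length)  # 384 word pieces; longer inputs are truncated

corpus = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby.",
    "A woman is playing violin.",
]
embeddings = model.encode(corpus)

# Group semantically similar sentences; two clusters chosen arbitrarily for this toy corpus
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(embeddings)
print(labels)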
Training procedure
Pre-training
We use the pretrained microsoft/mpnet-base model. Please refer to the model card for more detailed information about the pre-training procedure.
Fine-tuning
We fine-tune the model using a contrastive objective. Formally, we compute the cosine similarity for each possible sentence pair from the batch. We then apply the cross entropy loss by comparing with the true pairs.
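A minimal PyTorch sketch of this in-batch objective (illustrative only; the exact implementation is in the training script, and the similarity scale factor below is an assumption):

import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(emb_a, emb_b, scale=20.0):
    # emb_a[i] and emb_b[i] form a true pair; every other combination in the batch is a negative
    emb_a = F.normalize(emb_a, p=2, dim=1)
    emb_b = F.normalize(emb_b, p=2, dim=1)
    # Cosine similarity between each sentence in emb_a and every sentence in emb_b
    scores = emb_a @ emb_b.T * scale
    # Cross entropy against the true pairs, which lie on the diagonal of the score matrix
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)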
Hyper parameters
We trained our model on a TPU v3-8. We train the model during 100k steps using a batch size of 1024 (128 per TPU core). We use a learning rate warm up of 500. The sequence length was limited to 128 tokens. We used the AdamW optimizer with a 2e-5 learning rate. The full training script is accessible in this current repository: train_script.py.
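A rough sketch of the optimizer set-up described above (the linear warm-up schedule and the use of the transformers helper are assumptions; see train_script.py for the actual code):

import torch
from transformers import AutoModel, get_linear_schedule_with_warmup

model = AutoModel.from_pretrained('microsoft/mpnet-base')
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
# 500 warm-up steps out of 100k total training steps, as stated above
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=500, num_training_steps=100_000)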
Training data
We use the concatenation from multiple datasets to fine-tune our model. The total number of sentence pairs is above 1 billion sentences. We sampled each dataset given a weighted probability, whose configuration is detailed in the data_config.json file.
| Dataset | Paper | Number of training tuples |
|---------|-------|----------------------------|
| Reddit comments (2015-2018) | paper | 726,484,430 |
| S2ORC Citation pairs (Abstracts) | paper | 116,288,806 |
| WikiAnswers Duplicate question pairs | paper | 77,427,422 |
| PAQ (Question, Answer) pairs | paper | 64,371,441 |
| S2ORC Citation pairs (Titles) | paper | 52,603,982 |
| S2ORC (Title, Abstract) | paper | 41,769,185 |
| Stack Exchange (Title, Body) pairs | - | 25,316,456 |
| Stack Exchange (Title+Body, Answer) pairs | - | 21,396,559 |
| Stack Exchange (Title, Answer) pairs | - | 21,396,559 |
| MS MARCO triplets | paper | 9,144,553 |
| GOOAQ: Open Question Answering with Diverse Answer Types | paper | 3,012,496 |
| Yahoo Answers (Title, Answer) | paper | 1,198,260 |
| Code Search | - | 1,151,414 |
| COCO Image captions | paper | 828,395 |
| SPECTER citation triplets | paper | 684,100 |
| Yahoo Answers (Question, Answer) | paper | 681,164 |
| Yahoo Answers (Title, Question) | paper | 659,896 |
| SearchQA | paper | 582,261 |
| Eli5 | paper | 325,475 |
| Flickr 30k | paper | 317,695 |
| Stack Exchange Duplicate questions (titles) | - | 304,525 |
| AllNLI (SNLI and MultiNLI) | paper SNLI, paper MultiNLI | 277,230 |
| Stack Exchange Duplicate questions (bodies) | - | 250,519 |
| Stack Exchange Duplicate questions (titles+bodies) | - | 250,460 |
| Sentence Compression | paper | 180,000 |
| Wikihow | paper | 128,542 |
| Altlex | paper | 112,696 |
| Quora Question Triplets | - | 103,663 |
| Simple Wikipedia | paper | 102,225 |
| Natural Questions (NQ) | paper | 100,231 |
| SQuAD2.0 | paper | 87,599 |
| TriviaQA | - | 73,346 |
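The weighted dataset sampling mentioned above can be pictured roughly as follows (a toy sketch; the dataset names and probabilities are placeholders, and the real per-dataset weights live in data_config.json):

import random

# Hypothetical sampling weights; the actual configuration is in data_config.json
datasets = {
    "reddit_comments": 0.5,
    "s2orc_abstracts": 0.3,
    "wikianswers": 0.2,
}

def sample_dataset():
    # Pick which dataset the next training batch is drawn from
    names, weights = zip(*datasets.items())
    return random.choices(names, weights=weights, k=1)[0]

print(sample_dataset())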