all-mpnet-base-v2

Technical Information

Open-source address
https://modelscope.cn/models/AI-ModelScope/all-mpnet-base-v2
License
Apache License 2.0

Model Details

all-mpnet-base-v2

This is a sentence-transformers model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.

Usage (Sentence-Transformers)

Using this model becomes easy when you have sentence-transformers installed:

pip install -U sentence-transformers

Then you can use the model like this:

from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
embeddings = model.encode(sentences)
print(embeddings)
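
The embeddings can be compared directly, for example for semantic search. A minimal sketch, assuming the util.cos_sim helper that ships with sentence-transformers (any cosine-similarity implementation would do); the corpus and query are illustrative, not from the original card:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Illustrative corpus and query (not from the original card)
corpus = ["This is an example sentence", "Each sentence is converted"]
query = "An example query sentence"

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Cosine similarity between the query and every corpus sentence
print(util.cos_sim(query_embedding, corpus_embeddings))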

Usage (HuggingFace Transformers)

Without sentence-transformers, you can use the model like this: First, you pass your input through the transformer model, then you have to apply the right pooling operation on top of the contextualized word embeddings.

from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-mpnet-base-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-mpnet-base-v2')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

# Normalize embeddings
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

print("Sentence embeddings:")
print(sentence_embeddings)

Evaluation Results

For an automated evaluation of this model, see the Sentence Embeddings Benchmark: https://seb.sbert.net


Background

The project aims to train sentence embedding models on very large sentence-level datasets using a self-supervised contrastive learning objective. We used the pretrained microsoft/mpnet-base model and fine-tuned it on a 1B sentence pairs dataset. We use a contrastive learning objective: given a sentence from the pair, the model should predict which out of a set of randomly sampled other sentences was actually paired with it in our dataset.

We developed this model during the Community week using JAX/Flax for NLP & CV, organized by Hugging Face, as part of the project: Train the Best Sentence Embedding Model Ever with 1B Training Pairs. We benefited from efficient hardware infrastructure to run the project: 7 TPUs v3-8, as well as intervention from Google's Flax, JAX, and Cloud team members about efficient deep learning frameworks.

Intended uses

Our model is intended to be used as a sentence and short paragraph encoder. Given an input text, it outputs a vector which captures the semantic information. The sentence vector may be used for information retrieval, clustering or sentence similarity tasks.
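
As an illustration of the clustering use case, here is a minimal sketch that groups the 768-dimensional vectors with scikit-learn's KMeans (scikit-learn is an assumption here, not a dependency stated in this card; the sentences are illustrative):

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Illustrative sentences covering two rough topics
sentences = ["A man is eating food.", "A man is eating pasta.",
             "The new movie is awesome.", "The new movie is so great."]

embeddings = model.encode(sentences)

# Group the sentence vectors into two clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embeddings)
print(kmeans.labels_)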

By default, input text longer than 384 word pieces is truncated.
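
The effective limit can be inspected (and, with the usual quality caveats, changed) through the max_seq_length attribute of the sentence-transformers model; when using the plain transformers tokenizer, the limit has to be requested explicitly. A minimal sketch:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
print(model.max_seq_length)  # 384 by default; longer inputs are truncated

# With the plain tokenizer, the same truncation can be requested explicitly:
# encoded_input = tokenizer(sentences, padding=True, truncation=True,
#                           max_length=384, return_tensors='pt')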

Training procedure

Pre-training

We use the pretrained microsoft/mpnet-base model. Please refer to the model card for more detailed information about the pre-training procedure.

Fine-tuning

We fine-tune the model using a contrastive objective. Formally, we compute the cosine similarity for each possible sentence pair in the batch. We then apply the cross-entropy loss by comparing with the true pairs.
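
A minimal sketch of this objective (in-batch negatives scored with cosine similarity and trained with cross-entropy; the scale factor below is an illustrative assumption, not a value stated in this card):

import torch
import torch.nn.functional as F

def contrastive_loss(anchor_emb, positive_emb, scale=20.0):
    # anchor_emb, positive_emb: (batch_size, dim) embeddings of the paired sentences
    anchor_emb = F.normalize(anchor_emb, p=2, dim=1)
    positive_emb = F.normalize(positive_emb, p=2, dim=1)

    # Cosine similarity between every anchor and every candidate in the batch
    scores = anchor_emb @ positive_emb.T * scale

    # The true pair for anchor i is candidate i; all other columns act as negatives
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)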

Hyper parameters

We trained our model on a TPU v3-8. We train the model for 100k steps using a batch size of 1024 (128 per TPU core). We use a learning-rate warm-up of 500 steps. The sequence length was limited to 128 tokens. We used the AdamW optimizer with a 2e-5 learning rate. The full training script is accessible in this current repository: train_script.py.
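
A rough sketch of that optimizer setup using standard PyTorch and transformers APIs (the linear decay schedule is an assumption, as the card only states the warm-up length; the authoritative version is train_script.py):

import torch
from transformers import AutoModel, get_linear_schedule_with_warmup

# Start from the pretrained checkpoint named above
model = AutoModel.from_pretrained('microsoft/mpnet-base')

# AdamW with the stated 2e-5 learning rate
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# 500 warm-up steps over 100k training steps
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=500, num_training_steps=100_000)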

Training data

We use the concatenation of multiple datasets to fine-tune our model. The total number of sentence pairs is above 1 billion sentences. We sampled each dataset given a weighted probability, whose configuration is detailed in the data_config.json file; a minimal sampling sketch follows the table below.

Dataset Paper Number of training tuples
Reddit comments (2015-2018) paper 726,484,430
S2ORC Citation pairs (Abstracts) paper 116,288,806
WikiAnswers Duplicate question pairs paper 77,427,422
PAQ (Question, Answer) pairs paper 64,371,441
S2ORC Citation pairs (Titles) paper 52,603,982
S2ORC (Title, Abstract) paper 41,769,185
Stack Exchange (Title, Body) pairs - 25,316,456
Stack Exchange (Title+Body, Answer) pairs - 21,396,559
Stack Exchange (Title, Answer) pairs - 21,396,559
MS MARCO triplets paper 9,144,553
GOOAQ: Open Question Answering with Diverse Answer Types paper 3,012,496
Yahoo Answers (Title, Answer) paper 1,198,260
Code Search - 1,151,414
COCO Image captions paper 828,395
SPECTER citation triplets paper 684,100
Yahoo Answers (Question, Answer) paper 681,164
Yahoo Answers (Title, Question) paper 659,896
SearchQA paper 582,261
Eli5 paper 325,475
Flickr 30k paper 317,695
Stack Exchange Duplicate questions (titles) 304,525
AllNLI (SNLI and MultiNLI) paper SNLI, paper MultiNLI 277,230
Stack Exchange Duplicate questions (bodies) 250,519
Stack Exchange Duplicate questions (titles+bodies) 250,460
Sentence Compression paper 180,000
Wikihow paper 128,542
Altlex paper 112,696
Quora Question Triplets - 103,663
Simple Wikipedia paper 102,225
Natural Questions (NQ) paper 100,231
SQuAD2.0 paper 87,599
TriviaQA - 73,346
Total 1,170,060,424
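
A minimal sketch of that probability-weighted dataset sampling (the dataset names and weights below are placeholders, not the values from data_config.json):

import random

# Placeholder weights; the real per-dataset probabilities live in data_config.json
dataset_weights = {
    "reddit_comments": 0.30,
    "s2orc_citation_pairs": 0.15,
    "stack_exchange": 0.10,
}

def sample_dataset():
    names, weights = zip(*dataset_weights.items())
    return random.choices(names, weights=weights, k=1)[0]

# Each training batch is drawn from the dataset picked this way
print(sample_dataset())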

