all-mpnet-base-v2
This is a sentence-transformers model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.
Usage (Sentence-Transformers)
Using this model becomes easy when you have sentence-transformers installed:
pip install -U sentence-transformers
Then you can use the model like this:
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
embeddings = model.encode(sentences)
print(embeddings)
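The embeddings can then be compared directly, for example with cosine similarity. The snippet below is an illustrative addition to this card (not part of the original example), using the util helpers that ship with sentence-transformers:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
# Encode two example sentences and compare them with cosine similarity
embeddings = model.encode(["This is an example sentence", "Each sentence is converted"], convert_to_tensor=True)
print(util.cos_sim(embeddings[0], embeddings[1]))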
Usage (HuggingFace Transformers)
Without sentence-transformers, you can use the model like this: First, you pass your input through the transformer model, then you have to apply the right pooling-operation on-top of the contextualized word embeddings.
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-mpnet-base-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-mpnet-base-v2')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

# Normalize embeddings
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

print("Sentence embeddings:")
print(sentence_embeddings)
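Because the embeddings are L2-normalized at the end of the snippet above, cosine similarities reduce to plain dot products. As a small illustrative continuation of that snippet (an addition to this card):

# Pairwise cosine similarities between the normalized sentence embeddings computed above
cosine_scores = sentence_embeddings @ sentence_embeddings.T
print(cosine_scores)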
Evaluation Results
For an automated evaluation of this model, see the Sentence Embeddings Benchmark: https://seb.sbert.net
Background
The project aims to train sentence embedding models on very large sentence level datasets using a self-supervised contrastive learning objective. We used the pretrained microsoft/mpnet-base model and fine-tuned it on a 1B sentence pairs dataset. We use a contrastive learning objective: given a sentence from the pair, the model should predict which out of a set of randomly sampled other sentences was actually paired with it in our dataset.
We developed this model during the Community week using JAX/Flax for NLP & CV, organized by Hugging Face. We developed this model as part of the project: Train the Best Sentence Embedding Model Ever with 1B Training Pairs. We benefited from efficient hardware infrastructure to run the project: 7 TPUs v3-8, as well as intervention from Google's Flax, JAX, and Cloud team members about efficient deep learning frameworks.
Intended uses
Our model is intended to be used as a sentence and short paragraph encoder. Given an input text, it outputs a vector which captures the semantic information. The sentence vector may be used for information retrieval, clustering or sentence similarity tasks.
By default, input text longer than 384 word pieces is truncated.
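As one hedged illustration of the clustering use case (the corpus, the cluster count, and the use of scikit-learn are assumptions made for this example, not part of the original card):

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
print(model.max_seq_length)  # 384 word pieces; longer inputs are truncated

corpus = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby.",
    "A woman is playing violin.",
]
embeddings = model.encode(corpus)

# Group semantically similar sentences; two clusters chosen arbitrarily for this toy corpus
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(embeddings)
print(labels)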
Training procedure
Pre-training
We use the pretrained microsoft/mpnet-base model. Please refer to the model card for more detailed information about the pre-training procedure.
Fine-tuning
We fine-tune the model using a contrastive objective. Formally, we compute the cosine similarity for each possible sentence pair from the batch. We then apply the cross entropy loss by comparing with the true pairs.
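A minimal PyTorch sketch of this in-batch objective (illustrative only; the exact implementation is in the training script, and the similarity scale factor below is an assumption):

import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(emb_a, emb_b, scale=20.0):
    # emb_a[i] and emb_b[i] form a true pair; every other combination in the batch is a negative
    emb_a = F.normalize(emb_a, p=2, dim=1)
    emb_b = F.normalize(emb_b, p=2, dim=1)
    # Cosine similarity between each sentence in emb_a and every sentence in emb_b
    scores = emb_a @ emb_b.T * scale
    # Cross entropy against the true pairs, which lie on the diagonal of the score matrix
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)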
Hyper parameters
We trained our model on a TPU v3-8. We train the model during 100k steps using a batch size of 1024 (128 per TPU core). We use a learning rate warm up of 500. The sequence length was limited to 128 tokens. We used the AdamW optimizer with a 2e-5 learning rate. The full training script is accessible in this current repository: train_script.py.
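A rough sketch of the optimizer set-up described above (the linear warm-up schedule and the use of the transformers helper are assumptions; see train_script.py for the actual code):

import torch
from transformers import AutoModel, get_linear_schedule_with_warmup

model = AutoModel.from_pretrained('microsoft/mpnet-base')
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
# 500 warm-up steps out of 100k total training steps, as stated above
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=500, num_training_steps=100_000)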
Training data
We use the concatenation from multiple datasets to fine-tune our model. The total number of sentence pairs is above 1 billion sentences. We sampled each dataset given a weighted probability, whose configuration is detailed in the data_config.json file.
| Dataset | Paper | Number of training tuples |
|---------|-------|----------------------------|
| Reddit comments (2015-2018) | paper | 726,484,430 |
| S2ORC Citation pairs (Abstracts) | paper | 116,288,806 |
| WikiAnswers Duplicate question pairs | paper | 77,427,422 |
| PAQ (Question, Answer) pairs | paper | 64,371,441 |
| S2ORC Citation pairs (Titles) | paper | 52,603,982 |
| S2ORC (Title, Abstract) | paper | 41,769,185 |
| Stack Exchange (Title, Body) pairs | - | 25,316,456 |
| Stack Exchange (Title+Body, Answer) pairs | - | 21,396,559 |
| Stack Exchange (Title, Answer) pairs | - | 21,396,559 |
| MS MARCO triplets | paper | 9,144,553 |
| GOOAQ: Open Question Answering with Diverse Answer Types | paper | 3,012,496 |
| Yahoo Answers (Title, Answer) | paper | 1,198,260 |
| Code Search | - | 1,151,414 |
| COCO Image captions | paper | 828,395 |
| SPECTER citation triplets | paper | 684,100 |
| Yahoo Answers (Question, Answer) | paper | 681,164 |
| Yahoo Answers (Title, Question) | paper | 659,896 |
| SearchQA | paper | 582,261 |
| Eli5 | paper | 325,475 |
| Flickr 30k | paper | 317,695 |
| Stack Exchange Duplicate questions (titles) | - | 304,525 |
| AllNLI (SNLI and MultiNLI) | paper SNLI, paper MultiNLI | 277,230 |
| Stack Exchange Duplicate questions (bodies) | - | 250,519 |
| Stack Exchange Duplicate questions (titles+bodies) | - | 250,460 |
| Sentence Compression | paper | 180,000 |
| Wikihow | paper | 128,542 |
| Altlex | paper | 112,696 |
| Quora Question Triplets | - | 103,663 |
| Simple Wikipedia | paper | 102,225 |
| Natural Questions (NQ) | paper | 100,231 |
| SQuAD2.0 | paper | 87,599 |
| TriviaQA | - | 73,346 |
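The weighted dataset sampling mentioned above can be pictured roughly as follows (a toy sketch; the dataset names and probabilities are placeholders, and the real per-dataset weights live in data_config.json):

import random

# Hypothetical sampling weights; the actual configuration is in data_config.json
datasets = {
    "reddit_comments": 0.5,
    "s2orc_abstracts": 0.3,
    "wikianswers": 0.2,
}

def sample_dataset():
    # Pick which dataset the next training batch is drawn from
    names, weights = zip(*datasets.items())
    return random.choices(names, weights=weights, k=1)[0]

print(sample_dataset())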