# Wav2Vec2-Base-960h

The base model, pretrained and fine-tuned on 960 hours of Librispeech on 16kHz sampled speech audio. When using the model, make sure that your speech input is also sampled at 16kHz.

Authors: Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli

**Abstract**

We show for the first time that learning powerful representations from speech audio alone, followed by fine-tuning on transcribed speech, can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations, which are jointly learned. Experiments using all labeled data of Librispeech achieve 1.8/3.3 WER on the clean/other test sets. When lowering the amount of labeled data to one hour, wav2vec 2.0 outperforms the previous state of the art on the 100 hour subset while using 100 times less labeled data. Using just ten minutes of labeled data and pre-training on 53k hours of unlabeled data still achieves 4.8/8.2 WER. This demonstrates the feasibility of speech recognition with limited amounts of labeled data.

The original model can be found under https://github.com/pytorch/fairseq/tree/master/examples/wav2vec#wav2vec-20.
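The masked contrastive objective described in the abstract can be pictured with a short toy sketch. This is only an illustration, not the fairseq or transformers implementation: random tensors stand in for the context network outputs and the quantized latent targets, and the hyperparameters (number of distractors, temperature) are stand-in values.

```python
import torch
import torch.nn.functional as F

# Toy illustration of the masked contrastive task (all tensors are random stand-ins).
T, D, K = 50, 256, 10        # time steps, latent dimension, distractors per step
context = torch.randn(T, D)   # c_t: context network output at each time step
quantized = torch.randn(T, D) # q_t: quantized latent target at each time step
masked = (torch.rand(T) < 0.5).nonzero().flatten()  # indices of masked time steps

losses = []
for t in masked:
    # Candidates: the true quantized target plus K distractors drawn from
    # other masked time steps of the same utterance (may include t in this toy).
    distractors = quantized[masked[torch.randint(0, len(masked), (K,))]]
    candidates = torch.cat([quantized[t].unsqueeze(0), distractors])
    # Cosine similarity with a temperature; the positive sits at index 0,
    # so each masked step becomes a 1-of-(K+1) classification problem.
    sims = F.cosine_similarity(context[t].unsqueeze(0), candidates) / 0.1
    losses.append(F.cross_entropy(sims.unsqueeze(0), torch.zeros(1, dtype=torch.long)))

contrastive_loss = torch.stack(losses).mean()
```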
## Usage

To transcribe audio files the model can be used as a standalone acoustic model as follows:
```python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset
import torch

# load model and tokenizer
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# load dummy dataset and read soundfiles
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

# tokenize
input_values = processor(ds[0]["audio"]["array"], return_tensors="pt", padding="longest").input_values  # Batch size 1

# retrieve logits
logits = model(input_values).logits

# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
```
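The dummy dataset above is already sampled at 16kHz. For your own audio files, resample to 16kHz before calling the processor. Here is a minimal sketch using torchaudio; the file name is a placeholder and the audio is assumed to be mono:

```python
import torchaudio

# "audio.wav" is a placeholder path; the recording is assumed to be mono
waveform, sample_rate = torchaudio.load("audio.wav")
if sample_rate != 16_000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sample_rate, new_freq=16_000)

input_values = processor(waveform.squeeze(0).numpy(), sampling_rate=16_000, return_tensors="pt").input_values
logits = model(input_values).logits
transcription = processor.batch_decode(torch.argmax(logits, dim=-1))
```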
## Evaluation

This code snippet shows how to evaluate Wav2Vec2-Base-960h on LibriSpeech's "clean" test data (the "other" split is evaluated the same way):

```python
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
from jiwer import wer

librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").to("cuda")
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

def map_to_pred(batch):
    # a batched map passes lists, so gather the raw arrays for the processor
    input_values = processor([audio["array"] for audio in batch["audio"]], return_tensors="pt", padding="longest").input_values
    with torch.no_grad():
        logits = model(input_values.to("cuda")).logits

    predicted_ids = torch.argmax(logits, dim=-1)
    batch["transcription"] = processor.batch_decode(predicted_ids)
    return batch

result = librispeech_eval.map(map_to_pred, batched=True, batch_size=1, remove_columns=["audio"])

print("WER:", wer(result["text"], result["transcription"]))
```
"clea"
"other"
3.4
8.6
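For reference, jiwer computes WER as the word-level edit distance (substitutions + deletions + insertions) divided by the number of words in the reference, so a quick sanity check looks like this:

```python
from jiwer import wer

# one substitution in a two-word reference -> WER = 1/2
print(wer("hello world", "hello word"))  # 0.5
```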