Fine-tuned XLSR-53 large model for speech recognition in English
Fine-tuned facebook/wav2vec2-large-xlsr-53 on English using the train and validation splits of Common Voice 6.1. When using this model, make sure that your speech input is sampled at 16 kHz. This model was fine-tuned thanks to GPU credits generously provided by OVHcloud :) The script used for training can be found here: https://github.com/jonatasgrosman/wav2vec2-sprint
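Common Voice clips usually ship at a higher sample rate (e.g. 48 kHz MP3s), so they must be resampled before inference. In the scripts below, librosa.load(..., sr=16_000) handles this. Purely as an illustration of what that resampling step does, here is a minimal linear-interpolation sketch in plain NumPy (illustrative only; in practice use librosa or torchaudio, which apply proper anti-aliasing filters):

```python
import numpy as np

def resample_to_16k(speech: np.ndarray, orig_sr: int) -> np.ndarray:
    """Naive resampling sketch: linearly interpolate a mono waveform to 16 kHz."""
    target_sr = 16_000
    if orig_sr == target_sr:
        return speech
    duration = len(speech) / orig_sr
    n_target = int(round(duration * target_sr))
    # Timestamps of the original and target sample grids
    old_t = np.arange(len(speech)) / orig_sr
    new_t = np.arange(n_target) / target_sr
    return np.interp(new_t, old_t, speech).astype(np.float32)
```

One second of 48 kHz audio (48,000 samples) becomes 16,000 samples at the target rate.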
Usage
The model can be used directly (without a language model) as follows, using the HuggingSound library:
from modelscope import snapshot_download
from huggingsound import SpeechRecognitionModel

local_model = snapshot_download("AI-ModelScope/wav2vec2-large-xlsr-53-english", revision='master')
model = SpeechRecognitionModel(local_model)
audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]
transcriptions = model.transcribe(audio_paths)
Writing your own inference script:
import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
from modelscope import snapshot_download

LANG_ID = "en"
MODEL_ID = "AI-ModelScope/wav2vec2-large-xlsr-53-english"
SAMPLES = 10

test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

local_model = snapshot_download(MODEL_ID, revision='master')
processor = Wav2Vec2Processor.from_pretrained(local_model)
model = Wav2Vec2ForCTC.from_pretrained(local_model)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)

for i, predicted_sentence in enumerate(predicted_sentences):
    print("-" * 100)
    print("Reference:", test_dataset[i]["sentence"])
    print("Prediction:", predicted_sentence)
| Reference | Prediction |
| --- | --- |
| "SHE'LL BE ALL RIGHT." | SHE'LL BE ALL RIGHT |
| SIX | SIX |
| "ALL'S WELL THAT ENDS WELL." | ALL AS WELL THAT ENDS WELL |
| DO YOU MEAN IT? | DO YOU MEAN IT |
| THE NEW PATCH IS LESS INVASIVE THAN THE OLD ONE, BUT STILL CAUSES REGRESSIONS. | THE NEW PATCH IS LESS INVASIVE THAN THE OLD ONE BUT STILL CAUSES REGRESSION |
| HOW IS MOZILLA GOING TO HANDLE AMBIGUITIES LIKE QUEUE AND CUE? | HOW IS MOSLILLAR GOING TO HANDLE ANDBEWOOTH HIS LIKE Q AND Q |
| "I GUESS YOU MUST THINK I'M KINDA BATTY." | RUSTIAN WASTIN PAN ONTE BATTLY |
| NO ONE NEAR THE REMOTE MACHINE YOU COULD RING? | NO ONE NEAR THE REMOTE MACHINE YOU COULD RING |
| SAUCE FOR THE GOOSE IS SAUCE FOR THE GANDER. | SAUCE FOR THE GUICE IS SAUCE FOR THE GONDER |
| GROVES STARTED WRITING SONGS WHEN SHE WAS FOUR YEARS OLD. | GRAFS STARTED WRITING SONGS WHEN SHE WAS FOUR YEARS OLD |
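Mismatches between the reference and prediction columns are what the evaluation below quantifies as word error rate (WER). As a self-contained illustration of the metric itself (eval.py in the training repo uses dedicated evaluation tooling; this sketch is not that implementation), WER is word-level Levenshtein distance normalized by reference length:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: minimum word edits to turn hypothesis into reference,
    divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```

For example, wer("DO YOU MEAN IT", "DO YOU MEAN IT") is 0.0, while one substituted word out of four gives 0.25.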
Evaluation
To evaluate on mozilla-foundation/common_voice_6_0 with split test:
python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-english --dataset mozilla-foundation/common_voice_6_0 --config en --split test
To evaluate on speech-recognition-community-v2/dev_data:
python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-english --dataset speech-recognition-community-v2/dev_data --config en --split validation --chunk_length_s 5.0 --stride_length_s 1.0
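The --chunk_length_s and --stride_length_s flags split long recordings into overlapping windows so they fit in memory; predictions in the overlapping stride regions are discarded when the chunks are merged. A sketch of the boundary arithmetic only (assuming, as in the command above, a 5.0 s chunk with a 1.0 s stride on each side at 16 kHz; the actual merging logic lives in the evaluation tooling):

```python
def chunk_boundaries(n_samples: int, sr: int = 16_000,
                     chunk_length_s: float = 5.0, stride_length_s: float = 1.0):
    """Return (start, end) sample indices of overlapping chunks.
    Consecutive chunks overlap by stride_length_s on each side, so the
    step between chunk starts is chunk_length_s - 2 * stride_length_s."""
    chunk = int(chunk_length_s * sr)
    stride = int(stride_length_s * sr)
    step = chunk - 2 * stride
    spans = []
    start = 0
    while start < n_samples:
        spans.append((start, min(start + chunk, n_samples)))
        if start + chunk >= n_samples:
            break  # final chunk reaches the end of the audio
        start += step
    return spans
```

A 12-second clip (192,000 samples) yields four chunks of up to 80,000 samples each, every chunk starting 48,000 samples (3 s) after the previous one.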
Citation
If you want to cite this model you can use this:
@misc{grosman2021xlsr53-large-english,
  title={Fine-tuned {XLSR}-53 large model for speech recognition in {E}nglish},
  author={Grosman, Jonatas},
  howpublished={\url{https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-english}},
  year={2021}
}