wav2vec2-base-960h


Technical Information

Source URL
https://modelscope.cn/models/AI-ModelScope/wav2vec2-base-960h

License
Apache License 2.0

Details

Wav2Vec2-Base-960h

Facebook's Wav2Vec2

The base model, pretrained and fine-tuned on 960 hours of Librispeech 16 kHz sampled speech audio. When using the model, make sure that your speech input is also sampled at 16 kHz.
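If your audio is at a different sample rate, resample it to 16 kHz first. A minimal sketch using scipy's `resample_poly` (an assumption; torchaudio or librosa work equally well), with a synthesized tone standing in for a real recording:

```python
import numpy as np
from scipy.signal import resample_poly

# stand-in waveform: one second of a 440 Hz tone at 44.1 kHz
orig_sr, target_sr = 44_100, 16_000
t = np.arange(orig_sr) / orig_sr
waveform = np.sin(2 * np.pi * 440 * t).astype(np.float32)

# resample to the 16 kHz the model expects; 16000/44100 reduces to 160/441
waveform_16k = resample_poly(waveform, up=160, down=441)
print(waveform_16k.shape)  # (16000,)
```

The resampled array can then be passed to the processor exactly like the dataset arrays in the examples below.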

Paper

Authors: Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli

Abstract

We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned. Experiments using all labeled data of Librispeech achieve 1.8/3.3 WER on the clean/other test sets. When lowering the amount of labeled data to one hour, wav2vec 2.0 outperforms the previous state of the art on the 100 hour subset while using 100 times less labeled data. Using just ten minutes of labeled data and pre-training on 53k hours of unlabeled data still achieves 4.8/8.2 WER. This demonstrates the feasibility of speech recognition with limited amounts of labeled data.

The original model can be found under https://github.com/pytorch/fairseq/tree/master/examples/wav2vec#wav2vec-20.

Usage

To transcribe audio files, the model can be used as a standalone acoustic model as follows:

 from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
 from datasets import load_dataset
 import torch

 # load model and processor
 processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
 model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

 # load dummy dataset and read soundfiles
 ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

 # tokenize
 input_values = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt", padding="longest").input_values  # Batch size 1

 # retrieve logits
 logits = model(input_values).logits

 # take argmax and decode
 predicted_ids = torch.argmax(logits, dim=-1)
 transcription = processor.batch_decode(predicted_ids)
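The argmax-and-decode step above is greedy CTC decoding: collapse consecutive repeated ids, then drop the blank (pad) token. A toy illustration with a made-up vocabulary and blank id (both are assumptions for the sketch, not the model's real vocabulary):

```python
# toy greedy CTC decode: collapse repeats, then remove the blank token
BLANK_ID = 0  # hypothetical blank/pad token id
VOCAB = {1: "H", 2: "E", 3: "L", 4: "O"}  # hypothetical id-to-char map

def ctc_greedy_decode(ids):
    out = []
    prev = None
    for i in ids:
        # emit a character only when the id changes and is not the blank
        if i != prev and i != BLANK_ID:
            out.append(VOCAB[i])
        prev = i
    return "".join(out)

# repeats collapse ("HH" -> "H"); the blank between the two Ls keeps them distinct
print(ctc_greedy_decode([1, 1, 2, 2, 0, 3, 3, 0, 3, 4]))  # HELLO
```

`processor.batch_decode` performs this collapse-and-strip step (plus token-to-text conversion) over the model's actual vocabulary.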

Evaluation

This code snippet shows how to evaluate facebook/wav2vec2-base-960h on LibriSpeech's "clean" and "other" test data.

from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
from jiwer import wer


librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").to("cuda")
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

def map_to_pred(batch):
    # batch["audio"] is a list of dicts when batched=True, so unpack the arrays
    audio = [a["array"] for a in batch["audio"]]
    input_values = processor(audio, sampling_rate=16_000, return_tensors="pt", padding="longest").input_values
    with torch.no_grad():
        logits = model(input_values.to("cuda")).logits

    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)
    batch["transcription"] = transcription
    return batch

result = librispeech_eval.map(map_to_pred, batched=True, batch_size=1, remove_columns=["audio"])

print("WER:", wer(result["text"], result["transcription"]))

Result (WER):

"clean"  "other"
3.4      8.6
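WER is the word-level edit distance between hypothesis and reference, normalized by the reference word count. A tiny self-contained illustration with made-up sentences (jiwer computes the same quantity, with more preprocessing options):

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference word count."""
    r, h = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(r)][len(h)] / len(r)

# one substitution over five reference words
print(word_error_rate("HELLO WORLD IT IS ME", "HELLO WORD IT IS ME"))  # 0.2
```

A WER of 3.4 on test-clean therefore means roughly 3.4 word errors per 100 reference words.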

