whisperkit-coreml

Anonymous user · July 31, 2024

Technical Information

Open-source repository
https://modelscope.cn/models/AI-ModelScope/whisperkit-coreml

Project Details

WhisperKit Transcription Quality

Dataset: librispeech

Short-form Audio (<30s/clip) - 5 hours of English audiobook clips

| Model | WER (↓) | QoI (↑) | File Size (MB) | Code Commit |
|---|---|---|---|---|
| large-v2 (WhisperOpenAIAPI) | 2.35 | 100 | 3100 | N/A |
| large-v2 | 2.77 | 96.6 | 3100 | Link |
| large-v2_949MB | 2.4 | 94.6 | 949 | Link |
| large-v2_turbo | 2.76 | 96.6 | 3100 | Link |
| large-v2_turbo_955MB | 2.41 | 94.6 | 955 | Link |
| large-v3 | 2.04 | 95.2 | 3100 | Link |
| large-v3_turbo | 2.03 | 95.4 | 3100 | Link |
| large-v3_turbo_954MB | 2.47 | 93.9 | 954 | Link |
| distil-large-v3 | 2.47 | 89.7 | 1510 | Link |
| distil-large-v3_594MB | 2.96 | 85.4 | 594 | Link |
| distil-large-v3_turbo | 2.47 | 89.7 | 1510 | Link |
| distil-large-v3_turbo_600MB | 2.78 | 86.2 | 600 | Link |
| small.en | 3.12 | 85.8 | 483 | Link |
| small | 3.45 | 83 | 483 | Link |
| base.en | 3.98 | 75.3 | 145 | Link |
| base | 4.97 | 67.2 | 145 | Link |
| tiny.en | 5.61 | 63.9 | 66 | Link |
| tiny | 7.47 | 52.5 | 66 | Link |

Dataset: earnings22

Long-form Audio (>1hr/clip) - 120 hours of earnings call recordings in English with various accents

| Model | WER (↓) | QoI (↑) | File Size (MB) | Code Commit |
|---|---|---|---|---|
| large-v2 (WhisperOpenAIAPI) | 16.27 | 100 | 3100 | N/A |
| large-v3 | 15.17 | 58.5 | 3100 | Link |
| distil-large-v3 | 15.28 | 46.3 | 1510 | Link |
| base.en | 23.49 | 6.5 | 145 | Link |
| tiny.en | 28.64 | 5.7 | 66 | Link |

Explanation

We believe that rigorously measuring the quality of inference is necessary for developers and enterprises to make informed decisions when opting to use optimized or compressed variants of any machine learning model in production. To contextualize WhisperKit, we take the following Whisper implementations and benchmark them using a consistent evaluation harness:

Server-side:

($0.36 per hour of audio as of 02/29/24, 25MB file size limit per request)

On-device:

(All on-device implementations are available for free under MIT license as of 03/19/2024)

WhisperOpenAIAPI sets the reference and we assume that it is using the equivalent of openai/whisper-large-v2 in float16 precision along with additional undisclosed optimizations from OpenAI. In all measurements, we care primarily about per-example no-regressions (quantified as qoi below) which is a stricter metric compared to dataset average Word Error Rate (WER). A 100% qoi preserves perfect backwards-compatibility on the test distribution and avoids "perceived regressions", the phenomenon where per-example known behavior changes after a code/model update and causes divergence in downstream code or breaks the user experience itself (even if dataset averages might stay flat across updates). Pseudocode for qoi:

qoi = []
for example in dataset:
    no_regression = wer(optimized_model(example)) <= wer(reference_model(example))
    qoi.append(no_regression)
qoi = (sum(qoi) / len(qoi)) * 100.

Note that the ordering of models with respect to WER does not necessarily match the ordering with respect to QoI. This is because the reference model gets assigned a QoI of 100% by definition. Any per-example regression by other implementations gets penalized while per-example improvements are not rewarded. QoI (higher is better) matters where the production behavior is established by the reference results and the goal is to not regress when switching to an optimized or compressed model. On the other hand, WER (lower is better) matters when there is no established production behavior and one is picking the best quality versus model size trade-off point.
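The pseudocode above can be made concrete with a small self-contained sketch. The `wer` helper below is a minimal word-level edit-distance implementation added here for illustration (WhisperKit's actual harness lives in whisperkittools); `qoi` follows the pseudocode, comparing each optimized-model transcript against the reference model's transcript, both scored against the ground-truth text.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming edit distance over word sequences.
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            cur = min(d[j] + 1,                              # deletion
                      d[j - 1] + 1,                          # insertion
                      prev + (ref[i - 1] != hyp[j - 1]))     # substitution
            prev, d[j] = d[j], cur
    return d[-1] / max(len(ref), 1)

def qoi(truths, reference_outputs, optimized_outputs) -> float:
    """Percent of examples where the optimized model does not regress
    relative to the reference model (per-example no-regression rate)."""
    no_regressions = [
        wer(truth, opt) <= wer(truth, ref)
        for truth, ref, opt in zip(truths, reference_outputs, optimized_outputs)
    ]
    return 100.0 * sum(no_regressions) / len(no_regressions)

# Toy example: the optimized model matches or beats the reference on both clips.
print(qoi(["the cat sat", "hello world"],
          ["the cat sat", "hello word"],
          ["the cat sat", "hello world"]))  # 100.0
```

Note how an optimized model that improves one example but regresses another still loses QoI: improvements are not rewarded, only regressions are penalized, which is exactly why the WER and QoI orderings can differ.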

We anticipate that developers who use Whisper (or similar models) in production will have their own Quality Assurance test sets, and whisperkittools offers the tooling necessary to run the same measurements on such custom test sets; please see Model Evaluation on Custom Dataset for details.

Why are there so many Whisper versions?

WhisperKit is an SDK for building speech-to-text features in apps across a wide range of Apple devices. We are working towards abstracting away the model versioning from the developer so WhisperKit "just works" by deploying the highest-quality model version that a particular device can execute. In the interim, we leave the choice to the developer by providing quality and size trade-offs.

Datasets

  • librispeech: ~5 hours of short English audio clips, tests short-form transcription quality
  • earnings22: ~120 hours of English audio clips from earnings calls with various accents, tests long-form transcription quality

Reproducing Results

Benchmark results on this page were automatically generated by whisperkittools using our cluster of Apple Silicon Macs as self-hosted runners on GitHub Actions. We periodically recompute these benchmarks as part of our CI pipeline. Due to security concerns, we are unable to open up the cluster to the public. However, any Apple Silicon Mac (even with 8GB RAM) can be used to run identical evaluation jobs locally. For reference, our M2 Ultra devices complete a librispeech + openai/whisper-large-v3 evaluation in under 1 hour regardless of the Whisper implementation. The oldest Apple Silicon Macs should take less than 1 day to complete the same evaluation.

Glossary

  • _turbo: Indicates the presence of additional optimizations (not compression) to unlock streaming transcription as described in our Blog Post.

  • _*MB: Indicates the presence of model compression. Instead of cluttering the filename with details like _AudioEncoder-5.8bits_TextDecoder-6.1bits_QLoRA-rank=16, we choose to summarize the compression spec as the resulting total file size since this is what matters to developers in production.
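As a rough sanity check on how these file sizes relate to compression, consider the back-of-the-envelope arithmetic below. The ~1.55B parameter count for whisper-large-v2 is an assumption for illustration (it is not stated on this page); the tables above report the measured file sizes.

```python
# Estimated model file size from parameter count and average bits per weight.
# PARAMS is an assumed parameter count for whisper-large-v2, used only to
# illustrate why a ~5-bit average quantization lands near the ~950 MB variants.
PARAMS = 1.55e9

def size_mb(bits_per_weight: float) -> float:
    """Approximate on-disk size in MB: params * bits / 8 bits-per-byte / 1e6."""
    return PARAMS * bits_per_weight / 8 / 1e6

print(round(size_mb(16)))   # float16: ~3100 MB, matching the uncompressed rows
print(round(size_mb(4.9)))  # ~4.9-bit average: ~949 MB, near the _949MB variant
```

This also shows why summarizing a mixed-precision spec by total file size is convenient: the per-layer bit widths collapse into a single number that maps directly onto download size and disk footprint.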

