# WhisperKit Transcription Quality

The tables below report transcription quality for WhisperKit model versions on short-form and long-form English audio. See the Explanation section for methodology and the Reproducing Results section for how these numbers are generated.
## Dataset: `librispeech`

| Model | WER (↓) | QoI (↑) | File Size (MB) | Code Commit |
|:------|--------:|--------:|---------------:|:------------|
| large-v2 (WhisperOpenAIAPI) | 2.35 | 100 | 3100 | N/A |
| large-v2 | 2.77 | 96.6 | 3100 | Link |
| large-v2_949MB | 2.4 | 94.6 | 949 | Link |
| large-v2_turbo | 2.76 | 96.6 | 3100 | Link |
| large-v2_turbo_955MB | 2.41 | 94.6 | 955 | Link |
| large-v3 | 2.04 | 95.2 | 3100 | Link |
| large-v3_turbo | 2.03 | 95.4 | 3100 | Link |
| large-v3_turbo_954MB | 2.47 | 93.9 | 954 | Link |
| distil-large-v3 | 2.47 | 89.7 | 1510 | Link |
| distil-large-v3_594MB | 2.96 | 85.4 | 594 | Link |
| distil-large-v3_turbo | 2.47 | 89.7 | 1510 | Link |
| distil-large-v3_turbo_600MB | 2.78 | 86.2 | 600 | Link |
| small.en | 3.12 | 85.8 | 483 | Link |
| small | 3.45 | 83 | 483 | Link |
| base.en | 3.98 | 75.3 | 145 | Link |
| base | 4.97 | 67.2 | 145 | Link |
| tiny.en | 5.61 | 63.9 | 66 | Link |
| tiny | 7.47 | 52.5 | 66 | Link |
## Dataset: `earnings22`

| Model | WER (↓) | QoI (↑) | File Size (MB) | Code Commit |
|:------|--------:|--------:|---------------:|:------------|
| large-v2 (WhisperOpenAIAPI) | 16.27 | 100 | 3100 | N/A |
| large-v3 | 15.17 | 58.5 | 3100 | Link |
| distil-large-v3 | 15.28 | 46.3 | 1510 | Link |
| base.en | 23.49 | 6.5 | 145 | Link |
| tiny.en | 28.64 | 5.7 | 66 | Link |
## Explanation

We believe that rigorously measuring the quality of inference is necessary for developers and enterprises to make informed decisions when opting to use optimized or compressed variants of any machine learning model in production. To contextualize `WhisperKit`, we take the following Whisper implementations and benchmark them using a consistent evaluation harness:
Server-side:

- `WhisperOpenAIAPI`: OpenAI's Whisper API ($0.36 per hour of audio as of 02/29/24, 25MB file size limit per request)

On-device:

- `WhisperKit`: Argmax's implementation [Eval Harness] [Repo]
- `whisper.cpp`: A C++ implementation from ggerganov [Eval Harness] [Repo]
- `WhisperMLX`: A Python implementation from Apple MLX [Eval Harness] [Repo]

(All on-device implementations are available for free under the MIT license as of 03/19/2024.)

`WhisperOpenAIAPI` sets the reference and we assume that it is using the equivalent of `openai/whisper-large-v2` in float16 precision along with additional undisclosed optimizations from OpenAI.
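For context, transcribing a clip through this API with the official `openai` Python SDK looks roughly like the following. This is a minimal sketch, not the eval harness itself (which is linked above); `sample.mp3` is an illustrative path:

```python
# Minimal sketch: transcribe one clip with OpenAI's hosted Whisper via the
# official `openai` Python SDK (v1.x). Requires OPENAI_API_KEY in the
# environment; "sample.mp3" is an illustrative path, not part of the harness.
from openai import OpenAI

client = OpenAI()

with open("sample.mp3", "rb") as audio_file:  # must be under the 25MB limit
    transcription = client.audio.transcriptions.create(
        model="whisper-1",  # public alias for OpenAI's hosted Whisper
        file=audio_file,
    )

print(transcription.text)
```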
In all measurements, we care primarily about per-example no-regressions (quantified as `qoi` below), which is a stricter metric compared to the dataset-average Word Error Rate (WER). A 100% `qoi` preserves perfect backwards-compatibility on the test distribution and avoids "perceived regressions", the phenomenon where per-example known behavior changes after a code/model update and causes divergence in downstream code or breaks the user experience itself (even if dataset averages might stay flat across updates). Pseudocode for `qoi`:

```python
qoi = []
for example in dataset:
    # No regression on this example if the optimized model's WER
    # does not exceed the reference model's WER
    no_regression = wer(optimized_model(example)) <= wer(reference_model(example))
    qoi.append(no_regression)
qoi = (sum(qoi) / len(qoi)) * 100.
```
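To make the metric concrete, here is a self-contained sketch of the same computation, assuming the `jiwer` package for per-example WER and toy transcript strings in place of real model outputs:

```python
# Runnable sketch of QoI (assumptions: `jiwer` provides WER; toy strings
# stand in for actual reference-model and optimized-model transcripts).
from jiwer import wer

# (ground truth, reference-model output, optimized-model output)
examples = [
    ("the quick brown fox", "the quick brown fox", "the quick brown fox"),
    ("jumps over the lazy dog", "jumps over the lazy dog", "jumps over a lazy dog"),
    ("hello world how are you", "hello word how are yu", "hello world how are you"),
]

qoi = []
for truth, reference_out, optimized_out in examples:
    no_regression = wer(truth, optimized_out) <= wer(truth, reference_out)
    qoi.append(no_regression)

print(f"QoI: {sum(qoi) / len(qoi) * 100:.1f}%")  # 66.7%: the second example regressed
```

Note that in this toy set the optimized outputs have a lower average WER than the reference outputs, yet QoI is below 100%; this previews the ordering caveat discussed next.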
Note that the ordering of models with respect to WER does not necessarily match the ordering with respect to QoI. This is because the reference model gets assigned a QoI of 100% by definition: any per-example regression by other implementations gets penalized, while per-example improvements are not rewarded. QoI (higher is better) matters where the production behavior is established by the reference results and the goal is to not regress when switching to an optimized or compressed model. On the other hand, WER (lower is better) matters when there is no established production behavior and one is picking the best quality-versus-model-size trade-off point.

We anticipate developers that use Whisper (or similar models) in production to have their own Quality Assurance test sets, and whisperkittools offers the tooling necessary to run the same measurements on such custom test sets; please see Model Evaluation on Custom Dataset for details. A sketch of what such a custom measurement can look like follows.
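Purely as an illustration of the shape such a measurement can take on a custom QA test set (the JSON manifest format and the `transcribe_*` callables are hypothetical placeholders, not whisperkittools' actual interfaces):

```python
# Hypothetical sketch: QoI over a custom QA test set. The JSON manifest format
# and the transcribe_reference/transcribe_optimized callables are illustrative
# placeholders; whisperkittools defines its own dataset and harness interfaces.
import json
from jiwer import wer

def evaluate_qoi(manifest_path, transcribe_reference, transcribe_optimized):
    """manifest: JSON list of {"audio": path, "text": ground-truth transcript}."""
    with open(manifest_path) as f:
        examples = json.load(f)
    no_regressions = [
        wer(ex["text"], transcribe_optimized(ex["audio"]))
        <= wer(ex["text"], transcribe_reference(ex["audio"]))
        for ex in examples
    ]
    return sum(no_regressions) / len(no_regressions) * 100.0
```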
### Why are there so many Whisper versions?

WhisperKit is an SDK for building speech-to-text features in apps across a wide range of Apple devices. We are working towards abstracting away the model versioning from the developer so WhisperKit "just works" by deploying the highest-quality model version that a particular device can execute. In the interim, we leave the choice to the developer by providing quality and size trade-offs.

### Datasets

- `librispeech`: Short-form Audio (<30s/clip) - 5 hours of English audiobook clips
- `earnings22`: Long-form Audio (>1hr/clip) - 120 hours of earnings call recordings in English with various accents
### Reproducing Results

Benchmark results on this page were automatically generated by whisperkittools using our cluster of Apple Silicon Macs as self-hosted runners on GitHub Actions. We periodically recompute these benchmarks as part of our CI pipeline. Due to security concerns, we are unable to open up the cluster to the public. However, any Apple Silicon Mac (even with 8GB RAM) can be used to run identical evaluation jobs locally. For reference, our M2 Ultra devices complete a `librispeech` + `openai/whisper-large-v3` evaluation in under 1 hour regardless of the Whisper implementation. The oldest Apple Silicon Macs should take less than 1 day to complete the same evaluation.

### Glossary
- `_turbo`: Indicates the presence of additional optimizations (not compression) to unlock streaming transcription, as described in our Blog Post.
- `_*MB`: Indicates the presence of model compression. Instead of cluttering the filename with details like `_AudioEncoder-5.8bits_TextDecoder-6.1bits_QLoRA-rank=16`, we choose to summarize the compression spec as the resulting total file size, since this is what matters to developers in production.