# WhisperKit Transcription Quality

The tables below report transcription quality for WhisperKit model versions on short-form and long-form English audio. See the Explanation section for methodology and the Reproducing Results section for how these numbers are generated.
## Dataset: `librispeech`

| Model | WER (↓) | QoI (↑) | File Size (MB) | Code Commit |
|:------|--------:|--------:|---------------:|:------------|
| large-v2 (WhisperOpenAIAPI) | 2.35 | 100 | 3100 | N/A |
| large-v2 | 2.77 | 96.6 | 3100 | Link |
| large-v2_949MB | 2.4 | 94.6 | 949 | Link |
| large-v2_turbo | 2.76 | 96.6 | 3100 | Link |
| large-v2_turbo_955MB | 2.41 | 94.6 | 955 | Link |
| large-v3 | 2.04 | 95.2 | 3100 | Link |
| large-v3_turbo | 2.03 | 95.4 | 3100 | Link |
| large-v3_turbo_954MB | 2.47 | 93.9 | 954 | Link |
| distil-large-v3 | 2.47 | 89.7 | 1510 | Link |
| distil-large-v3_594MB | 2.96 | 85.4 | 594 | Link |
| distil-large-v3_turbo | 2.47 | 89.7 | 1510 | Link |
| distil-large-v3_turbo_600MB | 2.78 | 86.2 | 600 | Link |
| small.en | 3.12 | 85.8 | 483 | Link |
| small | 3.45 | 83 | 483 | Link |
| base.en | 3.98 | 75.3 | 145 | Link |
| base | 4.97 | 67.2 | 145 | Link |
| tiny.en | 5.61 | 63.9 | 66 | Link |
| tiny | 7.47 | 52.5 | 66 | Link |
## Dataset: `earnings22`

| Model | WER (↓) | QoI (↑) | File Size (MB) | Code Commit |
|:------|--------:|--------:|---------------:|:------------|
| large-v2 (WhisperOpenAIAPI) | 16.27 | 100 | 3100 | N/A |
| large-v3 | 15.17 | 58.5 | 3100 | Link |
| distil-large-v3 | 15.28 | 46.3 | 1510 | Link |
| base.en | 23.49 | 6.5 | 145 | Link |
| tiny.en | 28.64 | 5.7 | 66 | Link |
## Explanation

We believe that rigorously measuring the quality of inference is necessary for developers and enterprises to make informed decisions when opting to use optimized or compressed variants of any machine learning model in production. To contextualize `WhisperKit`, we take the following Whisper implementations and benchmark them using a consistent evaluation harness:
Server-side:

- `WhisperOpenAIAPI`: OpenAI's Whisper API ($0.36 per hour of audio as of 02/29/24, 25MB file size limit per request)

On-device:

- `WhisperKit`: Argmax's implementation [Eval Harness] [Repo]
- `whisper.cpp`: A C++ implementation from ggerganov [Eval Harness] [Repo]
- `WhisperMLX`: A Python implementation from Apple MLX [Eval Harness] [Repo]

(All on-device implementations are available for free under the MIT license as of 03/19/2024.)

`WhisperOpenAIAPI` sets the reference and we assume that it is using the equivalent of `openai/whisper-large-v2` in float16 precision along with additional undisclosed optimizations from OpenAI.
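For context, transcribing a clip through this API with the official `openai` Python SDK looks roughly like the following. This is a minimal sketch, not the eval harness itself (which is linked above); `sample.mp3` is an illustrative path:

```python
# Minimal sketch: transcribe one clip with OpenAI's hosted Whisper via the
# official `openai` Python SDK (v1.x). Requires OPENAI_API_KEY in the
# environment; "sample.mp3" is an illustrative path, not part of the harness.
from openai import OpenAI

client = OpenAI()

with open("sample.mp3", "rb") as audio_file:  # must be under the 25MB limit
    transcription = client.audio.transcriptions.create(
        model="whisper-1",  # public alias for OpenAI's hosted Whisper
        file=audio_file,
    )

print(transcription.text)
```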
In all measurements, we care primarily about per-example no-regressions (quantified as `qoi` below), which is a stricter metric compared to the dataset-average Word Error Rate (WER). A 100% `qoi` preserves perfect backwards-compatibility on the test distribution and avoids "perceived regressions", the phenomenon where per-example known behavior changes after a code/model update and causes divergence in downstream code or breaks the user experience itself (even if dataset averages might stay flat across updates). Pseudocode for `qoi`:

```python
qoi = []
for example in dataset:
    # No regression on this example if the optimized model's WER
    # does not exceed the reference model's WER
    no_regression = wer(optimized_model(example)) <= wer(reference_model(example))
    qoi.append(no_regression)
qoi = (sum(qoi) / len(qoi)) * 100.
```
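To make the metric concrete, here is a self-contained sketch of the same computation, assuming the `jiwer` package for per-example WER and toy transcript strings in place of real model outputs:

```python
# Runnable sketch of QoI (assumptions: `jiwer` provides WER; toy strings
# stand in for actual reference-model and optimized-model transcripts).
from jiwer import wer

# (ground truth, reference-model output, optimized-model output)
examples = [
    ("the quick brown fox", "the quick brown fox", "the quick brown fox"),
    ("jumps over the lazy dog", "jumps over the lazy dog", "jumps over a lazy dog"),
    ("hello world how are you", "hello word how are yu", "hello world how are you"),
]

qoi = []
for truth, reference_out, optimized_out in examples:
    no_regression = wer(truth, optimized_out) <= wer(truth, reference_out)
    qoi.append(no_regression)

print(f"QoI: {sum(qoi) / len(qoi) * 100:.1f}%")  # 66.7%: the second example regressed
```

Note that in this toy set the optimized outputs have a lower average WER than the reference outputs, yet QoI is below 100%; this previews the ordering caveat discussed next.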
Note that the ordering of models with respect to WER does not necessarily match the ordering with respect to QoI. This is because the reference model gets assigned a QoI of 100% by definition: any per-example regression by other implementations gets penalized, while per-example improvements are not rewarded. QoI (higher is better) matters where the production behavior is established by the reference results and the goal is to not regress when switching to an optimized or compressed model. On the other hand, WER (lower is better) matters when there is no established production behavior and one is picking the best quality-versus-model-size trade-off point.

We anticipate developers that use Whisper (or similar models) in production to have their own Quality Assurance test sets, and whisperkittools offers the tooling necessary to run the same measurements on such custom test sets; please see Model Evaluation on Custom Dataset for details. A sketch of what such a custom measurement can look like follows.
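Purely as an illustration of the shape such a measurement can take on a custom QA test set (the JSON manifest format and the `transcribe_*` callables are hypothetical placeholders, not whisperkittools' actual interfaces):

```python
# Hypothetical sketch: QoI over a custom QA test set. The JSON manifest format
# and the transcribe_reference/transcribe_optimized callables are illustrative
# placeholders; whisperkittools defines its own dataset and harness interfaces.
import json
from jiwer import wer

def evaluate_qoi(manifest_path, transcribe_reference, transcribe_optimized):
    """manifest: JSON list of {"audio": path, "text": ground-truth transcript}."""
    with open(manifest_path) as f:
        examples = json.load(f)
    no_regressions = [
        wer(ex["text"], transcribe_optimized(ex["audio"]))
        <= wer(ex["text"], transcribe_reference(ex["audio"]))
        for ex in examples
    ]
    return sum(no_regressions) / len(no_regressions) * 100.0
```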
### Why are there so many Whisper versions?

WhisperKit is an SDK for building speech-to-text features in apps across a wide range of Apple devices. We are working towards abstracting away the model versioning from the developer so WhisperKit "just works" by deploying the highest-quality model version that a particular device can execute. In the interim, we leave the choice to the developer by providing quality and size trade-offs.

### Datasets

- `librispeech`: Short-form Audio (<30s/clip) - 5 hours of English audiobook clips
- `earnings22`: Long-form Audio (>1hr/clip) - 120 hours of earnings call recordings in English with various accents
### Reproducing Results

Benchmark results on this page were automatically generated by whisperkittools using our cluster of Apple Silicon Macs as self-hosted runners on GitHub Actions. We periodically recompute these benchmarks as part of our CI pipeline. Due to security concerns, we are unable to open up the cluster to the public. However, any Apple Silicon Mac (even with 8GB RAM) can be used to run identical evaluation jobs locally. For reference, our M2 Ultra devices complete a `librispeech` + `openai/whisper-large-v3` evaluation in under 1 hour regardless of the Whisper implementation. The oldest Apple Silicon Macs should take less than 1 day to complete the same evaluation.

### Glossary
- `_turbo`: Indicates the presence of additional optimizations (not compression) to unlock streaming transcription, as described in our Blog Post.
- `_*MB`: Indicates the presence of model compression. Instead of cluttering the filename with details like `_AudioEncoder-5.8bits_TextDecoder-6.1bits_QLoRA-rank=16`, we choose to summarize the compression spec as the resulting total file size, since this is what matters to developers in production.