
Exploring the Limits of Decoder-Only Models Trained on Public Speech Recognition Corpora

Conference of the International Speech Communication Association (2024)

Abstract
The emergence of industrial-scale automatic speech recognition (ASR) models such as Whisper and USM, trained on 1M hours of weakly labelled and 12M hours of audio-only proprietary data respectively, has led to a stronger need for large-scale public ASR corpora and competitive open-source pipelines. Unlike the said models, large language models are typically based on Transformer decoders, and it remains unclear whether decoder-only models trained on public data alone can deliver competitive performance. In this work, we investigate factors such as the choice of training datasets and modeling components necessary for obtaining the best performance using public English ASR corpora alone. Our Decoder-Only Transformer for ASR (DOTA) model comprehensively outperforms the encoder-decoder open-source replication of Whisper (OWSM) on nearly all English ASR benchmarks and outperforms Whisper large-v3 on 7 out of 15 test sets. We release our codebase and model checkpoints under a permissive license.
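The abstract does not spell out the model internals, but the core decoder-only idea it contrasts with encoder-decoder ASR can be sketched: acoustic frames are projected into the token embedding space and prepended as a prefix, and a single causally masked Transformer stack then predicts the transcript tokens that follow. The sketch below is an illustrative assumption, not the authors' DOTA implementation; the class name DecoderOnlyASR, the log-mel front end, and all layer sizes are hypothetical.

```python
# Minimal sketch of a decoder-only ASR model: audio frames as a prefix,
# text tokens as the causally predicted continuation. All dimensions
# and names are illustrative assumptions, not the paper's architecture.
import torch
import torch.nn as nn

class DecoderOnlyASR(nn.Module):
    def __init__(self, vocab_size=8192, d_model=512, n_heads=8,
                 n_layers=6, n_mels=80, max_len=4096):
        super().__init__()
        # Project acoustic frames (e.g. log-mel features) into the token
        # embedding space so the same decoder stack can consume them.
        self.audio_proj = nn.Linear(n_mels, d_model)
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            batch_first=True)
        # With a causal mask applied, this stack behaves as a
        # decoder-only (GPT-style) Transformer.
        self.decoder = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, audio_feats, text_ids):
        # audio_feats: (batch, n_frames, n_mels)
        # text_ids:    (batch, n_tokens)
        prefix = self.audio_proj(audio_feats)
        tokens = self.tok_embed(text_ids)
        x = torch.cat([prefix, tokens], dim=1)
        x = x + self.pos_embed(torch.arange(x.size(1), device=x.device))
        # Causal mask: each position attends only to itself and earlier
        # positions, so every text token sees the full acoustic prefix.
        mask = nn.Transformer.generate_square_subsequent_mask(
            x.size(1)).to(x.device)
        h = self.decoder(x, mask=mask)
        # Logits only for the text positions; shift against text_ids to
        # form next-token cross-entropy targets during training.
        return self.lm_head(h[:, prefix.size(1):, :])
```

A forward pass on dummy inputs, e.g. DecoderOnlyASR()(torch.randn(2, 300, 80), torch.randint(0, 8192, (2, 40))), yields logits of shape (2, 40, 8192), one next-token distribution per transcript position.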
Key words
Acoustic Modeling, Automatic Speech Recognition