
Exploring the Limits of Decoder-Only Models Trained on Public Speech Recognition Corpora

Conference of the International Speech Communication Association (2024)

Abstract
The emergence of industrial-scale automatic speech recognition (ASR) models such as Whisper and USM, trained on 1M hours of weakly labelled and 12M hours of audio-only proprietary data respectively, has led to a stronger need for large-scale public ASR corpora and competitive open-source pipelines. Unlike the said models, large language models are typically based on Transformer decoders, and it remains unclear whether decoder-only models trained on public data alone can deliver competitive performance. In this work, we investigate factors such as the choice of training datasets and modeling components necessary for obtaining the best performance using public English ASR corpora alone. Our Decoder-Only Transformer for ASR (DOTA) model comprehensively outperforms the encoder-decoder open-source replication of Whisper (OWSM) on nearly all English ASR benchmarks and outperforms Whisper large-v3 on 7 out of 15 test sets. We release our codebase and model checkpoints under a permissive license.
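The abstract does not spell out the model internals, but the core decoder-only idea it contrasts with encoder-decoder ASR can be sketched: acoustic frames are projected into the token embedding space and prepended as a prefix, and a single causally masked Transformer stack then predicts the transcript tokens that follow. The sketch below is an illustrative assumption, not the authors' DOTA implementation; the class name DecoderOnlyASR, the log-mel front end, and all layer sizes are hypothetical.

```python
# Minimal sketch of a decoder-only ASR model: audio frames as a prefix,
# text tokens as the causally predicted continuation. All dimensions
# and names are illustrative assumptions, not the paper's architecture.
import torch
import torch.nn as nn

class DecoderOnlyASR(nn.Module):
    def __init__(self, vocab_size=8192, d_model=512, n_heads=8,
                 n_layers=6, n_mels=80, max_len=4096):
        super().__init__()
        # Project acoustic frames (e.g. log-mel features) into the token
        # embedding space so the same decoder stack can consume them.
        self.audio_proj = nn.Linear(n_mels, d_model)
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            batch_first=True)
        # With a causal mask applied, this stack behaves as a
        # decoder-only (GPT-style) Transformer.
        self.decoder = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, audio_feats, text_ids):
        # audio_feats: (batch, n_frames, n_mels)
        # text_ids:    (batch, n_tokens)
        prefix = self.audio_proj(audio_feats)
        tokens = self.tok_embed(text_ids)
        x = torch.cat([prefix, tokens], dim=1)
        x = x + self.pos_embed(torch.arange(x.size(1), device=x.device))
        # Causal mask: each position attends only to itself and earlier
        # positions, so every text token sees the full acoustic prefix.
        mask = nn.Transformer.generate_square_subsequent_mask(
            x.size(1)).to(x.device)
        h = self.decoder(x, mask=mask)
        # Logits only for the text positions; shift against text_ids to
        # form next-token cross-entropy targets during training.
        return self.lm_head(h[:, prefix.size(1):, :])
```

A forward pass on dummy inputs, e.g. DecoderOnlyASR()(torch.randn(2, 300, 80), torch.randint(0, 8192, (2, 40))), yields logits of shape (2, 40, 8192), one next-token distribution per transcript position.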
Key words
Acoustic Modeling, Automatic Speech Recognition