An investigation of phone-based subword units for end-to-end speech recognition


Phones and their context-dependent variants have been the standard modeling units for conventional speech recognition systems, while characters and character-based subwords are becoming increasingly popular for end-to-end recognition systems. We investigate the use of phone-based subwords, and byte pair encoding (BPE) in particular, as modeling units for end-to-end speech recognition, and develop multi-level language model-based decoding algorithms based on a pronunciation dictionary. Apart from the lexicon, which is readily available, our system requires none of the additional expert knowledge or processing steps used in conventional systems. Experimental results show that phone-based BPEs lead to more accurate recognition systems than their character-based counterparts, and that further improvement can be obtained with the newly developed one-pass beam search decoder, which efficiently combines phone-based and character-based BPE systems. On Switchboard, our phone-based BPE system achieves word error rates (WER) of 7.9%/16.1% on the Switchboard/CallHome portions of the test set, while the ensemble system achieves 7.2%/15.0% WER.
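To make the idea of phone-based BPE concrete, the following is a minimal sketch of learning BPE merges over phone sequences taken from a pronunciation lexicon. It is an illustration only, not the authors' implementation: the toy lexicon, the function names (`learn_phone_bpe`, `apply_merge`, `encode`), the `_` joiner, the equal weighting of every word, and the tie-breaking rule are all assumptions made for this example.

```python
from collections import Counter


def apply_merge(seq, pair):
    """Replace each adjacent occurrence of `pair` in `seq` with one merged unit."""
    out = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            # "_" is a hypothetical joiner so merged units stay readable.
            out.append(seq[i] + "_" + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return tuple(out)


def learn_phone_bpe(lexicon, num_merges):
    """Learn up to `num_merges` BPE merges from word -> phone-list entries.

    Assumption: every lexicon entry counts once; a real system would more
    likely weight pronunciations by word frequency in the training text.
    """
    vocab = Counter(tuple(phones) for phones in lexicon.values())
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq, freq in vocab.items():
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken lexicographically so the
        # result is deterministic (an assumption of this sketch).
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        merged_vocab = Counter()
        for seq, freq in vocab.items():
            merged_vocab[apply_merge(seq, best)] += freq
        vocab = merged_vocab
    return merges


def encode(phones, merges):
    """Segment a phone sequence by applying the learned merges in order."""
    seq = tuple(phones)
    for pair in merges:
        seq = apply_merge(seq, pair)
    return seq


# Hypothetical toy lexicon with ARPAbet-style phone symbols.
toy_lexicon = {
    "cat": ["K", "AE", "T"],
    "cap": ["K", "AE", "P"],
    "at": ["AE", "T"],
    "tack": ["T", "AE", "K"],
}
toy_merges = learn_phone_bpe(toy_lexicon, num_merges=1)
toy_encoded = encode(toy_lexicon["cat"], toy_merges)
```

Because the merged units are built from phones rather than characters, mapping a decoded unit sequence back to words requires the pronunciation dictionary, which is the role the lexicon plays in the decoding described above.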