Efficient Cascaded Streaming ASR System Via Frame Rate Reduction
2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) (2023)
Abstract
In this paper, we explore various frame rate reduction schemes for the two-pass cascaded encoder model to improve its efficiency without sacrificing transcription quality. We conduct extensive studies on frame rate reduction strategies, left and right context window lengths, trade-offs among quality, latency, computation, and power consumption, and performance on short- and long-form datasets. With the proposed schemes, we can lower the 2nd pass frame rate to $120 \mathrm{~ms}$, half of the 1st pass's. This achieves a $20 \%$ real-time factor (RTF) reduction, a $13 \%$ power saving, and $19 \%$ lower final latency, with no impact on the word error rate or on the latency of partial results. If an increase in partial latency is acceptable, we can further reduce the frame rate to $180 \mathrm{~ms}$ or even $240 \mathrm{~ms}$ from the 1st pass, and obtain $45 \%$ RTF and $35 \%$ power savings, with similar or even better (on the short-form test set) recognition accuracy.
Keywords
On-device ASR, cascaded streaming model, frame rate reduction
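The abstract does not say which frame rate reduction schemes the paper uses. As a rough illustration of the general idea only, the sketch below shows three common ways to lower the frame rate of a feature sequence by an integer factor: stacking adjacent frames, averaging them, or keeping one frame per group. The function name `reduce_frame_rate`, the mode names, and the 60 ms 1st pass frame duration (inferred from "120 ms, half of the 1st pass's") are assumptions made for this example and do not come from the paper.

```python
from typing import List

Frame = List[float]


def reduce_frame_rate(frames: List[Frame], factor: int, mode: str = "stack") -> List[Frame]:
    """Lower the frame rate of a feature sequence by an integer factor.

    Hypothetical helper for illustration; not the paper's method.
    mode="stack":     concatenate each group of `factor` frames (dim grows by `factor`)
    mode="mean":      average each group of `factor` frames
    mode="subsample": keep only the first frame of each group
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    if not frames:
        return []
    dim = len(frames[0])
    # Zero-pad so the sequence length is a multiple of `factor`.
    # Note: with mode="mean", zero padding biases the last output frame.
    pad = (-len(frames)) % factor
    padded = frames + [[0.0] * dim for _ in range(pad)]

    reduced: List[Frame] = []
    for start in range(0, len(padded), factor):
        group = padded[start:start + factor]
        if mode == "stack":
            reduced.append([x for frame in group for x in frame])
        elif mode == "mean":
            reduced.append([sum(col) / factor for col in zip(*group)])
        elif mode == "subsample":
            reduced.append(list(group[0]))
        else:
            raise ValueError(f"unknown mode: {mode}")
    return reduced


# Assumed setup: 20 first-pass frames of 60 ms each (1.2 s of audio), 2 features per frame.
first_pass = [[float(t), float(t) + 0.5] for t in range(20)]
# factor=2 gives one frame per 120 ms; factor=3 or 4 would give 180 ms or 240 ms.
second_pass = reduce_frame_rate(first_pass, factor=2, mode="stack")
print(len(first_pass), len(second_pass), len(second_pass[0]))
```

With fewer frames, an encoder that processes the reduced sequence performs proportionally fewer per-frame steps, which is the intuition behind the computation and power savings the abstract reports. The trade-off is that each output now waits for a longer span of audio, which is consistent with the abstract's note that the 180 ms and 240 ms settings increase partial latency.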