Extending Large Language Models for Speech and Audio Captioning

Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang

ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Abstract
Multimodal large language models (LLMs) have shown promising visual perception abilities when connected with image encoders, but their performance on auditory tasks has not yet been widely investigated. Meanwhile, automatic speech recognition (ASR) and automatic audio captioning (AAC) are often handled by separate systems, resulting in incomplete auditory perception abilities. To fill these gaps, this paper presents the first study to achieve both ASR and AAC by connecting an LLM with auditory encoders. A dual auditory encoder structure is proposed that integrates the Whisper encoder for speech and the BEATs encoder for audio events, with a window-level Q-Former preserving high temporal resolution. Experiments on ASR and AAC are performed on the widely used LibriSpeech, GigaSpeech, WavCaps, AudioCaps, and Clotho datasets and yield promising results. In particular, state-of-the-art results are achieved on GigaSpeech, AudioCaps, and Clotho. The model can also caption speech and audio events simultaneously from clips containing mixed speech and background audio events, a step towards more complete machine auditory perception.
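To make the dual-encoder design concrete, below is a minimal PyTorch sketch of the idea described in the abstract: frame-level features from a speech encoder (Whisper) and an audio-event encoder (BEATs) are fused, and a Q-Former applied independently to short windows of frames compresses each window into a few query embeddings before projection into the LLM's embedding space. All module names, dimensions, the window size, and the query count here are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a dual auditory encoder connector with a
# window-level Q-Former. Shapes, window size, and query count are
# assumptions for illustration, not the authors' configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WindowLevelQFormer(nn.Module):
    """Compress each fixed-length window of encoder frames into a few
    query embeddings, so output length grows with the clip length."""

    def __init__(self, d_model=2048, n_queries=1, n_layers=2, window=17):
        super().__init__()
        self.window = window  # frames per window (assumed value)
        self.queries = nn.Parameter(torch.randn(n_queries, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.qformer = nn.TransformerDecoder(layer, num_layers=n_layers)

    def forward(self, feats):                        # feats: (B, T, d_model)
        B, T, D = feats.shape
        pad = (-T) % self.window                     # pad so T splits evenly
        feats = F.pad(feats, (0, 0, 0, pad))
        n_win = feats.shape[1] // self.window
        # Cross-attend the learned queries to each window independently.
        windows = feats.reshape(B * n_win, self.window, D)
        q = self.queries.unsqueeze(0).expand(B * n_win, -1, -1)
        out = self.qformer(q, windows)               # (B*n_win, n_queries, D)
        return out.reshape(B, -1, D)                 # (B, n_win*n_queries, D)

class DualEncoderConnector(nn.Module):
    """Fuse Whisper (speech) and BEATs (audio-event) features, then map
    window-level Q-Former outputs into the LLM embedding space."""

    def __init__(self, d_whisper=1280, d_beats=768, d_llm=4096):
        super().__init__()
        d_cat = d_whisper + d_beats
        self.qformer = WindowLevelQFormer(d_model=d_cat)
        self.proj = nn.Linear(d_cat, d_llm)          # into LLM input space

    def forward(self, whisper_feats, beats_feats):
        # Assumes the two feature streams are aligned to a common frame rate.
        fused = torch.cat([whisper_feats, beats_feats], dim=-1)
        return self.proj(self.qformer(fused))        # prepended to LLM input
```

Applying the Q-Former per window rather than over the whole clip keeps the number of output embeddings proportional to the clip length, which is what gives the connector its high temporal resolution.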
Keywords
Multimodal large language model, automatic speech recognition, audio captioning, dual encoders, Q-Former