Response generation in multi-modal dialogues with split pre-generation and cross-modal contrasting

Information Processing & Management (2024)

Abstract
Because dialogues naturally occur in multiple modalities (text, audio, vision), textual response generation in dialogues should rely on multi-modal contexts beyond text alone. However, most existing studies ignore the rich information carried by other modalities, such as audio. To investigate the importance of acoustic context, we explore a multi-modal dialogue scenario with aligned text and audio temporal sequences for textual response generation by an assumed system, which we call the RGMD task. To this end, we construct a new multi-modal dataset for this task based on TV shows, containing 84.9K utterances. Considering that response diversity in RGMD is limited by context and modality interactions, we apply a split pre-generation (SPG) strategy and a cross-modal contrastive learning (CCL) strategy in multi-modal pre-training for better response generation. On the one hand, SPG yields many diverse responses without the restrictions imposed by long histories of mixed multi-modal contexts. On the other hand, CCL captures the interactions between text and audio. Extensive experiments demonstrate that our BART-based approach consistently outperforms the state-of-the-art textual approach DP by 4.17%, 8.96%, 2.43%, 1.04% and 7.54% on the BLEU, DIST, ROUGE, METEOR and NIST metrics, respectively. Moreover, our GPT-based approach outperforms the state-of-the-art multi-modal approach RLM by 6.79%, 9.25%, 7.49%, 9.31% and 13.75% on the same metrics, respectively. In addition, our in-depth analysis shows the necessity of audio for response generation and further verifies the effectiveness of our approach.
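
The abstract does not give the concrete form of the CCL objective. A minimal sketch, assuming a symmetric InfoNCE-style loss over aligned text/audio utterance embeddings (the function name, embedding inputs, and temperature value are illustrative placeholders, not the authors' implementation):

import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(text_emb: torch.Tensor,
                                 audio_emb: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of aligned (text, audio) pairs.

    text_emb, audio_emb: (batch, dim) embeddings where row i of each
    tensor comes from the same utterance (a positive pair); all other
    rows in the batch serve as in-batch negatives.
    """
    # L2-normalize so the dot product is cosine similarity.
    text_emb = F.normalize(text_emb, dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)

    # Pairwise similarity matrix; aligned pairs lie on the diagonal.
    logits = text_emb @ audio_emb.t() / temperature
    targets = torch.arange(text_emb.size(0), device=text_emb.device)

    # Contrast in both directions: text-to-audio and audio-to-text.
    loss_t2a = F.cross_entropy(logits, targets)
    loss_a2t = F.cross_entropy(logits.t(), targets)
    return (loss_t2a + loss_a2t) / 2

Under this reading, pulling the two modality embeddings of the same utterance together while pushing apart mismatched pairs is what lets the pre-trained model capture text-audio interactions.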
Keywords
Response generation, Multi-modal dialogues, Text and audio, Split pre-generation, Cross-modal contrastive learning