Abstractive Summarization for Video: A Revisit in Multistage Fusion Network With Forget Gate

IEEE Transactions on Multimedia(2023)

引用 0|浏览10
暂无评分
摘要
Multimodal abstractive summarization for videos is an emerging task that aims to generate a summary from multi-source information (i.e., video, audio transcript). The challenge is how to merge multimodal long sequences to capture rich semantic information without allowing possible noise from either lengthy modal sequence to degrade the other modality and thus hurt the entire model. To address the issues, we propose a multistage fusion network with forget gate (MFFG), which selectively integrates multi-source information through the cross-fusion in encoding and hierarchical fusion in decoding between modalities, and design a fusion forget gate module to suppress the potential multimodal noise flow of multi-source long sequence. Meanwhile, considering that the source text in this task is lengthy and has the same distribution as the output summary text, we inherit the partial structure of the MFFG model and again propose its variant, single-stage fusion network with forget gate (SFFG), which simplifies the fusion schema, and leverages the long source text to enhance the representation of the target summary. Experimental results on How2 dataset and How2-300 dataset demonstrate the superiority of the two multimodal fusion methods. Further, we provide a version of ASR transcription data of How2 dataset to evaluate model performance under noisy scenarios, and experimental results show obvious advantages of our proposed models over prior systems.
更多
查看译文
关键词
Abstractive summarization,abstractive summarization for videos,multimodal,multimodal fusion,forget gate
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要