MAF: Multimodal Auto Attention Fusion for Video Classification

Chengjie Zheng, Wei Ding, Shiqian Shen, Ping Chen

IEA/AIE (1) (2023)

Abstract
Video classification is a complex task that involves analyzing audio and video signals with deep neural models. To classify these signals reliably, researchers have developed multimodal fusion techniques that combine audio and video data into compact, efficiently processed representations. However, previous approaches to multimodal data fusion have relied heavily on manually designed attention mechanisms. To address this limitation, we propose the Multimodal Auto Attention Fusion (MAF) model, which uses Neural Architecture Search (NAS) to automatically identify effective attention representations for a wide range of tasks. Our approach includes a custom-designed search space that supports the automatic generation of attention representations: by automating the design of the Key, Query, and Value representations, MAF strengthens its self-attention mechanism and yields highly effective attention designs. Compared to other multimodal fusion methods, our approach is competitive at capturing interactions between modalities. Experiments on three large datasets (UCF101, ActivityNet, and YouTube-8M) confirm the effectiveness of our approach and demonstrate superior performance over other popular models, as well as robust generalization across diverse datasets.
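
The abstract does not give implementation details, but the core idea it describes, selecting the Query/Key/Value projections of a cross-modal attention block from a search space rather than hand-designing them, can be sketched. Below is a minimal PyTorch sketch assuming a DARTS-style continuous relaxation over three toy candidate ops; the candidate set, the video-as-queries/audio-as-keys-and-values arrangement, and all names and dimensions are illustrative assumptions, not the paper's actual search space.

```python
# Minimal sketch: attention-based audio-video fusion with a searchable
# Q/K/V design. The candidate ops and the DARTS-style softmax relaxation
# are assumptions for illustration, not the authors' method.
import torch
import torch.nn as nn


def make_candidate(dim: int, kind: str) -> nn.Module:
    """One candidate op for producing a Query, Key, or Value representation."""
    if kind == "linear":
        return nn.Linear(dim, dim)
    if kind == "mlp":
        return nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
    if kind == "identity":
        return nn.Identity()
    raise ValueError(f"unknown candidate op: {kind}")


class AutoAttentionFusion(nn.Module):
    """Cross-modal attention whose Q/K/V projections are mixtures over a
    small candidate space, weighted by learnable architecture parameters."""

    def __init__(self, dim: int, choices=("linear", "mlp", "identity")):
        super().__init__()
        self.q_ops = nn.ModuleList(make_candidate(dim, c) for c in choices)
        self.k_ops = nn.ModuleList(make_candidate(dim, c) for c in choices)
        self.v_ops = nn.ModuleList(make_candidate(dim, c) for c in choices)
        # One row of architecture weights per role (Q, K, V); after search,
        # each row would be discretized to its argmax candidate.
        self.alpha = nn.Parameter(torch.zeros(3, len(choices)))
        self.scale = dim ** -0.5

    def _mix(self, x, ops, weights):
        return sum(w * op(x) for w, op in zip(weights, ops))

    def forward(self, video, audio):
        # video: (B, Tv, D) frame features; audio: (B, Ta, D) audio features.
        w = self.alpha.softmax(dim=-1)
        q = self._mix(video, self.q_ops, w[0])  # queries from the video stream
        k = self._mix(audio, self.k_ops, w[1])  # keys from the audio stream
        v = self._mix(audio, self.v_ops, w[2])  # values from the audio stream
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
        return attn @ v                         # (B, Tv, D) fused features


fusion = AutoAttentionFusion(dim=128)
video = torch.randn(2, 32, 128)   # batch of 32 frame-level features
audio = torch.randn(2, 50, 128)   # batch of 50 audio-step features
fused = fusion(video, audio)      # -> torch.Size([2, 32, 128])
```

In a sketch like this, the architecture parameters `alpha` would be optimized alongside the network weights during the search phase, then frozen by keeping only the highest-weighted candidate per role; how MAF actually parameterizes and discretizes its search space is not specified in the abstract.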
Keywords
multimodal auto attention fusion, MAF, classification