Multi-Stream Multi-Class Fusion Of Deep Networks For Video Classification

CoRR (2016)

Cited by 223 | Views 86
Abstract
This paper studies deep network architectures for video classification. A multi-stream framework is proposed to fully exploit the rich multimodal information in videos. Specifically, we first train three Convolutional Neural Networks to model spatial, short-term motion, and audio cues respectively. Long Short-Term Memory networks are then adopted to capture long-term temporal dynamics. Given the per-class outputs of the individual streams, we propose to mine the class relationships hidden in the data from the trained models. The automatically discovered relationships are then used as a prior in the multi-stream multi-class fusion process, indicating which and how much information is needed from the other classes, to adaptively determine the optimal fusion weights for generating the final score of each class. Our contributions are twofold. First, the multi-stream framework exploits multimodal features that are more comprehensive than those previously attempted. Second, the proposed fusion method not only learns the best weights of the multiple network streams for each class, but also takes class relationships into account, a cue known to be helpful in multi-class visual classification tasks. Our framework produces significantly better results than the state of the art on two popular benchmarks: 92.2% on UCF-101 (without using audio) and 84.9% on Columbia Consumer Videos.
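To make the fusion idea concrete, here is a minimal sketch of class-aware multi-stream fusion. It assumes per-stream class scores have already been computed; the module name MultiStreamFusion, the softmax-normalized per-class stream weights, and the linear class-relationship prior are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class MultiStreamFusion(nn.Module):
    """Illustrative sketch: fuse per-class scores from several streams with
    learnable per-(stream, class) weights and a class-relationship matrix
    acting as a prior. Hypothetical, not the authors' exact method."""

    def __init__(self, num_streams: int, num_classes: int):
        super().__init__()
        # One fusion weight per (stream, class); softmax-normalized over streams.
        self.stream_logits = nn.Parameter(torch.zeros(num_streams, num_classes))
        # Class-relationship prior: entry (i, j) controls how much the score
        # of class j contributes to the final score of class i.
        self.class_rel = nn.Parameter(torch.eye(num_classes))

    def forward(self, stream_scores: torch.Tensor) -> torch.Tensor:
        # stream_scores: (num_streams, batch, num_classes)
        w = torch.softmax(self.stream_logits, dim=0)          # (streams, classes)
        fused = (w.unsqueeze(1) * stream_scores).sum(dim=0)   # (batch, classes)
        # Borrow evidence from related classes via the learned prior.
        return fused @ self.class_rel.t()

# Usage: three streams (spatial, motion, audio) and 101 classes (e.g. UCF-101).
scores = torch.randn(3, 8, 101)   # stand-in for pre-computed per-stream scores
model = MultiStreamFusion(num_streams=3, num_classes=101)
final = model(scores)             # (8, 101) fused per-class scores
```

In this simplified form, the stream weights play the role of the per-class fusion weights described in the abstract, while the relationship matrix stands in for the mined class-relationship prior; the paper learns both from the trained models rather than initializing them arbitrarily as done here.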
Keywords
Video Classification, CNN, LSTM, Fusion