Deep-Net Fusion To Classify Shots In Concert Videos

2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP)(2017)

Cited 28 | Views 47
Abstract
The type of shot is a fundamental element in the language of film, commonly used by directors in visual storytelling to convey emotion, ideas, and art. To classify such shot types from images, we present a new framework that addresses two key issues. We first focus on learning more effective features by fusing the layer-wise outputs extracted from a deep convolutional neural network (CNN) pre-trained on a large-scale dataset for object recognition. We then introduce a probabilistic fusion model, termed the error-weighted deep cross-correlation model (EW-Deep-CCM), to boost classification accuracy. Specifically, the deep neural network-based cross-correlation model (Deep-CCM) is constructed not only to model the extracted CNN feature hierarchies independently but also to relate the statistical dependencies of paired features from different layers. A Bayesian error-weighting scheme for classifier combination is then adopted to exploit the contributions of the individual Deep-CCM classifiers and enhance shot classification accuracy. We provide extensive experimental results on a dataset of live concert videos to demonstrate the advantage of the proposed EW-Deep-CCM over existing popular fusion approaches. Video demos can be found at https://sites.google.com/site/ewdeepccm2/demo.
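The error-weighting idea in the abstract, combining per-layer classifier posteriors with weights derived from each classifier's error, can be sketched as follows. This is a minimal illustration, not the paper's exact Bayesian formulation: the `1 - error` weighting rule and the example numbers are assumptions for demonstration.

```python
import numpy as np

def error_weighted_fusion(layer_posteriors, layer_errors):
    """Fuse class posteriors from layer-specific classifiers.

    A sketch of the error-weighting idea: classifiers with lower
    validation error receive larger weights (here simply 1 - error,
    normalized to sum to 1; the paper's Bayesian scheme may differ).
    """
    errors = np.asarray(layer_errors, dtype=float)
    weights = 1.0 - errors
    weights = weights / weights.sum()
    fused = np.zeros_like(np.asarray(layer_posteriors[0], dtype=float))
    for w, p in zip(weights, layer_posteriors):
        fused += w * np.asarray(p, dtype=float)
    return fused

# Hypothetical posteriors over 3 shot classes from two per-layer classifiers.
p_conv = [0.6, 0.3, 0.1]   # classifier on a mid-level convolutional feature
p_fc   = [0.2, 0.7, 0.1]   # classifier on a fully connected layer feature
fused = error_weighted_fusion([p_conv, p_fc], layer_errors=[0.10, 0.30])
print(fused)            # weighted mixture of the two posteriors
print(int(fused.argmax()))  # predicted shot class
```

The fused posterior remains a valid probability distribution because the weights are convex coefficients over distributions that each sum to one.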
Keywords
Types of shots, convolutional neural networks, live concert, language of film