Learning Semantic Feature Map for Visual Content Recognition

MM '17: Proceedings of the ACM Multimedia Conference, Mountain View, California, USA, October 2017

Cited by 12 | Viewed 60
Abstract
The spatial relationships among objects provide rich contextual clues for visual recognition. In this paper, we propose to learn a Semantic Feature Map (SFM) with deep neural networks to model spatial object context for better understanding of image and video content. Specifically, we first extract high-level semantic object features from the input image with convolutional neural networks, one for each object proposal, and organize them into the designed SFM so that the spatial information among objects is preserved. To fully exploit the spatial relationships among objects, we employ either Fully Convolutional Networks (FCN) or Long Short-Term Memory (LSTM) on top of the SFM for final recognition. For better training, we also introduce a multi-task learning framework that trains the model in an end-to-end manner. It combines an overall image classification loss with a grid labeling loss that predicts the object label at each SFM grid cell. Extensive experiments verify the effectiveness of the proposed approach. For image classification, very promising results are obtained on the Pascal VOC 2007/2012 and MS-COCO benchmarks. We also directly transfer the SFM learned in the image domain to the video classification task. The results on the CCV benchmark demonstrate the robustness and generalization capability of the proposed approach.
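The core step of the abstract — placing per-proposal CNN features onto a spatial grid so that object layout is preserved — can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `build_sfm`, the grid size, and the max-pooling of proposals that fall into the same cell are all assumptions for the sketch.

```python
import numpy as np

def build_sfm(boxes, features, grid_size=8):
    """Arrange per-proposal feature vectors on a spatial grid (a sketch of
    the Semantic Feature Map idea; details here are illustrative).

    boxes:    (N, 4) array of [x1, y1, x2, y2] in normalized [0, 1] coords.
    features: (N, D) array of per-proposal CNN feature vectors.
    Returns a (grid_size, grid_size, D) feature map.
    """
    d = features.shape[1]
    sfm = np.zeros((grid_size, grid_size, d))
    for (x1, y1, x2, y2), f in zip(boxes, features):
        # Map each proposal's center to a grid cell, preserving its
        # spatial position relative to the other proposals.
        cx = (x1 + x2) / 2.0
        cy = (y1 + y2) / 2.0
        gx = min(int(cx * grid_size), grid_size - 1)
        gy = min(int(cy * grid_size), grid_size - 1)
        # When several proposals land in one cell, keep the element-wise
        # maximum (one plausible aggregation choice, assumed here).
        sfm[gy, gx] = np.maximum(sfm[gy, gx], f)
    return sfm
```

An FCN or LSTM can then be run over the resulting `grid_size x grid_size` map, exactly because neighboring cells correspond to spatially neighboring objects.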
Keywords
image representation, contextual fusion, image classification, video classification