Beyond Vision: A Multimodal Recurrent Attention Convolutional Neural Network for Unified Image Aesthetic Prediction Tasks

IEEE Transactions on Multimedia (2021)

Abstract
Over the past few years, image aesthetic prediction has attracted increasing attention because of its wide range of applications, such as image retrieval, photo album management, and aesthetic-driven image enhancement. However, previous studies in this area have achieved only limited success because (1) they depend primarily on visual features and ignore textual information, and (2) they tend to attend equally to every part of an image and ignore the selective attention mechanism. This paper overcomes these limitations by proposing a novel multimodal recurrent attention convolutional neural network (MRACNN). More specifically, the MRACNN consists of two streams: a vision stream and a language stream. The former employs a recurrent attention network to tune out irrelevant information and focus on key regions when extracting visual features. The latter uses a Text-CNN to capture the high-level semantics of user comments. Finally, a multimodal factorized bilinear (MFB) pooling approach is used to fuse the textual and visual features effectively. Extensive experiments demonstrate that the proposed MRACNN significantly outperforms state-of-the-art methods on three unified aesthetic prediction tasks: (i) aesthetic quality classification; (ii) aesthetic score regression; and (iii) aesthetic score distribution prediction.
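The fusion step named in the abstract can be illustrated with a minimal NumPy sketch of MFB pooling in its commonly described form: project both feature vectors into a shared space, multiply element-wise, sum-pool over windows of size k, then apply power and L2 normalization. This is not the authors' implementation. The function name `mfb_pool`, the feature dimensions, the factor count `k`, the output size `o`, and the random projection matrices are all assumptions chosen for illustration; in a trained model the projections would be learned.

```python
import numpy as np


def mfb_pool(x, y, U, V, k):
    """Fuse two feature vectors with multimodal factorized bilinear pooling.

    x: visual feature, shape (d_x,)
    y: textual feature, shape (d_y,)
    U: projection for x, shape (d_x, k * o)
    V: projection for y, shape (d_y, k * o)
    k: number of factors summed per output unit
    Returns a fused vector of shape (o,).
    """
    # Project each modality into a shared (k * o)-dimensional space.
    px = U.T @ x
    py = V.T @ y
    # The element-wise product captures pairwise feature interactions.
    joint = px * py
    # Sum-pool over non-overlapping windows of size k, giving o outputs.
    z = joint.reshape(-1, k).sum(axis=1)
    # Power normalization (signed square root), then L2 normalization.
    z = np.sign(z) * np.sqrt(np.abs(z))
    norm = np.linalg.norm(z)
    return z / norm if norm > 0 else z


# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
d_img, d_txt, k, o = 512, 128, 5, 100
U = rng.standard_normal((d_img, k * o)) * 0.01
V = rng.standard_normal((d_txt, k * o)) * 0.01
visual = rng.standard_normal(d_img)   # stand-in for vision-stream features
textual = rng.standard_normal(d_txt)  # stand-in for language-stream features
fused = mfb_pool(visual, textual, U, V, k)
```

The fused vector has `o` entries and unit L2 norm. The factorization keeps the parameter count at (d_x + d_y) * k * o, compared with d_x * d_y * o for a full bilinear map.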
Keywords
Image quality assessment, visual aesthetic quality assessment, long short-term memory (LSTM), deep learning