Recurrent Relational Memory Network for Unsupervised Image Captioning
IJCAI, pp. 920-926, 2020.
Keywords:
cross-modal; image captioning; multi-layer perceptron; recurrent relational memory network; unsupervised image captioning
Abstract:
Unsupervised image captioning with no annotations is an emerging challenge in computer vision, where the existing arts usually adopt GAN (Generative Adversarial Networks) models. In this paper, we propose a novel memory-based network rather than GAN, named Recurrent Relational Memory Network ($R^2M$). Unlike complicated and sensitive ad...
Introduction
- Traditional image captioning [Yao et al., 2019; Huang et al., 2019] requires full supervision of image-caption pairs annotated by humans.
- Such full supervision is prohibitively expensive to acquire for cross-modal datasets.
Highlights
- Traditional image captioning [Yao et al., 2019; Huang et al., 2019] requires full supervision of image-caption pairs annotated by humans
- Second, we develop a joint exploitation of supervised learning (SPL) and unsupervised learning (UPL) on the disjoint datasets (see the training sketch after this list)
- Orthogonal to GAN-based architectures for unsupervised image captioning, we propose a novel, lightweight Recurrent Relational Memory Network (R2M), which merely utilizes attention-based memory to perform relational semantic reasoning and reconstruction
- The promising improvements demonstrate consistently superior performance
- At time t = 2, the fusion memory focuses much more on the previous word "portrait" (t = 1), as "portrait" is the first generated concept and deserves more attention
- This paper proposes a novel recurrent relational memory network (R2M) for unsupervised image captioning with a low cost of supervision
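To make the joint SPL/UPL exploitation above concrete, the following is a minimal training-loop sketch, not the authors' released code: it assumes a sentence-reconstruction loss on the text corpus (SPL) and an image-side reconstruction loss on the unpaired images (UPL). The method names `sentence_reconstruction_loss` and `image_reconstruction_loss`, the data loaders, and the weight `lam` are hypothetical placeholders.

```python
# A minimal sketch (not the authors' code) of joint SPL + UPL training on
# two disjoint datasets: a sentence corpus and a set of unpaired images.

def train_epoch(model, sentence_loader, image_loader, optimizer, lam=1.0):
    """One epoch alternating over the two disjoint datasets."""
    for sentences, images in zip(sentence_loader, image_loader):
        # SPL: supervision comes from the sentence corpus alone
        # (reconstruct a sentence from its own concept words).
        spl_loss = model.sentence_reconstruction_loss(sentences)

        # UPL: no captions are available, so the generated caption is kept
        # consistent with the visual concepts detected in the image.
        upl_loss = model.image_reconstruction_loss(images)

        loss = spl_loss + lam * upl_loss  # assumed weighted sum of the two terms
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```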
Methods
- The authors formally discuss the proposed R2M. The overall architecture of R2M, depicted in Fig. 2, consists of three modules: an encoder, a decoder, and a reconstructor.
- The authors first discuss the encoder.
- Visual concepts V are randomly and sequentially fed into an LSTM via their word embeddings, yielding the encoded vector v = v_I or v_S from the images I or the sentences S, respectively (a minimal encoder sketch follows this list).
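A minimal sketch of the concept encoder described above, assuming the final LSTM hidden state is taken as the encoded vector v; the layer sizes, the shuffling policy, and the class name `ConceptEncoder` are illustrative assumptions rather than the authors' exact settings.

```python
import torch
import torch.nn as nn

class ConceptEncoder(nn.Module):
    """Encodes a bag of concept words into a single vector v (v_I or v_S)."""

    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # concept word embeddings
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, concept_ids):
        # concept_ids: (batch, num_concepts) indices of detected visual concepts
        # (for images) or of words sampled from a sentence (for the corpus).
        # Shuffle their order so the encoder does not rely on a fixed sequence.
        perm = torch.randperm(concept_ids.size(1))
        emb = self.embed(concept_ids[:, perm])    # (batch, num_concepts, embed_dim)
        _, (h_n, _) = self.lstm(emb)              # feed concepts sequentially
        return h_n[-1]                            # encoded vector v

# Usage: v = ConceptEncoder(vocab_size=10000)(torch.randint(0, 10000, (4, 5)))
```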
Results
- Comparison with the state of the art: R2M exhibits large improvements across all the metrics.
- Both UC-GAN [Feng et al., 2019] and SME-GAN [Laina et al., 2019] rely on complicated GAN training strategies, whereas our R2M is a memory-based solution.
- As shown in Table 1, R2M improves BLEU-4 (B-4) by 14.3%, 48.1%, and 27.7% on the three datasets, where BLEU-4 matches 4-gram phrases (see the snippet below).
- This implies that R2M has a stronger capacity to learn long-range dependencies than the alternatives.
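As a reminder of what BLEU-4 measures, the snippet below scores a candidate caption against a reference by overlapping n-grams up to length 4 using NLTK. The example sentences are made up, and the paper's reported numbers presumably come from the standard caption evaluation toolkit rather than from this snippet.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [["a", "man", "is", "wearing", "a", "suit", "and", "tie"]]
candidate = ["a", "man", "in", "a", "suit", "and", "tie"]

bleu4 = sentence_bleu(
    references,
    candidate,
    weights=(0.25, 0.25, 0.25, 0.25),                # equal weight on 1- to 4-grams
    smoothing_function=SmoothingFunction().method1,  # avoid zero scores on short texts
)
print(f"BLEU-4: {bleu4:.3f}")
```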
Conclusion
- This paper proposes a novel recurrent relational memory network (R2M) for unsupervised image captioning with a low cost of supervision.
- R2M is a lightweight network that uses self-attention and a relational gate to build the fusion and recurrent memories for long-term semantic generation (a sketch of such a memory cell follows the examples below).
- R2M: handsome man in shirt and tie.
- R2M w VQA-v2: what color is the shirt of the man with the tie and suit have on?
- R2M w SentiCap: a man is wearing a suit , tie and nice shirt
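As referenced above, here is a minimal sketch, in the spirit of the relational recurrent memory of [Santoro et al., 2018] that the paper builds on, of a memory cell whose slots are updated by multi-head self-attention and whose relational gate controls how much of the attended update enters the recurrent memory. The slot count, dimensions, and gating form are assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn

class RelationalMemoryCell(nn.Module):
    """Memory slots updated by self-attention over {slots, new input} and a gate."""

    def __init__(self, mem_slots=4, mem_dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(mem_dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(mem_dim, mem_dim), nn.ReLU(),
                                 nn.Linear(mem_dim, mem_dim))
        # relational gate: mixes the previous memory with the attended update
        self.gate = nn.Linear(2 * mem_dim, mem_dim)
        self.mem_slots, self.mem_dim = mem_slots, mem_dim

    def init_memory(self, batch_size):
        return torch.zeros(batch_size, self.mem_slots, self.mem_dim)

    def forward(self, memory, x):
        # memory: (batch, mem_slots, mem_dim); x: (batch, mem_dim), e.g. the
        # embedding of the previously generated word fused with the encoded v.
        keys = torch.cat([memory, x.unsqueeze(1)], dim=1)
        attended, _ = self.attn(memory, keys, keys)   # slots attend to slots + input
        update = self.mlp(attended)
        g = torch.sigmoid(self.gate(torch.cat([memory, update], dim=-1)))
        return g * update + (1.0 - g) * memory        # gated recurrent memory
```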
Tables
- Table 1: Performance comparison with the state-of-the-art methods. The best performance is marked in boldface.
- Table 2: Ablation studies of R2M with different memory settings.
- Table 3: Ablation studies of R2M with different losses. The best performance is marked in boldface.
Funding
- This work is supported by the National Key Research and Development Program of China under grant 2018YFB0804205, and the National Natural Science Foundation of China (NSFC) under grants 61806035, U1936217, 61725203, 61732008, and 61876058
Reference
- [Anderson et al., 2017] Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. Guided open vocabulary image captioning with constrained beam search. In EMNLP, pages 936–945, 2017.
- [Antol et al., 2015] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In ICCV, pages 2425–2433, 2015.
- [Donahue and Simonyan, 2019] Jeff Donahue and Karen Simonyan. Large scale adversarial representation learning. In NeurIPS, pages 10541–10551, 2019.
- [Fan et al., 2019] Chenyou Fan, Xiaofan Zhang, Shu Zhang, Wensheng Wang, Chi Zhang, and Heng Huang. Heterogeneous memory enhanced multimodal attention model for video question answering. In CVPR, pages 1999–2007, 2019.
- [Feng et al., 2019] Yang Feng, Lin Ma, Wei Liu, and Jiebo Luo. Unsupervised image captioning. In CVPR, pages 4125–4134, 2019.
- [Fu et al., 2019] Canmiao Fu, Wenjie Pei, Qiong Cao, Chaopeng Zhang, Yong Zhao, Xiaoyong Shen, and Yu-Wing Tai. Non-local recurrent neural memory for supervised sequence modeling. In ICCV, pages 6311–6320, 2019.
- [Gu et al., 2019] Jiuxiang Gu, Shafiq Joty, Jianfei Cai, Handong Zhao, Xu Yang, and Gang Wang. Unpaired image captioning via scene graph alignments. In ICCV, pages 10323–10332, 2019.
- [Guo et al., 2019] Longteng Guo, Jing Liu, Peng Yao, Jiangwei Li, and Hanqing Lu. MSCap: Multi-style image captioning with unpaired stylized text. In CVPR, pages 4204–4213, 2019.
- [Huang and Wang, 2019] Yan Huang and Liang Wang. ACMM: Aligned cross-modal memory for few-shot image and sentence matching. In ICCV, pages 5774–5783, 2019.
- [Huang et al., 2017] Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. In CVPR, pages 7310–7311, 2017.
- [Huang et al., 2019] Lun Huang, Wenmin Wang, Jie Chen, and Xiao-Yong Wei. Attention on attention for image captioning. In ICCV, pages 4634–4643, 2019.
- [Krasin et al., 2017] Ivan Krasin, Tom Duerig, Neil Alldrin, Vittorio Ferrari, Sami Abu-El-Haija, Alina Kuznetsova, Hassan Rom, Jasper Uijlings, Stefan Popov, Andreas Veit, et al. OpenImages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://github.com/openimages, 2:3, 2017.
- [Kuznetsova et al., 2018] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Tom Duerig, et al. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. arXiv preprint arXiv:1811.00982, 2018.
- [Laina et al., 2019] Iro Laina, Christian Rupprecht, and Nassir Navab. Towards unsupervised image captioning with shared multimodal embeddings. In ICCV, pages 7414–7424, 2019.
- [Lample et al., 2018] Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc’Aurelio Ranzato. Unsupervised machine translation using monolingual corpora only. In ICLR, 2018.
- [Lin et al., 2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740–755, 2014.
- [Mathews et al., 2016] Alexander Patrick Mathews, Lexing Xie, and Xuming He. SentiCap: Generating image descriptions with sentiments. In AAAI, pages 3574–3580, 2016.
- [Nie et al., 2019] Weili Nie, Nina Narodytska, and Ankit Patel. RelGAN: Relational generative adversarial networks for text generation. In ICLR, 2019.
- [Pei et al., 2019] Wenjie Pei, Jiyuan Zhang, Xiangrong Wang, Lei Ke, Xiaoyong Shen, and Yu-Wing Tai. Memory-attended recurrent network for video captioning. In CVPR, pages 8347–8356, 2019.
- [Santoro et al., 2018] Adam Santoro, Ryan Faulkner, David Raposo, Jack Rae, Mike Chrzanowski, Theophane Weber, Daan Wierstra, Oriol Vinyals, Razvan Pascanu, and Timothy Lillicrap. Relational recurrent neural networks. In NeurIPS, pages 7299–7310, 2018.
- [Sharma et al., 2018] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, pages 2556–2565, 2018.
- [Szegedy et al., 2017] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, pages 4278–4284, 2017.
- [Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, pages 5998–6008, 2017.
- [Yang et al., 2018] Zhen Yang, Wei Chen, Feng Wang, and Bo Xu. Unsupervised neural machine translation with weight sharing. In ACL, pages 46–55, 2018.
- [Yao et al., 2019] Ting Yao, Yingwei Pan, Yehao Li, and Tao Mei. Hierarchy parsing for image captioning. In ICCV, pages 2621–2629, 2019.
- [Young et al., 2014] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL, 2:67–78, 2014.