Recurrent Relational Memory Network for Unsupervised Image Captioning

IJCAI, pp. 920-926, 2020.

Keywords:
cross-modal, image captioning, multi-layer perceptron, recurrent relational memory network, unsupervised image captioning

Abstract:

Unsupervised image captioning with no annotations is an emerging challenge in computer vision, where the existing arts usually adopt GAN (Generative Adversarial Networks) models. In this paper, we propose a novel memory-based network rather than GAN, named Recurrent Relational Memory Network ($R^2M$). Unlike complicated and sensitive ad…

Introduction
  • Traditional image captioning [Yao et al., 2019; Huang et al., 2019] requires full supervision from image-caption pairs annotated by humans.
  • Such full supervision is prohibitively expensive to acquire for cross-modal datasets.
Highlights
  • Traditional image captioning [Yao et al., 2019; Huang et al., 2019] requires full supervision from image-caption pairs annotated by humans
  • We develop a joint exploitation of supervised learning (SPL) and unsupervised learning (UPL) on disjoint datasets (a minimal sketch of this joint objective follows this list)
  • Orthogonal to GAN-based architectures for unsupervised image captioning, we propose a novel lightweight Recurrent Relational Memory Network (R2M), which merely utilizes attention-based memory to perform relational semantic reasoning and reconstruction
  • The promising improvements demonstrate consistently superior performance
  • At time t = 2, the fusion memory focuses much more on the previous word "portrait" (t = 1), as "portrait" is the first generated concept and deserves more attention
  • This paper proposes a novel recurrent relational memory network (R2M) for unsupervised image captioning with a low cost of supervision
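A minimal sketch of how the joint SPL + UPL training highlighted above could be wired together: a cross-entropy sentence-reconstruction loss on the text corpus (SPL) combined with a concept-reconstruction loss on unpaired images (UPL). The model interface (`encode`, `decode`, `reconstruct`) and the weighting `lambda_upl` are illustrative assumptions, not the authors' implementation.

```python
import torch.nn.functional as F

def joint_training_step(model, text_batch, image_batch, lambda_upl=1.0):
    # SPL: on the text corpus, reconstruct a sentence from its own
    # extracted concept words (supervision comes from the text alone).
    concept_ids_s, sentence_ids = text_batch                   # (B, Nc), (B, T)
    sent_logits = model.decode(model.encode(concept_ids_s))    # (B, T, V)
    loss_spl = F.cross_entropy(sent_logits.flatten(0, 1),
                               sentence_ids.flatten())

    # UPL: on unpaired images, caption from detected visual concepts and
    # ask the reconstructor to recover those concepts from the caption.
    concept_ids_i = image_batch                                # (B, Nc) detector output
    caption_logits = model.decode(model.encode(concept_ids_i))
    recon_logits = model.reconstruct(caption_logits)           # (B, Nc, V)
    loss_upl = F.cross_entropy(recon_logits.flatten(0, 1),
                               concept_ids_i.flatten())

    return loss_spl + lambda_upl * loss_upl
```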
Methods
  • The authors formally present the proposed R2M. The overall architecture, depicted in Fig. 2, consists of three modules: an encoder, a decoder, and a reconstructor.
  • The authors first discuss the encoder.
  • Visual concepts V are randomly and sequentially fed into an LSTM via their word embeddings, yielding the encoded vector v = v_I or v_S from the image I or the sentence S, respectively (an illustrative encoder sketch follows this list).
  • Quantitative comparisons with SME-GAN [Laina et al., 2019] on the MSCOCO↔GCC setting, in terms of CIDEr and SPICE, are reported in Table 1.
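As a rough illustration of the encoder bullet above, the sketch below embeds the concept words in a random order and runs them through an LSTM, taking the final hidden state as the encoded vector v. Layer sizes and the shuffling policy are assumptions for illustration, not the paper's exact configuration.

```python
import random
import torch.nn as nn

class ConceptEncoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, concept_ids):
        # concept_ids: (batch, num_concepts) integer ids of detected concepts
        order = list(range(concept_ids.size(1)))
        random.shuffle(order)                    # random sequential order
        embedded = self.embed(concept_ids[:, order])
        _, (h_n, _) = self.lstm(embedded)
        return h_n[-1]                           # encoded vector v
```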
Results
  • Comparison with the state of the art: R2M exhibits large improvements across all metrics.
  • Both UC-GAN [Feng et al., 2019] and SME-GAN [Laina et al., 2019] rely on complicated GAN training strategies, whereas our R2M is a memory-based solution.
  • As shown in Table 1, R2M improves BLEU-4 (B-4) by 14.3%, 48.1% and 27.7% on the three datasets; BLEU-4 is computed over 4-gram phrases (a toy 4-gram precision example follows this list).
  • This implies that R2M has a stronger capacity to learn long-range dependencies than the competing methods.
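To make the 4-gram point concrete, here is a toy computation of clipped 4-gram precision between a candidate caption and a single reference (brevity penalty and lower-order n-grams are omitted); it is not the evaluation script used in the paper.

```python
from collections import Counter

def four_gram_precision(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    cand_ngrams = Counter(zip(cand, cand[1:], cand[2:], cand[3:]))
    ref_ngrams = Counter(zip(ref, ref[1:], ref[2:], ref[3:]))
    matched = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
    return matched / max(sum(cand_ngrams.values()), 1)

# 3 of the candidate's 5 4-grams appear in the reference -> 0.6
print(four_gram_precision("a man is wearing a suit and tie",
                          "a man is wearing a suit , tie and nice shirt"))
```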
Conclusion
  • This paper proposes a novel recurrent relational memory network (R2M) for unsupervised image captioning with a low cost of supervision.
  • R2M is a lightweight network that uses self-attention and a relational gate to build the fusion and recurrent memory for long-term semantic generation (a sketch of such a memory update follows the qualitative examples below).
  • R2M: handsome man in shirt and tie.
  • R2M w VQA-v2: what color is the shirt of the man with the tie and suit have on?
  • R2M w SentiCap: a man is wearing a suit , tie and nice shirt
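The sketch below illustrates, under assumed slot counts, dimensions, and gating details, a relational memory update in the spirit of the fusion/recurrent memory summarized above: the memory slots attend over themselves together with the current input, and a sigmoid gate controls how much of the attended candidate is written back. The paper's exact formulation may differ.

```python
import torch
import torch.nn as nn

class RelationalMemoryCell(nn.Module):
    def __init__(self, dim=512, heads=8, mem_slots=4):
        super().__init__()
        self.dim, self.mem_slots = dim, mem_slots
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, dim))

    def init_memory(self, batch_size):
        # zero-initialized memory keeps the sketch simple
        return torch.zeros(batch_size, self.mem_slots, self.dim)

    def forward(self, memory, x):
        # memory: (B, mem_slots, dim); x: (B, 1, dim) current word embedding
        mem_and_input = torch.cat([memory, x], dim=1)
        attended, _ = self.attn(memory, mem_and_input, mem_and_input)
        candidate = self.mlp(attended)
        g = torch.sigmoid(self.gate(torch.cat([memory, candidate], dim=-1)))
        return g * candidate + (1.0 - g) * memory    # gated recurrent update
```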
Tables
  • Table 1: Performance comparison with the state-of-the-art methods. The best performance is marked in boldface
  • Table 2: Ablation studies of R2M with different memory settings
  • Table 3: Ablation studies of R2M with different losses. The best performance is marked in boldface
Funding
  • This work is supported by the National Key Research and Development Program of China under grant 2018YFB0804205, and the National Natural Science Foundation of China (NSFC) under grants 61806035, U1936217, 61725203, 61732008, and 61876058
Reference
  • [Anderson et al., 2017] Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. Guided open vocabulary image captioning with constrained beam search. In EMNLP, pages 936–945, 2017.
  • [Antol et al., 2015] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In ICCV, pages 2425–2433, 2015.
  • [Donahue and Simonyan, 2019] Jeff Donahue and Karen Simonyan. Large scale adversarial representation learning. In NeurIPS, pages 10541–10551, 2019.
  • [Fan et al., 2019] Chenyou Fan, Xiaofan Zhang, Shu Zhang, Wensheng Wang, Chi Zhang, and Heng Huang. Heterogeneous memory enhanced multimodal attention model for video question answering. In CVPR, pages 1999–2007, 2019.
  • [Feng et al., 2019] Yang Feng, Lin Ma, Wei Liu, and Jiebo Luo. Unsupervised image captioning. In CVPR, pages 4125–4134, 2019.
  • [Fu et al., 2019] Canmiao Fu, Wenjie Pei, Qiong Cao, Chaopeng Zhang, Yong Zhao, Xiaoyong Shen, and Yu-Wing Tai. Non-local recurrent neural memory for supervised sequence modeling. In ICCV, pages 6311–6320, 2019.
  • [Gu et al., 2019] Jiuxiang Gu, Shafiq Joty, Jianfei Cai, Handong Zhao, Xu Yang, and Gang Wang. Unpaired image captioning via scene graph alignments. In ICCV, pages 10323–10332, 2019.
  • [Guo et al., 2019] Longteng Guo, Jing Liu, Peng Yao, Jiangwei Li, and Hanqing Lu. MSCap: Multi-style image captioning with unpaired stylized text. In CVPR, pages 4204–4213, 2019.
  • [Huang and Wang, 2019] Yan Huang and Liang Wang. ACMM: Aligned cross-modal memory for few-shot image and sentence matching. In ICCV, pages 5774–5783, 2019.
  • [Huang et al., 2017] Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. In CVPR, pages 7310–7311, 2017.
  • [Huang et al., 2019] Lun Huang, Wenmin Wang, Jie Chen, and Xiao-Yong Wei. Attention on attention for image captioning. In ICCV, pages 4634–4643, 2019.
  • [Krasin et al., 2017] Ivan Krasin, Tom Duerig, Neil Alldrin, Vittorio Ferrari, Sami Abu-El-Haija, Alina Kuznetsova, Hassan Rom, Jasper Uijlings, Stefan Popov, Andreas Veit, et al. OpenImages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://github.com/openimages, 2:3, 2017.
  • [Kuznetsova et al., 2018] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Tom Duerig, et al. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. arXiv preprint arXiv:1811.00982, 2018.
  • [Laina et al., 2019] Iro Laina, Christian Rupprecht, and Nassir Navab. Towards unsupervised image captioning with shared multimodal embeddings. In ICCV, pages 7414–7424, 2019.
  • [Lample et al., 2018] Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc’Aurelio Ranzato. Unsupervised machine translation using monolingual corpora only. In ICLR, 2018.
  • [Lin et al., 2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740–755, 2014.
  • [Mathews et al., 2016] Alexander Patrick Mathews, Lexing Xie, and Xuming He. Senticap: Generating image descriptions with sentiments. In AAAI, pages 3574–3580, 2016.
  • [Nie et al., 2019] Weili Nie, Nina Narodytska, and Ankit Patel. RelGAN: Relational generative adversarial networks for text generation. In ICLR, 2019.
  • [Pei et al., 2019] Wenjie Pei, Jiyuan Zhang, Xiangrong Wang, Lei Ke, Xiaoyong Shen, and Yu-Wing Tai. Memory-attended recurrent network for video captioning. In CVPR, pages 8347–8356, 2019.
  • [Santoro et al., 2018] Adam Santoro, Ryan Faulkner, David Raposo, Jack Rae, Mike Chrzanowski, Theophane Weber, Daan Wierstra, Oriol Vinyals, Razvan Pascanu, and Timothy Lillicrap. Relational recurrent neural networks. In NeurIPS, pages 7299–7310, 2018.
  • [Sharma et al., 2018] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, pages 2556–2565, 2018.
  • [Szegedy et al., 2017] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI, pages 4278–4284, 2017.
  • [Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, pages 5998–6008, 2017.
  • [Yang et al., 2018] Zhen Yang, Wei Chen, Feng Wang, and Bo Xu. Unsupervised neural machine translation with weight sharing. In ACL, pages 46–55, 2018.
  • [Yao et al., 2019] Ting Yao, Yingwei Pan, Yehao Li, and Tao Mei. Hierarchy parsing for image captioning. In ICCV, pages 2621–2629, 2019.
  • [Young et al., 2014] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL, 2:67–78, 2014.