Learning to Discretely Compose Reasoning Module Networks for Video Captioning

Ganchao Tan
Daqing Liu
Zheng-Jun Zha

IJCAI, pp. 745-752, 2020.

Keywords:
visual grounding, Attention over Space, video captioning, visual question answering, visual recognition

Abstract:

Generating natural language descriptions for videos, i.e., video captioning, essentially requires step-by-step reasoning along the generation process. For example, to generate the sentence "a man is shooting a basketball", we need to first locate and describe the subject "man", next reason out the man is "shooting", then describe the ob...

Introduction
  • Video captioning, the task of automatically generating natural language descriptions for videos, has received increasing attention in computer vision and machine learning.
  • Most existing video captioning methods [Venugopalan et al, 2015; Donahue et al, 2015] follow the encoder-decoder framework, where a CNN is employed as an encoder to produce the video features and an RNN is employed as a decoder to generate the captions (see the sketch after this list).
  • Those methods usually neglect the nature of the above human-level reasoning, hurting the explainability of the generation process.
  • Therefore, the model must dynamically compose the reasoning structure along the generation process.
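    The encoder-decoder framework mentioned above can be illustrated with the following minimal sketch, which pairs pre-extracted CNN frame features with an LSTM decoder. The class name, hidden size, and mean-pooling of frame features are illustrative assumptions, not the authors' RMN architecture.

    import torch
    import torch.nn as nn

    class EncoderDecoderCaptioner(nn.Module):
        """Minimal CNN-encoder / RNN-decoder captioning baseline (illustrative sketch)."""

        def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=9732):
            super().__init__()
            # Frame features are assumed to be pre-extracted by a CNN (e.g., ResNet).
            self.feat_proj = nn.Linear(feat_dim, hidden_dim)    # project CNN features
            self.embed = nn.Embedding(vocab_size, hidden_dim)   # word embeddings
            self.decoder = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
            self.classifier = nn.Linear(hidden_dim, vocab_size)

        def forward(self, frame_feats, captions):
            # frame_feats: (B, T, feat_dim) CNN features; captions: (B, L) token ids
            video_ctx = self.feat_proj(frame_feats).mean(dim=1)  # mean-pool over frames
            h0 = video_ctx.unsqueeze(0)                          # init decoder with video context
            c0 = torch.zeros_like(h0)
            words = self.embed(captions)                         # (B, L, hidden_dim)
            out, _ = self.decoder(words, (h0, c0))
            return self.classifier(out)                          # (B, L, vocab_size) logits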
Highlights
  • Video captioning, the task of automatically generating natural language descriptions for videos, has received increasing attention in computer vision and machine learning
  • Even though some recent works have explored visual reasoning in visual question answering [Andreas et al, 2016; Hu et al, 2017; Yang et al, 2019a] and visual grounding [Cirik et al, 2018; Liu et al, 2019; Hong et al, 2019] by decomposing questions or referring expressions into a linear or tree reasoning structure with several neural modules, the situation in video captioning is more challenging because: 1) unlike still images, videos contain richer visual content, requiring more complex visual reasoning over both space and time; and 2) unlike questions or referring expressions, which are given in advance, the video descriptions are not available during inference
  • Our main contributions are three-fold: 1) We propose a novel framework named reasoning module networks (RMN) for video captioning with three spatio-temporal visual reasoning modules; 2) We adopt a discrete module selector to dynamically compose the reasoning process from these modules; 3) Our RMN achieves new state-of-the-art performance with an explicit and explainable generation process
  • We proposed novel reasoning module networks (RMN) for video captioning that perform visual reasoning at each step of the generation process
  • To dynamically compose the reasoning modules, we proposed a discrete module selector which is trained by a linguistic loss with a Gumbel approximation (see the sketch after this list)
  • Extensive experiments verified the effectiveness of the proposed RMN, and the qualitative results indicated that the caption generation process is explicit and explainable
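    The discrete module selection with a Gumbel approximation highlighted above can be sketched as follows, using the straight-through Gumbel-Softmax estimator [Jang et al, 2017] to pick one of K candidate module outputs at each decoding step. The layer names, dimensions, and number of modules are assumptions for illustration, not the authors' exact implementation.

    import torch.nn as nn
    import torch.nn.functional as F

    class DiscreteModuleSelector(nn.Module):
        """Pick one of K candidate modules per decoding step via straight-through Gumbel-Softmax."""

        def __init__(self, hidden_dim=512, num_modules=3, tau=1.0):
            super().__init__()
            self.scorer = nn.Linear(hidden_dim, num_modules)  # logits over the K modules
            self.tau = tau

        def forward(self, decoder_state, module_outputs):
            # decoder_state: (B, hidden_dim); module_outputs: (B, K, hidden_dim)
            logits = self.scorer(decoder_state)                             # (B, K)
            # hard=True samples a one-hot choice in the forward pass, while gradients
            # flow through the soft relaxation (straight-through estimator).
            one_hot = F.gumbel_softmax(logits, tau=self.tau, hard=True)     # (B, K)
            selected = (one_hot.unsqueeze(-1) * module_outputs).sum(dim=1)  # (B, hidden_dim)
            return selected, one_hot

    Because the forward pass commits to a single module, the chosen module can be read off directly at each step, which is what makes the generation process explicit and explainable.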
Methods
  • 4.1 Datasets and Metrics

    The authors conduct experiments on two widely used video captioning datasets with several standard evaluation metrics to verify the effectiveness of the proposed method.

    Datasets: MSVD and MSR-VTT.
  • The MSVD dataset [Chen and Dolan, 2011] consists of 1,970 short video clips collected from YouTube; each clip depicts a single activity in the open domain and is annotated with multi-lingual captions.
  • MSR-VTT [Xu et al, 2016] is a large-scale dataset for open-domain video captioning; it consists of 10,000 video clips from 20 categories, and each video clip is annotated with 20 English sentences by Amazon Mechanical Turk workers.
  • The authors use the standard MSR-VTT splits, namely 6,513 clips for training, 497 clips for validation, and 2,990 clips for testing.
Results
  • The authors use several widely used automatic evaluation metrics to evaluate the quality of the generated captions, i.e., BLEU [Papineni et al, 2002], METEOR [Banerjee and Lavie, 2005], CIDEr [Vedantam et al, 2015], ROUGE-L [Lin, 2004]
  • Most of these metrics were originally proposed for machine translation or image captioning; for all of them, a higher score indicates better caption quality.
  • The vocabulary size is set to 7,351 for MSVD and 9,732 for MSR-VTT after removing words that appear fewer than two times and fewer than five times, respectively (a minimal sketch of this preprocessing follows).
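    A minimal sketch of this vocabulary preprocessing, assuming whitespace-tokenized captions; the special tokens and the function name are illustrative, and only the frequency thresholds come from the paper.

    from collections import Counter

    def build_vocab(tokenized_captions, min_count):
        """Build a word-to-id vocabulary, dropping rare words (illustrative sketch).

        tokenized_captions: iterable of token lists, e.g. [["a", "man", "is", "shooting"], ...]
        min_count: 2 for MSVD and 5 for MSR-VTT, per the thresholds reported above.
        """
        counts = Counter(token for caption in tokenized_captions for token in caption)
        # The special tokens below are an assumption; the paper does not list them.
        vocab = {"<pad>": 0, "<bos>": 1, "<eos>": 2, "<unk>": 3}
        for word, freq in counts.most_common():
            if freq >= min_count:
                vocab[word] = len(vocab)
        return vocab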
Conclusion
  • The authors proposed novel reasoning module networks (RMN) for video captioning that perform visual reasoning at each step of the generation process.
  • The authors designed three sophisticated reasoning modules for spatio-temporal visual reasoning.
  • To dynamically compose the reasoning modules, the authors proposed a discrete module selector which is trained by a linguistic loss with a Gumbel approximation.
  • Extensive experiments verified the effectiveness of the proposed RMN, and the qualitative results indicated that the caption generation process is explicit and explainable.
Tables
  • Table1: The performance of ablated models with various settings on the MSVD and MSR-VTT datasets. B@4, R, M, C denote BLEU-4, ROUGE-L, METEOR, and CIDEr, respectively
  • Table2: Comparison with the state of the art on the MSVD and MSR-VTT datasets. B@4, R, M, C denote BLEU-4, ROUGE-L, METEOR, and CIDEr, respectively. The highest score is highlighted in bold and the second highest is underlined
Related work
  • 2.1 Video Captioning

    There are two main directions to solve the video captioning problem. In the early stage, template-based methods [Kojima et al, 2002; Guadarrama et al, 2013], which first define a sentence template with grammar rules and then align the subject, verb, and object of the sentence template with the video content, were widely studied. Those methods struggle to generate flexible language due to the fixed syntactic structure of the predefined template. Benefiting from the rapid development of deep neural networks, sequence learning methods [Venugopalan et al, 2015; Yao et al, 2015; Pan et al, 2017] are now widely used to describe videos with flexible natural language; most of these methods are based on the encoder-decoder framework. [Venugopalan et al, 2015] proposed the S2VT model, which regards video captioning as a machine translation task. [Yao et al, 2015] introduced a temporal attention mechanism that assigns weights to the features of each frame and then fuses them based on the attention weights (a minimal sketch of this mechanism is given after this paragraph). [Li et al, 2017; Chen and Jiang, 2019] further applied spatial attention mechanisms on each frame.
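    The following minimal sketch illustrates such a temporal attention mechanism: each frame feature is scored against the current decoder state, and the frame features are fused by the resulting weights. The layer names and dimensions are illustrative assumptions, not the exact model of [Yao et al, 2015].

    import torch
    import torch.nn as nn

    class TemporalAttention(nn.Module):
        """Soft attention over frame features, in the spirit of [Yao et al, 2015] (illustrative)."""

        def __init__(self, feat_dim=2048, hidden_dim=512, attn_dim=256):
            super().__init__()
            self.w_feat = nn.Linear(feat_dim, attn_dim)
            self.w_state = nn.Linear(hidden_dim, attn_dim)
            self.score = nn.Linear(attn_dim, 1)

        def forward(self, frame_feats, decoder_state):
            # frame_feats: (B, T, feat_dim); decoder_state: (B, hidden_dim)
            energies = self.score(torch.tanh(
                self.w_feat(frame_feats) + self.w_state(decoder_state).unsqueeze(1)
            ))                                             # (B, T, 1)
            weights = torch.softmax(energies, dim=1)       # attention weights over frames
            context = (weights * frame_feats).sum(dim=1)   # (B, feat_dim) fused feature
            return context, weights.squeeze(-1)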

    Recently, [Wang et al, 2019] and [Hou et al, 2019] proposed to leverage Part-of-Speech (POS) tags to boost video captioning. [Wang et al, 2019] encodes the predicted POS sequences into hidden features, which further guide the generation process. [Hou et al, 2019] mixes word probabilities of multiple components at each timestep conditioned on the inferred POS tags. However, both of them lack the reasoning capability needed for rich video content. In contrast, we propose three well-designed reasoning module networks that correspond to three fundamental reasoning mechanisms.
Funding
  • This work was supported by the National Key R&D Program of China under Grant 2017YFB1300201, the National Natural Science Foundation of China (NSFC) under Grants U19B2038, 61620106009 and 61725203 as well as the Fundamental Research Funds for the Central Universities under Grant WK2100100030
Reference
  • [Anderson et al., 2018] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, 2018.
  • [Andreas et al., 2016] Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module networks. In CVPR, 2016.
  • [Bahdanau et al., 2015] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
  • [Banerjee and Lavie, 2005] Satanjeev Banerjee and Alon Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pages 65–72, 2005.
  • [Carreira and Zisserman, 2017] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, 2017.
  • [Chen and Dolan, 2011] David L Chen and William B Dolan. Collecting highly parallel data for paraphrase evaluation. In ACL, 2011.
  • [Chen and Jiang, 2019] Shaoxiang Chen and Yu-Gang Jiang. Motion guided spatial attention for video captioning. In AAAI, 2019.
  • [Cirik et al., 2018] Volkan Cirik, Taylor Berg-Kirkpatrick, and Louis-Philippe Morency. Using syntax to ground referring expressions in natural images. In AAAI, 2018.
  • [Donahue et al., 2015] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.
  • [Guadarrama et al., 2013] Sergio Guadarrama, Niveda Krishnamoorthy, Girish Malkarnenkar, Subhashini Venugopalan, Raymond Mooney, Trevor Darrell, and Kate Saenko. Youtube2text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In ICCV, 2013.
  • [He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [Hong et al., 2019] Richang Hong, Daqing Liu, Xiaoyu Mo, Xiangnan He, and Hanwang Zhang. Learning to compose and reason with language tree structures for visual grounding. T-PAMI, 2019.
  • [Hou et al., 2019] Jingyi Hou, Xinxiao Wu, Wentian Zhao, Jiebo Luo, and Yunde Jia. Joint syntax representation learning and visual cue translation for video captioning. In ICCV, 2019.
  • [Hu et al., 2017] Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Kate Saenko. Learning to reason: End-to-end module networks for visual question answering. In ICCV, 2017.
  • [Hu et al., 2018] Ronghang Hu, Jacob Andreas, Trevor Darrell, and Kate Saenko. Explainable neural computation via stack neural module networks. In ECCV, 2018.
  • [Jang et al., 2017] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. In ICLR, 2017.
  • [Kay et al., 2017] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The kinetics human action video dataset. arXiv:1705.06950, 2017.
  • [Kingma and Ba, 2015] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • [Kojima et al., 2002] Atsuhiro Kojima, Takeshi Tamura, and Kunio Fukunaga. Natural language description of human activities from video images based on concept hierarchy of actions. IJCV, 50(2):171–184, 2002.
  • [Li et al., 2017] Xuelong Li, Bin Zhao, Xiaoqiang Lu, et al. MAM-RNN: Multi-level attention model based RNN for video captioning. In IJCAI, 2017.
  • [Lin, 2004] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, July 2004.
  • [Liu et al., 2018] Daqing Liu, Zheng-Jun Zha, Hanwang Zhang, Yongdong Zhang, and Feng Wu. Context-aware visual policy network for sequence-level image captioning. In ACM MM, 2018.
  • [Liu et al., 2019] Daqing Liu, Hanwang Zhang, Feng Wu, and Zheng-Jun Zha. Learning to assemble neural module tree networks for visual grounding. In ICCV, 2019.
  • [Pan et al., 2017] Yingwei Pan, Ting Yao, Houqiang Li, and Tao Mei. Video captioning with transferred semantic attributes. In CVPR, 2017.
  • [Papineni et al., 2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In ACL, 2002.
  • [Pei et al., 2019] Wenjie Pei, Jiyuan Zhang, Xiangrong Wang, Lei Ke, Xiaoyong Shen, and Yu-Wing Tai. Memory-attended recurrent network for video captioning. In CVPR, 2019.
  • [Ren et al., 2016] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, 2016.
  • [Russakovsky et al., 2015] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Fei-Fei Li. Imagenet large scale visual recognition challenge. IJCV, 115:211–252, 2015.
  • [Szegedy et al., 2017] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, 2017.
  • [Tian and Oh, 2019] Junjiao Tian and Jean Oh. Image captioning with compositional neural module networks. In AAAI, 2019.
  • [Vedantam et al., 2015] Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. In CVPR, 2015.
  • [Venugopalan et al., 2015] Subhashini Venugopalan, Marcus Rohrbach, Jeffrey Donahue, Raymond Mooney, Trevor Darrell, and Kate Saenko. Sequence to sequence-video to text. In ICCV, 2015.
  • [Wang et al., 2018] Bairui Wang, Lin Ma, Wei Zhang, and Wei Liu. Reconstruction network for video captioning. In CVPR, 2018.
  • [Wang et al., 2019] Bairui Wang, Lin Ma, Wei Zhang, Wenhao Jiang, Jingwen Wang, and Wei Liu. Controllable video captioning with pos sequence guidance based on gated fusion network. In CVPR, 2019.
  • [Xu et al., 2016] Jun Xu, Tao Mei, Ting Yao, and Yong Rui. MSR-VTT: A large video description dataset for bridging video and language. In CVPR, 2016.
  • [Yang et al., 2019a] T. Yang, Z. Zha, and H. Zhang. Making history matter: History-advantage sequence training for visual dialog. In ICCV, 2019.
  • [Yang et al., 2019b] Xu Yang, Hanwang Zhang, and Jianfei Cai. Learning to collocate neural modules for image captioning. In ICCV, 2019.
  • [Yao et al., 2015] Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, and Aaron Courville. Describing videos by exploiting temporal structure. In ICCV, 2015.
  • [Zha et al., 2019] Z. Zha, D. Liu, H. Zhang, Y. Zhang, and F. Wu. Context-aware visual policy network for fine-grained image captioning. T-PAMI, 2019.
  • [Zhang and Peng, 2019] Junchao Zhang and Yuxin Peng. Object-aware aggregation with bidirectional temporal graph for video captioning. In CVPR, 2019.