Explainable and Explicit Visual Reasoning over Scene Graphs

CVPR 2019, pages 8376-8384 (arXiv:1812.01855).

Cited by: 21
Keywords:
X neural modules, module network, image recognition, visual reasoning, speech recognition
Weibo:
We proposed X Neural Modules (XNMs) that allow visual reasoning over scene graphs of varying detection quality

Abstract:

We aim to dismantle the prevalent black-box neural architectures used in complex visual reasoning tasks, into the proposed eXplainable and eXplicit Neural Modules (XNMs), which advance beyond existing neural module networks towards using scene graphs --- objects as nodes and the pairwise relationships as edges --- for explainable and expl...
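To make the scene-graph representation above concrete, below is a minimal PyTorch-style sketch of a graph with objects as nodes and pairwise relationships as edges, together with one node-attention step over it. The names (SceneGraph, AttendNode, node_feats, edge_feats) and the attention form are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch (not the authors' code): a scene graph with N object nodes
# and N x N relationship edges, plus a module that attends over the nodes
# given a query vector (e.g. an embedding of "the red cube").
import torch
import torch.nn as nn

class SceneGraph:
    def __init__(self, node_feats, edge_feats):
        # node_feats: [N, d]    one embedding per detected object
        # edge_feats: [N, N, d] one embedding per ordered object pair
        self.node_feats = node_feats
        self.edge_feats = edge_feats

class AttendNode(nn.Module):
    """Score every object node against a query and return a soft attention."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, graph, query):
        scores = (self.proj(graph.node_feats) * query).sum(dim=-1)  # [N]
        return torch.softmax(scores, dim=0)  # attention weights over objects

# Toy usage: 5 objects with 64-d features and a random query vector
graph = SceneGraph(torch.randn(5, 64), torch.randn(5, 5, 64))
weights = AttendNode(64)(graph, torch.randn(64))
```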

Introduction
  • The prosperity of A.I. — mastering super-human skills in game playing [22], speech recognition [1], and image recognition [8, 20] — is mainly attributed to the “winning streak” of connectionism, the deep neural networks [15], over the “old-school” symbolism, where their controversy can be dated back to the birth of A.I. in the 1950s [18].
  • With massive training data and powerful computing resources, the key advantage of deep neural networks is the end-to-end design that generalizes to a large spectrum.
Highlights
  • The prosperity of A.I. — mastering super-human skills in game playing [22], speech recognition [1], and image recognition [8, 20] — is mainly attributed to the “winning streak” of connectionism, the deep neural networks [15], over the “old-school” symbolism, where their controversy can be dated back to the birth of A.I. in the 1950s [18]
  • With massive training data and powerful computing resources, the key advantage of deep neural networks is the end-to-end design that generalizes to a large spectrum
  • We show qualitative results to demonstrate that reasoning with our X neural modules is highly explainable and explicit
  • A1: When using the ground-truth scene graphs and programs, we can achieve 100% accuracy, indicating an inspiring upper-bound of visual reasoning
  • We follow StackNMN [9] to build the module program in a stacked soft manner, but our model can achieve better performance as our reasoning over scene graphs is more powerful than their pixel-level operations
  • We proposed X neural modules (XNMs) that allow visual reasoning over scene graphs of varying detection quality
Methods
  • Compared to existing neural module networks, XNMs disentangle the “high-level” reasoning from the “low-level” visual perception, and allow them to pay more attention to teaching A.I. how to “think”, regardless of what they “look”.
  • The authors believe that this is an inspiring direction towards explainable machine reasoning.
  • The authors' experimental results suggest that visual reasoning benefits a lot from high-quality scene graphs, revealing the practical significance of the scene graph research
Results
  • Experimental results are listed in Table 2.
  • When using the detected scene graphs, where node embeddings are RoI features that fuse all attribute values, the generalization results on Condition B drop to 72.1%, suffering from the same dataset shortcut as other existing models [13, 17].
  • In the GT setting, where the ground-truth visual labels are given, the authors achieve perfect performance
  • This gap reveals that the challenge of CLEVR-CoGenT mostly comes from the vision bias, rather than the reasoning shortcut.
  • The authors' XNMs are flexible enough for different cases
Conclusion
  • The authors proposed X neural modules (XNMs) that allow visual reasoning over scene graphs of varying detection quality.
Tables
  • Table 1: Our composite modules (the top section) and output modules (the bottom section). MLP() consists of several linear and ReLU layers (a minimal sketch follows this list of tables)
  • Table 2: Comparisons between neural module networks on the CLEVR dataset. Top section: results of the official test set; Bottom section: results of the validation set (we can only evaluate our GT setting on the validation set since the annotations of the test set are not public [12]). The program option “scratch” means no program annotations at all, “supervised” means using a program parser trained end-to-end, and “GT” means using ground-truth programs. Our reasoning modules are composed of highly-reusable X modules, leading to a very small number of parameters. Using the ground-truth scene graphs and programs, we can achieve perfect reasoning on all kinds of questions
  • Table 3: Comparisons between NMNs on CLEVR-CoGenT. Top section: results of the test set; Bottom section: results of the validation set. Using the ground-truth scene graphs, our XNMs generalize very well and do not suffer from shortcuts at all
  • Table 4: Single-model results on the VQAv2.0 validation set and test set. †: values reported in the original papers
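As a rough illustration of the MLP() building block mentioned in the Table 1 caption, here is a minimal sketch; the depth and hidden width are assumptions for illustration, not the paper's exact configuration.

```python
# Minimal sketch of an MLP() block: a few Linear + ReLU layers followed by a
# final Linear projection. Layer count and width are illustrative assumptions.
import torch.nn as nn

def make_mlp(in_dim, hidden_dim, out_dim, num_hidden=2):
    layers, dim = [], in_dim
    for _ in range(num_hidden):
        layers += [nn.Linear(dim, hidden_dim), nn.ReLU()]
        dim = hidden_dim
    layers.append(nn.Linear(dim, out_dim))
    return nn.Sequential(*layers)
```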
Related work
  • Visual Reasoning. It is the process of analyzing visual information and solving problems based on it. The most representative benchmark of visual reasoning is CLEVR [12], a diagnostic visual Q&A dataset for compositional language and elementary visual reasoning. The majority of existing methods on CLEVR can be categorized into two families: 1) holistic approaches [12, 21, 19, 11], which embed both the image and question into a feature space and infer the answer by feature fusion; 2) neural module approaches [3, 10, 13, 17, 9, 26], which first parse the question into a program assembly of neural modules, and then execute the modules over the image features for visual reasoning. Our XNM belongs to the second family but replaces the visual feature input with scene graphs.

    Neural Module Networks. They dismantle a complex question into several sub-tasks, which are easier to answer and whose intermediate outputs are more transparent to follow. Modules are pre-defined neural networks that implement the corresponding functions of the sub-tasks, and they are assembled into a layout dynamically, usually by a sequence-to-sequence program generator given the input question. The assembled program is finally executed for answer prediction [10, 13, 17]. In particular, the program generator is trained on human annotations of the desired layout, or with the help of reinforcement learning due to the non-differentiability of layout selection. Recently, Hu et al. [9] proposed StackNMN, which replaces the hard layout with a soft and continuous module layout and performs well even without any layout annotations. Our XNM experiments on VQAv2.0 follow their soft program generator.
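To illustrate the soft, continuous module layout described above (in the spirit of StackNMN [9]), here is a minimal sketch in which every module is executed at each reasoning step and their outputs are blended by question-conditioned weights. The class names and the simple weighted-average executor are simplifying assumptions, not the exact StackNMN or XNM code.

```python
# Rough sketch of a soft module layout: instead of selecting one module per
# step (non-differentiable), run all modules and average their outputs with
# continuous weights predicted from the question.
import torch
import torch.nn as nn

class SoftExecutor(nn.Module):
    def __init__(self, modules_list, question_dim):
        super().__init__()
        self.mods = nn.ModuleList(modules_list)
        # one weight per module at each step (illustrative prediction head)
        self.layout_head = nn.Linear(question_dim, len(modules_list))

    def step(self, state, graph, question_vec):
        weights = torch.softmax(self.layout_head(question_vec), dim=-1)   # [M]
        outputs = torch.stack([m(state, graph) for m in self.mods], dim=0)
        # blend all module outputs with the soft layout weights
        weights = weights.view(-1, *([1] * (outputs.dim() - 1)))
        return (weights * outputs).sum(dim=0)

# Toy usage with two dummy modules that just scale the reasoning state
class Dummy(nn.Module):
    def __init__(self, scale):
        super().__init__()
        self.scale = scale
    def forward(self, state, graph):
        return self.scale * state

executor = SoftExecutor([Dummy(1.0), Dummy(2.0)], question_dim=32)
new_state = executor.step(torch.randn(5), graph=None, question_vec=torch.randn(32))
```

Because the blend is a differentiable weighted sum, no discrete layout has to be chosen, which is why this style of model can be trained without layout annotations.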
Funding
  • The work is supported by NSFC key projects (U1736204, 61661146007, 61533018), the Ministry of Education and China Mobile Research Fund (No. 20181770250), the THU-NUS NExT Co-Lab, and the Alibaba-NTU JRI
Reference
  • D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen, et al. Deep Speech 2: End-to-end speech recognition in English and Mandarin. In International Conference on Machine Learning, 2016. 1
  • P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, 2018. 8
  • J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Neural module networks. In CVPR, 2016. 2
  • S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh. VQA: Visual question answering. In ICCV, 2015. 1, 2
  • L. Chen, H. Zhang, J. Xiao, X. He, S. Pu, and S.-F. Chang. Scene dynamics: Counterfactual critic multi-agent training for scene graph generation. arXiv preprint arXiv:1812.02347, 2018. 3
  • Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In CVPR, 2017. 1, 2, 3, 8
  • K. He, G. Gkioxari, P. Dollar, and R. Girshick. Mask R-CNN. In ICCV, 2017.
  • K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016. 1
  • R. Hu, J. Andreas, T. Darrell, and K. Saenko. Explainable neural computation via stack neural module networks. In ECCV, 2018. 2, 3, 5, 6, 8
  • R. Hu, J. Andreas, M. Rohrbach, T. Darrell, and K. Saenko. Learning to reason: End-to-end module networks for visual question answering. In ICCV, 2017. 2, 3, 6, 8
  • D. A. Hudson and C. D. Manning. Compositional attention networks for machine reasoning. In ICLR, 2018. 2
  • J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, and R. Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In CVPR, 2017. 1, 2, 3, 5, 6
  • J. Johnson, B. Hariharan, L. van der Maaten, J. Hoffman, L. Fei-Fei, C. L. Zitnick, and R. B. Girshick. Inferring and executing programs for visual reasoning. In ICCV, 2017. 1, 2, 3, 6, 7, 8
  • J. Johnson, R. Krishna, M. Stark, L.-J. Li, D. Shamma, M. Bernstein, and L. Fei-Fei. Image retrieval using scene graphs. In CVPR, 2015. 3
  • Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436-444, 2015. 1
  • Y. Li, W. Ouyang, B. Zhou, J. Shi, C. Zhang, and X. Wang. Factorizable Net: An efficient subgraph-based framework for scene graph generation. In ECCV, 2018. 3
  • D. Mascharka, P. Tran, R. Soklaski, and A. Majumdar. Transparency by design: Closing the gap between performance and interpretability in visual reasoning. In CVPR, 2018. 2, 3, 6, 7, 8
  • M. L. Minsky. Logical versus analogical or symbolic versus connectionist or neat versus scruffy. AI Magazine, 12(2):34, 1991. 1
  • E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville. FiLM: Visual reasoning with a general conditioning layer. In AAAI, 2018. 2
  • S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015. 1
  • A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap. A simple neural network module for relational reasoning. In NIPS, 2017. 2
  • D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484-489, 2016. 1
  • I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014. 5
  • D. Teney, L. Liu, and A. van den Hengel. Graph-structured representations for visual question answering. arXiv preprint, 2017. 3
  • D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei. Scene graph generation by iterative message passing. In CVPR, 2017. 3
  • K. Yi, J. Wu, C. Gan, A. Torralba, P. Kohli, and J. B. Tenenbaum. Neural-symbolic VQA: Disentangling reasoning from vision and language understanding. In NIPS, 2018. 2, 3, 6, 7, 8
  • X. Yin and V. Ordonez. Obj2Text: Generating visually descriptive language from object layouts. In EMNLP, 2017. 3
  • R. Zellers, M. Yatskar, S. Thomson, and Y. Choi. Neural motifs: Scene graph parsing with global context. In CVPR, 2018. 2, 3
  • Y. Zhang, J. Hare, and A. Prugel-Bennett. Learning to count objects in natural images for visual question answering. arXiv preprint arXiv:1802.05766, 2018. 4