Long-term Recurrent Convolutional Networks for Visual Recognition and Description

IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 39, Issue 4, 2017, Pages 677–691.


Abstract:

Models based on deep convolutional networks have dominated recent image interpretation tasks; we investigate whether models which are also recurrent are effective for tasks involving sequences, visual and otherwise. We describe a class of recurrent convolutional architectures which is end-to-end trainable and suitable for large-scale visu...

Introduction
  • Recognition and description of images and videos is a fundamental challenge of computer vision.
  • CNN models for video processing have successfully considered learning of 3-D spatio-temporal filters over raw sequence data [13, 2], and learning of frame-to-frame representations which incorporate instantaneous optic flow or trajectory-based models aggregated over fixed windows or video shot segments [16, 33].
  • Such models explore two extrema of perceptual time-series representation learning: either learn a fully general time-varying weighting, or apply simple temporal pooling.
Highlights
  • Recognition and description of images and videos is a fundamental challenge of computer vision
  • We show here that long-term recurrent convolutional models are generally applicable to visual time-series modeling; we argue that in visual tasks where static or flat temporal models have previously been employed, long-term Recurrent Neural Networks can provide significant improvement when ample training data are available to learn or refine the representation
  • Traditional Recurrent Neural Networks (Figure 2, left) can learn complex temporal dynamics by mapping input sequences to a sequence of hidden states, and hidden states to outputs, via the recurrence equations $h_t = g(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$ and $z_t = g(W_{hz} h_t + b_z)$, where $g$ is an element-wise non-linearity such as a sigmoid or hyperbolic tangent, $x_t$ is the input, $h_t \in \mathbb{R}^N$ is the hidden state with $N$ hidden units, and $z_t$ is the output at time $t$ (a minimal code sketch of this recurrence appears after this list)
  • We’ve presented long-term recurrent convolutional networks, a class of models that is both spatially and temporally deep, and has the flexibility to be applied to a variety of vision tasks involving sequential inputs and outputs
  • Our results consistently demonstrate that by learning sequential dynamics with a deep sequence model, we can improve on previous methods which learn a deep hierarchy of parameters only in the visual domain, and on methods which take a fixed visual representation of the input and only learn the dynamics of the output sequence
  • The ease with which these tools can be incorporated into existing visual recognition pipelines makes them a natural choice for perceptual problems with time-varying visual input or sequential outputs, which these methods are able to produce with little input preprocessing and no hand-designed features
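As a concrete illustration of the recurrence quoted above, the following is a minimal NumPy sketch, not the paper's Caffe implementation: a plain RNN step implementing the two equations, plus an LRCN-style loop that runs one (hypothetical) per-frame CNN feature vector through the recurrence and averages the per-step outputs into a clip-level score. Note that the actual LRCN models use LSTM units rather than this vanilla RNN, and the dimensions, feature stand-ins, and averaging rule here are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class VanillaRNN:
    """Plain RNN cell: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h),
    z_t = sigmoid(W_hz h_t + b_z)."""

    def __init__(self, input_dim, hidden_dim, output_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W_xh = 0.01 * rng.standard_normal((hidden_dim, input_dim))
        self.W_hh = 0.01 * rng.standard_normal((hidden_dim, hidden_dim))
        self.W_hz = 0.01 * rng.standard_normal((output_dim, hidden_dim))
        self.b_h = np.zeros(hidden_dim)
        self.b_z = np.zeros(output_dim)

    def step(self, x_t, h_prev):
        h_t = np.tanh(self.W_xh @ x_t + self.W_hh @ h_prev + self.b_h)
        z_t = sigmoid(self.W_hz @ h_t + self.b_z)
        return h_t, z_t

def lrcn_clip_scores(frame_features, rnn):
    """LRCN-style use: one CNN feature vector per frame drives the recurrence,
    and the per-step outputs are averaged into a single clip-level score vector."""
    h = np.zeros(rnn.b_h.shape[0])
    outputs = []
    for x_t in frame_features:
        h, z = rnn.step(x_t, h)
        outputs.append(z)
    return np.mean(outputs, axis=0)

# Toy usage: random vectors stand in for per-frame CNN activations (e.g. fc6-like
# features); T frames per clip, feature size D and 101 classes are assumptions.
T, D = 16, 4096
rnn = VanillaRNN(input_dim=D, hidden_dim=256, output_dim=101)
clip_scores = lrcn_clip_scores(np.random.randn(T, D), rnn)
print(clip_scores.shape)  # (101,)
```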
Results
  • The authors evaluate the image description model for retrieval and generation tasks. The authors first show the effectiveness of the model by quantitatively evaluating it on the image and caption retrieval tasks proposed by [26] and seen in [25, 15, 36, 8, 18].
  • The results show that (1) the LSTM outperforms an SMT-based approach to video description; (2) the simpler decoder architectures (b) and (c) achieve better performance than (a), likely because the input does not need to be memorized; and (3) the approach achieves 28.8%, clearly outperforming the best reported number of 26.9% on TACoS multilevel by [29] (a toy sketch of a generation-time decoding loop follows this list).
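To make the generation setting above more concrete, here is a toy sketch of the kind of greedy decoding loop a recurrent sentence decoder typically uses at test time (beam search is a common alternative). Everything here, the step function, embedding table, vocabulary, and dimensions, is a hypothetical stand-in, not the paper's model.

```python
import numpy as np

def greedy_decode(image_feature, step_fn, embed, vocab, bos_id, eos_id, max_len=20):
    """Greedy decoding sketch: feed the previous word embedding together with the
    image feature into a recurrent step function, take the arg-max word, and stop
    at the end-of-sentence token."""
    state = np.zeros(256)                    # hidden-state size is an assumption
    word_id, caption = bos_id, []
    for _ in range(max_len):
        x_t = np.concatenate([embed[word_id], image_feature])
        state, scores = step_fn(x_t, state)  # hypothetical LSTM-like step
        word_id = int(np.argmax(scores))
        if word_id == eos_id:
            break
        caption.append(vocab[word_id])
    return " ".join(caption)

# Toy stand-ins so the sketch runs end to end; none of this is the paper's model.
vocab = ["<bos>", "<eos>", "a", "person", "cooking"]
embed = np.random.randn(len(vocab), 32)            # word embeddings
W = np.random.randn(len(vocab), 32 + 128 + 256)    # next-word score projection

def toy_step(x_t, h):
    h_new = np.tanh(h)                             # placeholder state update
    return h_new, W @ np.concatenate([x_t, h_new])

print(greedy_decode(np.random.randn(128), toy_step, embed, vocab, bos_id=0, eos_id=1))
```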
Conclusion
  • The authors have presented LRCN, a class of models that is both spatially and temporally deep, and has the flexibility to be applied to a variety of vision tasks involving sequential inputs and outputs.
  • The authors' results consistently demonstrate that by learning sequential dynamics with a deep sequence model, the authors can improve on previous methods which learn a deep hierarchy of parameters only in the visual domain, and on methods which take a fixed visual representation of the input and only learn the dynamics of the output sequence.
  • The ease with which these tools can be incorporated into existing visual recognition pipelines makes them a natural choice for perceptual problems with time-varying visual input or sequential outputs, which these methods are able to produce with little input preprocessing and no hand-designed features.
Tables
  • Table 1: Activity recognition: Comparing single-frame models to LRCN networks for activity recognition on the UCF-101 [37] dataset, with both RGB and flow inputs. Values for split 1 as well as the average across all three splits are shown. Our LRCN model consistently and strongly outperforms a model based on predictions from the underlying convolutional network architecture alone. On split 1, we show that placing the LSTM on fc6 performs better than fc7.
  • Table 2: Image description: retrieval results for the Flickr30k [28] and COCO 2014 [24] datasets. R@K is the average recall at rank K (high is good); Medr is the median rank (low is good); a toy sketch of these metrics appears after this table list. Note that [18] achieves better retrieval performance using a stronger CNN architecture; see text.
  • Table 3: Image description: Sentence generation results (BLEU scores (%); ours are adjusted with the brevity penalty) for the Flickr30k [28] and COCO 2014 [24] test sets.
  • Table 4: Image description: Human evaluator rankings from 1 to 6 (low is good) averaged for each method and criterion. We evaluated on 785 Flickr images selected by the authors of [18] for the purposes of comparison against this similar contemporary approach.
  • Table 5: Video description: Results on detailed description of TACoS multilevel [29], in %; see Section C.3 for details.
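For quick reference, the R@K and Medr numbers in Table 2 can be computed from the rank of the ground-truth item for each query, as sketched below under the conventional definitions; this is an illustrative helper, not code from the paper.

```python
import numpy as np

def retrieval_metrics(ranks, ks=(1, 5, 10)):
    """Recall@K (in %) and median rank from the 1-indexed rank of the
    correct item for each query."""
    ranks = np.asarray(ranks)
    recall_at_k = {k: float(np.mean(ranks <= k)) * 100.0 for k in ks}
    med_r = float(np.median(ranks))
    return recall_at_k, med_r

# Toy example: ranks of the ground-truth caption for five query images.
print(retrieval_metrics([1, 3, 7, 2, 40]))
# -> ({1: 20.0, 5: 60.0, 10: 80.0}, 3.0)
```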
Funding
  • This work was supported in part by DARPA’s MSEE and SMISC programs, NSF awards IIS-1427425 and IIS-1212798, and the Berkeley Vision and Learning Center.
  • The GPUs used for this research were donated by the NVIDIA Corporation. Marcus Rohrbach was supported by a fellowship within the FITweltweit-Program of the German Academic Exchange Service (DAAD)
References
  • M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt. Action classification in soccer videos with long short-term memory recurrent neural networks. In ICANN, 2010. 4, 5
  • M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt. Sequential deep learning for human action recognition. In Human Behavior Understanding, 2011. 2, 4, 5
  • A. Barbu, A. Bridge, Z. Burchill, D. Coroian, S. Dickinson, S. Fidler, A. Michaux, S. Mussman, S. Narayanaswamy, D. Salvi, L. Schmidt, J. Shangguan, J. M. Siskind, J. Waggoner, S. Wang, J. Wei, Y. Yin, and Z. Zhang. Video in sentences out. In UAI, 2012. 7
  • T. Brox, A. Bruhn, N. Papenberg, and J. Weickert. High accuracy optical flow estimation based on a theory for warping. In ECCV, 2005.
  • K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014. 2, 3, 7
  • P. Das, C. Xu, R. Doell, and J. Corso. Thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching. In CVPR, 2013. 7
  • J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009. 5
  • A. Frome, G. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov. DeViSE: A deep visual-semantic embedding model. In NIPS, 2013. 6, 7
  • A. Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013. 3
  • A. Graves and N. Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In ICML, 2014. 2, 3
  • S. Guadarrama, N. Krishnamoorthy, G. Malkarnenkar, S. Venugopalan, R. Mooney, T. Darrell, and K. Saenko. YouTube2Text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In ICCV, 2013. 7
  • S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 1997. 2, 3
  • S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):221–231, 2013. 2, 4
  • Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM MM, 2014. 2, 5, 6
  • A. Karpathy, A. Joulin, and L. Fei-Fei. Deep fragment embeddings for bidirectional image sentence mapping. In NIPS, 2014. 6, 7
  • A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014. 2, 4, 5, 12
  • M. U. G. Khan, L. Zhang, and Y. Gotoh. Human focused video description. In Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCV Workshops), 2011. 7
  • R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014. 6, 7
  • R. Kiros, R. Salakhutdinov, and R. Zemel. Multimodal neural language models. In ICML, 2014. 6, 15
  • R. Kiros, R. Zemel, and R. Salakhutdinov. Multimodal neural language models. In Proc. NIPS Deep Learning Workshop, 2013. 6
  • P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst. Moses: Open source toolkit for statistical machine translation. In ACL, 2007. 7, 8
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012. 4, 5, 6
  • P. Kuznetsova, V. Ordonez, T. L. Berg, and Y. Choi. TreeTalk: Composition and compression of trees for image descriptions. Transactions of the Association for Computational Linguistics, 2(10):351–362, 2014. 7
  • T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. arXiv preprint arXiv:1405.0312, 2014. 6, 7, 13, 16
  • J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille. Explain images with multimodal recurrent neural networks. arXiv preprint arXiv:1410.1090, 2014. 6, 7
  • M. Hodosh, P. Young, and J. Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. JAIR, 47:853–899, 2013. 6
  • K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: a method for automatic evaluation of machine translation. In ACL, 2002. 6
  • P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL, 2:67–78, 2014. 6, 7
  • A. Rohrbach, M. Rohrbach, W. Qiu, A. Friedrich, M. Pinkal, and B. Schiele. Coherent multi-sentence video description with variable level of detail. In GCPR, 2014. 8, 18, 20
  • M. Rohrbach, W. Qiu, I. Titov, S. Thater, M. Pinkal, and B. Schiele. Translating video content to natural language descriptions. In ICCV, 2013. 2, 7, 8
  • D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. Technical report, DTIC Document, 1985. 2
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge, 2014. 5, 6
  • K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. arXiv preprint arXiv:1406.2199, 2014. 2, 4, 5, 12
  • K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. 4
  • K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015. 6
  • R. Socher, Q. Le, C. Manning, and A. Ng. Grounded compositional semantics for finding and describing images with sentences. In NIPS Deep Learning Workshop, 2013. 6, 7
  • K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012. 5, 6, 14
  • I. Sutskever, J. Martens, and G. E. Hinton. Generating text with recurrent neural networks. In ICML, 2011. 2
  • I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014. 2, 3, 7
  • C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014. 4
  • C. C. Tan, Y.-G. Jiang, and C.-W. Ngo. Towards textually describing complex video contents with audio-visual concept classifiers. In ACM MM, 2011. 7
  • J. Thomason, S. Venugopalan, S. Guadarrama, K. Saenko, and R. J. Mooney. Integrating language and vision to generate natural language descriptions of videos in the wild. In COLING, 2014. 7
  • O. Vinyals, S. V. Ravuri, and D. Povey. Revisiting recurrent neural networks for robust ASR. In ICASSP, 2012. 2
  • H. Wang, A. Kläser, C. Schmid, and C. Liu. Dense trajectories and motion boundary descriptors for action recognition. IJCV, 2013. 8
  • R. J. Williams and D. Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1989. 2
  • W. Zaremba and I. Sutskever. Learning to execute. arXiv preprint arXiv:1410.4615, 2014. 3
  • W. Zaremba, I. Sutskever, and O. Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014. 2, 4
  • M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014. 5