Tree-augmented Cross-Modal Encoding for Complex-Query Video Retrieval

SIGIR '20: The 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, China, July 2020, pp. 1339–1348.

Keywords:
cross modal, Median rank, text query, complex query video retrieval, modal encoding

Abstract:

The rapid growth of user-generated videos on the Internet has intensified the need for text-based video retrieval systems. Traditional methods mainly favor the concept-based paradigm for retrieval with simple queries, which are usually ineffective for complex queries that carry far more complex semantics. Recently, the embedding-based paradigm …

Introduction
  • With the exponential growth of user-generated videos on the Internet, searching for videos of interest has become an indispensable activity in people’s daily lives.
  • Existing efforts on video retrieval with complex queries can be roughly categorized into two groups: 1) the concept-based paradigm [18, 24, 25, 31, 41, 52, 53], as shown in Figure 1 (a), and 2) the embedding-based paradigm.
  • The concept-based paradigm usually uses a large set of visual concepts to describe the video content, transforms the text query into a set of primitive concepts, and performs video retrieval by aggregating the matching results from different concepts [53] (a toy example of this pipeline is sketched after this list).
  • Although the embedding-based methods have shown much better performance, treating a query holistically as one dense vector representation may obfuscate the keywords or phrases that carry rich temporal and semantic cues.
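To make the concept-based paradigm above concrete, here is a minimal, self-contained sketch of the idea: the query is mapped onto a small concept vocabulary and videos are ranked by aggregating per-concept detector scores. The vocabulary, the detector scores, and the averaging rule are hypothetical simplifications for illustration, not the pipeline of any specific system cited here.

```python
# Minimal sketch of concept-based video retrieval (hypothetical vocabulary and scores).
CONCEPT_VOCAB = {"dog", "beach", "running", "car", "kitchen"}

def query_to_concepts(query: str) -> set:
    """Map a text query to primitive concepts by simple keyword matching."""
    tokens = {t.strip(".,").lower() for t in query.split()}
    return tokens & CONCEPT_VOCAB

def score_video(query_concepts: set, video_concept_scores: dict) -> float:
    """Aggregate per-concept detector scores for the concepts mentioned in the query."""
    if not query_concepts:
        return 0.0
    return sum(video_concept_scores.get(c, 0.0) for c in query_concepts) / len(query_concepts)

# Example: rank two videos for the query "a dog running on the beach".
videos = {
    "v1": {"dog": 0.9, "beach": 0.8, "running": 0.7},
    "v2": {"car": 0.95, "kitchen": 0.6},
}
q = query_to_concepts("a dog running on the beach")
ranking = sorted(videos, key=lambda v: score_video(q, videos[v]), reverse=True)
print(ranking)  # ['v1', 'v2']
```

Note that the query and the videos are processed independently here, which is exactly the limitation the embedding-based paradigm (and this paper) tries to address.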
Highlights
  • With the exponential growth of user-generated videos on the Internet, searching for videos of interest has become an indispensable activity in people’s daily lives
  • Natural language queries are usually transformed into dense vector representations by Recurrent Neural Networks (RNNs) [34] (e.g., Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks), which are powerful for modeling sequential data (a minimal GRU-based query encoder is sketched after this list)
  • We evaluate our method in the multi-modality setting on the Large Scale Movie Description Challenge (LSMDC) dataset (see Table 2)
  • LSMDC: Table 2 compares the performance of Tree-augmented Cross-modal Encoding (TCE) with most of the reported results on the LSMDC video clip retrieval task
  • We proposed a novel framework for complex-query video retrieval, which consists of a tree-based complex query encoder and a temporal attentive video encoder
  • We conduct extensive experiments on large-scale datasets to demonstrate that our approach can achieve state-of-the-art retrieval performance
  • This work provides a novel direction for complex-query video retrieval by automatically transforming the complex query into an easy-to-interpret structure without any syntactic rules or annotations
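To make the RNN-based query encoding mentioned in the highlights concrete, below is a minimal PyTorch sketch that embeds a tokenized query and encodes it with a bidirectional GRU into one dense vector. The vocabulary size, dimensions, and mean-pooling readout are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class QueryEncoder(nn.Module):
    """Encode a tokenized query into one dense vector with a bidirectional GRU."""

    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, token_ids):          # token_ids: (batch, seq_len)
        x = self.embed(token_ids)          # (batch, seq_len, embed_dim)
        outputs, _ = self.gru(x)           # (batch, seq_len, 2 * hidden_dim)
        return outputs.mean(dim=1)         # mean-pool over time -> (batch, 2 * hidden_dim)

encoder = QueryEncoder()
query = torch.randint(0, 10000, (2, 8))   # two toy queries of 8 token ids each
print(encoder(query).shape)                # torch.Size([2, 1024])
```

Collapsing the whole query into such a single vector is exactly what can obfuscate keywords and phrases, which motivates the tree-based query encoder proposed in the paper.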
Methods
  • On the data split from [27], TCE is compared with Random, CCA [43], MEE [27], MMEN (Caption) [43], and JPoSE [43].
  • The temporal attention module, with a hidden size of dva = 256, aggregates the outputs of the multi-head attention module and produces a video representation of dimension dv = 512.
  • The number of hard negative samples used in Eq. (16) is 5 (generic sketches of temporal attention pooling and hard-negative ranking follow this list).
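The two ingredients in the bullets above can be illustrated with a generic sketch: a temporal attention readout that pools frame features into one video vector, and a margin-based ranking loss that keeps only the hardest in-batch negatives. The shapes, margin, and loss form are assumptions for illustration; this is not a reproduction of the paper's exact temporal attention module or Eq. (16).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttentionPool(nn.Module):
    """Attend over frame features and pool them into a single video vector."""

    def __init__(self, feat_dim=512, hidden_dim=256):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.Tanh(),
                                   nn.Linear(hidden_dim, 1))

    def forward(self, frames):                       # frames: (batch, n_frames, feat_dim)
        weights = self.score(frames).softmax(dim=1)  # (batch, n_frames, 1)
        return (weights * frames).sum(dim=1)         # (batch, feat_dim)

def hard_negative_triplet_loss(q, v, margin=0.2, n_hard=5):
    """Margin ranking loss that keeps only the hardest in-batch negatives per query."""
    q, v = F.normalize(q, dim=1), F.normalize(v, dim=1)
    sim = q @ v.t()                                    # (batch, batch) cosine similarities
    pos = sim.diag().unsqueeze(1)                      # matched pairs lie on the diagonal
    neg = sim.masked_fill(torch.eye(len(q), dtype=torch.bool), float('-inf'))
    hardest = neg.topk(n_hard, dim=1).values           # top-k hardest negatives per query
    return F.relu(margin + hardest - pos).mean()

pool = TemporalAttentionPool()
video_vec = pool(torch.randn(8, 20, 512))              # 8 videos, 20 frames each
loss = hard_negative_triplet_loss(torch.randn(8, 512), video_vec)
print(video_vec.shape, loss.item())
```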
Results
  • Experimental Results and Analysis

    4.2.1 Comparison with State-of-the-Arts. To answer research question R1, the authors compare the proposed Tree-augmented Cross-modal Encoding (TCE) with recently proposed state-of-the-art methods: (1) RNN-based methods: DualEncoding [7], Kaufman et al. [14], CT-SAN [51], SNUVL [50], C+LSTM+SA+FC7 [40], and VSE-LSTM [15]; (2) multimodal fusion methods: Mithun et al. [30], MEE [27], MMEN [43], and JPoSE [43]; and (3) other methods: JSFusion [49], CCA (FV HGLMM) [16], and Miech et al. [26].
  • MSR-VTT: Table 1 clearly shows that the proposed TCE outperforms all other available methods in all three dataset splits.
  • DualEncoding, the best previously reported method on the first split, fuses multi-level textual and video features for joint embedding learning with an embedding size of 2048.
  • For the second split [27], the authors observe a large improvement over the state-of-the-art JPoSE, which disentangles the text query into multiple semantic spaces (verb, noun) for score-level fusion (a minimal illustration of score-level fusion follows this list).
  • Since multi-modal fusion is not the focus of this paper, the authors leave the fusion experiments on MSR-VTT for future study.
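For context on the score-level fusion mentioned above (as used by JPoSE-style methods that embed queries into separate semantic spaces), the sketch below simply computes one similarity per embedding space and combines them with a weighted sum. The "verb"/"noun" spaces, dimensions, and equal weights are hypothetical.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def fused_score(query_embs, video_embs, weights):
    """Score-level fusion: compute a similarity per embedding space, then combine."""
    return sum(w * cosine(query_embs[k], video_embs[k]) for k, w in weights.items())

# Hypothetical "verb" and "noun" spaces with equal fusion weights.
rng = np.random.default_rng(0)
q = {"verb": rng.normal(size=64), "noun": rng.normal(size=64)}
v = {"verb": rng.normal(size=64), "noun": rng.normal(size=64)}
print(fused_score(q, v, {"verb": 0.5, "noun": 0.5}))
```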
Conclusion
  • The authors proposed a novel framework for complex-query video retrieval, which consists of a tree-based complex query encoder and a temporal attentive video encoder
  • The framework first automatically composes a latent semantic tree from the query words, based on a memory-augmented node scoring and selection strategy, and then encodes the tree into a structure-aware query representation with an attention mechanism (a simplified sketch of such bottom-up tree composition follows this list).
  • In future work, the authors are interested in exploring external knowledge to enhance text representation learning and tree construction [2, 3].
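As a rough illustration of bottom-up tree composition for a query (in the spirit of latent tree learning such as [4]), the sketch below greedily merges the highest-scoring adjacent pair of nodes until a single root remains and uses the root as the query vector. The plain linear scorer is a hypothetical stand-in for the paper's memory-augmented node scoring and selection strategy, and the attention-based tree encoding is not included.

```python
import torch
import torch.nn as nn

class GreedyTreeComposer(nn.Module):
    """Compose a binary tree over word vectors by repeatedly merging the
    highest-scoring adjacent pair; the root serves as the query representation."""

    def __init__(self, dim=300):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)      # merge score for an adjacent (left, right) pair
        self.compose = nn.Linear(2 * dim, dim)  # parent vector from the concatenated pair

    def forward(self, words):                   # words: list of (1, dim) tensors, one per token
        nodes = list(words)
        while len(nodes) > 1:
            pairs = [torch.cat([nodes[i], nodes[i + 1]], dim=1)
                     for i in range(len(nodes) - 1)]
            scores = torch.cat([self.score(p) for p in pairs], dim=0)  # (n_pairs, 1)
            best = int(scores.argmax())                                # greedily pick the best pair
            parent = torch.tanh(self.compose(pairs[best]))             # (1, dim)
            nodes[best:best + 2] = [parent]                            # replace the pair with its parent
        return nodes[0]                                                # structure-aware query vector

composer = GreedyTreeComposer()
query_words = [torch.randn(1, 300) for _ in range(6)]  # e.g. embeddings of a 6-word query
print(composer(query_words).shape)                     # torch.Size([1, 300])
```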
Summary
  • Objectives:

    This paper aims to model complex queries in a more flexible structure to facilitate the joint learning of the representations of the queries and videos in a unified framework.
  • The authors aim to answer the following research questions via extensive experiments: (1) R1: How does the proposed method perform compared with state-of-the-art methods? (2) R2: What are the impacts of different components on the overall performance of the approach? (3) R3: How does the proposed method perform on different types of complex queries? Can the latent semantic tree help to better understand the complex query and drive stronger query representation?
Tables
  • Table 1: State-of-the-art performance comparison (%) on MSR-VTT with different dataset splits. Note that in this experiment TCE uses a bidirectional GRU and LSTM with 1024-D query and video embeddings for better performance
  • Table 2: State-of-the-art performance comparison (%) on LSMDC [33]. Our TCE performs the best with a much lower-dimensional embedding (512-D). Mot. and Aud. refer to the motion feature and audio feature, respectively
  • Table 3: Ablation studies on the MSR-VTT dataset using the standard dataset split [44] to investigate the effects of the tree-based query encoder and the temporal-attentive video encoder. The proposed method performs the best
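For reference, the retrieval numbers reported in the tables above are typically Recall@K (%) and Median rank computed from a query-by-video similarity matrix. The snippet below shows one standard way to compute them, assuming the ground-truth video for query i is video i; it is independent of any particular model.

```python
import numpy as np

def retrieval_metrics(sim):
    """Recall@{1,5,10} (%) and Median rank from an (n_queries, n_videos) similarity
    matrix where the ground-truth video for query i is video i."""
    order = np.argsort(-sim, axis=1)                     # videos sorted by score per query
    ranks = np.array([np.where(order[i] == i)[0][0] + 1  # 1-based rank of the true video
                      for i in range(sim.shape[0])])
    recall = {k: 100.0 * np.mean(ranks <= k) for k in (1, 5, 10)}
    return recall, float(np.median(ranks))

sim = np.random.default_rng(0).normal(size=(100, 100))   # toy similarity scores
print(retrieval_metrics(sim))
```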
Related work
  • In this section, we briefly introduce two representative research directions in text-based video retrieval: concept-based methods and embedding-based methods.

    Concept-based methods [18, 24, 25, 31, 41] mainly rely on establishing cross-modal associations via concepts [12]. Markatopoulou et al. [24, 25] first utilized relatively complex linguistic rules to extract relevant concepts from a given query and used pre-trained CNNs to detect the objects and scenes in video frames. The similarity between a given query and a specific video is then measured by concept matching. Ueki et al. [41] relied on a much larger concept vocabulary; in addition to pre-trained CNNs, they trained SVM-based classifiers to automatically annotate the videos. Snoek et al. [38] trained a more elegant model, called VideoStory, from freely available web videos to annotate videos, while still representing the textual query by heuristically selecting concepts based on part-of-speech tagging. Despite the promising performance, concept-based methods still face many challenges, e.g., how to specify a set of concepts and how to extract relevant concepts from both textual queries and videos. Moreover, the extraction of concepts from videos and from textual queries is usually treated independently, which makes it suboptimal for exploring the relations between the two modalities. In contrast, our method is concept-free and jointly learns the representations of textual queries and videos.
Funding
  • This research is supported by The National Key Research and Development Program of China under grant 2018YFB0804205, the National Research Foundation, Singapore under its International Research Centres in Singapore Funding Initiative, the National Natural Science Foundation of China under grant 61902347, and the Zhejiang Provincial Natural Science Foundation under grant LQ19F020002
Reference
  • Da Cao, Zhiwang Yu, Hanling Zhang, Jiansheng Fang, Liqiang Nie, and Qi Tian. 2019. Video-Based Cross-Modal Recipe Retrieval. In MM. ACM, 1685–1693.
  • Yixin Cao, Lei Hou, Juanzi Li, Zhiyuan Liu, Chengjiang Li, Xu Chen, and Tiansi Dong. 2018. Joint Representation Learning of Cross-lingual Words and Entities via Attentive Distant Supervision. In EMNLP. 227–237.
  • Yixin Cao, Lifu Huang, Heng Ji, Xu Chen, and Juanzi Li. 2017. Bridge text and knowledge by learning multi-prototype entity mention embedding. In ACL. 1623–1633.
  • Jihun Choi, Kang Min Yoo, and Sang-goo Lee. 2018. Learning to compose task-specific tree structures. In AAAI.
  • Jasmine Collins, Jascha Sohl-Dickstein, and David Sussillo. 2016. Capacity and trainability in recurrent neural networks. In ICLR.
  • Jianfeng Dong, Xirong Li, and Cees G. M. Snoek. 2018. Predicting visual features from text for image and video caption retrieval. IEEE Transactions on Multimedia 20, 12 (2018), 3377–3388.
  • Jianfeng Dong, Xirong Li, Chaoxi Xu, Shouling Ji, Yuan He, Gang Yang, and Xun Wang. 2019. Dual Encoding for Zero-Example Video Retrieval. In CVPR. IEEE, 9346–9355.
  • Jiatao Gu, Daniel Jiwoong Im, and Victor O. K. Li. 2018. Neural machine translation with gumbel-greedy decoding. In AAAI.
  • Xiaoxiao Guo, Hui Wu, Yu Cheng, Steven Rennie, Gerald Tesauro, and Rogerio Feris. 2018. Dialog-based interactive image retrieval. In NeurIPS. 678–688.
  • K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep Residual Learning for Image Recognition. In CVPR. IEEE, 770–778.
  • Richang Hong, Lei Li, Junjie Cai, Dapeng Tao, Meng Wang, and Qi Tian. 2017. Coherent semantic-visual indexing for large-scale image retrieval in the cloud. IEEE Transactions on Image Processing 26, 9 (2017), 4128–4138.
  • Richang Hong, Yang Yang, Meng Wang, and Xian-Sheng Hua. 2015. Learning visual semantic relationships for efficient visual retrieval. IEEE Transactions on Big Data 1, 4 (2015), 152–161.
  • Sergey Ioffe and Christian Szegedy. 2015. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In ICML. 448–456.
  • Dotan Kaufman, Gil Levi, Tal Hassner, and Lior Wolf. 2017. Temporal tessellation: A unified approach for video analysis. In ICCV. IEEE, 94–104.
  • Ryan Kiros, Ruslan Salakhutdinov, and Richard S. Zemel. 2014. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539 (2014).
  • Benjamin Klein, Guy Lev, Gil Sadeh, and Lior Wolf. 2015. Associating neural word embeddings with deep image representations using Fisher vectors. In CVPR. IEEE, 4437–4446.
  • A. Krizhevsky, I. Sutskever, and G. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In NeurIPS. 1097–1105.
  • D. Le, S. Phan, V.-T. Nguyen, B. Renoust, T. A. Nguyen, V.-N. Hoang, T. D. Ngo, M.-T. Tran, Y. Watanabe, M. Klinkigt, et al. 2016. NII-HITACHI-UIT at TRECVID 2016. In TRECVID Workshop.
  • Shuang Li, Tong Xiao, Hongsheng Li, Bolei Zhou, Dayu Yue, and Xiaogang Wang. 2017. Person search with natural language description. In CVPR. IEEE, 1970–1979.
  • Xirong Li. 2019. Deep Learning for Video Retrieval by Natural Language. In Proceedings of the 1st International Workshop on Fairness, Accountability, and Transparency in MultiMedia. 2–3.
  • Xirong Li, Chaoxi Xu, Gang Yang, Zhineng Chen, and Jianfeng Dong. 2019. W2VV++: fully deep learning for ad-hoc video search. In MM. ACM, 1786–1794.
  • Dahua Lin, Sanja Fidler, Chen Kong, and Raquel Urtasun. 2014. Visual semantic search: Retrieving videos via complex textual queries. In CVPR. IEEE, 2657–2664.
  • Meng Liu, Xiang Wang, Liqiang Nie, Qi Tian, Baoquan Chen, and Tat-Seng Chua. 2018. Cross-modal moment localization in videos. In MM. ACM, 843–851.
  • F. Markatopoulou, D. Galanopoulos, V. Mezaris, and I. Patras. 2017. Query and Keyframe Representations for Ad-hoc Video Search. In ICMR. ACM, 407–411.
  • F. Markatopoulou, A. Moumtzidou, D. Galanopoulos, T. Mironidis, V. Kaltsa, A. Ioannidou, S. Symeonidis, K. Avgerinakis, S. Andreadis, et al. 2016. ITI-CERTH Participation in TRECVID 2016. In TRECVID Workshop.
  • Antoine Miech, Jean-Baptiste Alayrac, Piotr Bojanowski, Ivan Laptev, and Josef Sivic. 2017. Learning from video and text via large-scale discriminative clustering. In ICCV. IEEE, 5257–5266.
  • Antoine Miech, Ivan Laptev, and Josef Sivic. 2018. Learning a text-video embedding from incomplete and heterogeneous data. arXiv preprint arXiv:1804.02516 (2018).
  • Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. 2019. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips. In ICCV. IEEE, 2630–2640.
  • T. Mikolov, K. Chen, G. Corrado, and J. Dean. 2013. Efficient Estimation of Word Representations in Vector Space. In ICLR.
  • N. C. Mithun, J. Li, F. Metze, and A. K. Roy-Chowdhury. 2018. Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval. In ICMR. ACM, 19–27.
  • P. A. Nguyen, Q. Li, Z.-Q. Cheng, Y.-J. Lu, H. Zhang, X. Wu, and C.-W. Ngo. 2017. VIREO @ TRECVID 2017: Video-to-Text, Ad-hoc Video Search and Video Hyperlinking. In TRECVID Workshop.
  • K. Niu, Y. Huang, W. Ouyang, and L. Wang. 2020. Improving Description-Based Person Re-Identification by Multi-Granularity Image-Text Alignments. IEEE Transactions on Image Processing 29 (2020), 5542–5556.
  • Anna Rohrbach, Marcus Rohrbach, Niket Tandon, and Bernt Schiele. 2015. A dataset for movie description. In CVPR. IEEE, 3202–3212.
  • Haşim Sak, Andrew Senior, and Françoise Beaufays. 2014. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Fifteenth Annual Conference of the International Speech Communication Association.
  • Dian Shao, Yu Xiong, Yue Zhao, Qingqiu Huang, Yu Qiao, and Dahua Lin. 2018. Find and Focus: Retrieve and Localize Video Events with Natural Language Queries. In ECCV. 200–216.
  • Haoyue Shi, Jiayuan Mao, Kevin Gimpel, and Karen Livescu. 2019. Visually Grounded Neural Syntax Acquisition. In ACL.
  • Cees G. M. Snoek, Marcel Worring, et al. 2009. Concept-based video retrieval. Foundations and Trends in Information Retrieval 2, 4 (2009), 215–322.
  • C. G. M. Snoek, X. Li, C. Xu, and D. C. Koelma. 2017. University of Amsterdam and Renmin University at TRECVID 2017: Searching Video, Detecting Events and Describing Video. In Proceedings of TRECVID 2017.
  • Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks. In ACL. 1556–1566.
  • Atousa Torabi, Niket Tandon, and Leonid Sigal. 2016. Learning language-visual embedding for movie understanding with natural-language. arXiv preprint arXiv:1609.08124 (2016).
  • K. Ueki, K. Hirakawa, K. Kikuchi, T. Ogawa, and T. Kobayashi. 2017. Waseda_Meisei at TRECVID 2017: Ad-hoc Video Search. In TRECVID Workshop.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NeurIPS. 5998–6008.
  • Michael Wray, Diane Larlus, Gabriela Csurka, and Dima Damen. 2019. Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings. In ICCV. IEEE, 450–459.
  • J. Xu, T. Mei, T. Yao, and Y. Rui. 2016. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. In CVPR. IEEE, 5288–5296.
  • R. Xu, C. Xiong, W. Chen, and J. J. Corso. 2015. Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework. In AAAI.
  • Xun Yang, Meng Wang, Richang Hong, Qi Tian, and Yong Rui. 2017. Enhancing person re-identification in a self-trained subspace. ACM Transactions on Multimedia Computing, Communications, and Applications 13, 3 (2017), 1–23.
  • Xun Yang, Meng Wang, and Dacheng Tao. 2018. Person re-identification with metric learning using privileged information. IEEE Transactions on Image Processing 27, 2 (2018), 791–805.
  • Dani Yogatama, Yishu Miao, Gabor Melis, Wang Ling, Adhiguna Kuncoro, Chris Dyer, and Phil Blunsom. 2018. Memory architectures in recurrent neural network language models. In ICLR.
  • Youngjae Yu, Jongseok Kim, and Gunhee Kim. 2018. A joint sequence fusion model for video question answering and retrieval. In ECCV. 471–487.
  • Youngjae Yu, Hyungjin Ko, Jongwook Choi, and Gunhee Kim. 2016. Video Captioning and Retrieval Models with Semantic Attention. arXiv preprint arXiv:1610.02947 (2016).
  • Youngjae Yu, Hyungjin Ko, Jongwook Choi, and Gunhee Kim. 2017. End-to-end concept word detection for video captioning, retrieval, and question answering. In CVPR. IEEE, 3165–3173.
  • Jin Yuan, Zheng-Jun Zha, Yan-Tao Zheng, Meng Wang, Xiangdong Zhou, and Tat-Seng Chua. 2011. Learning concept bundles for video search with complex queries. In MM. ACM, 453–462.
  • Jin Yuan, Zheng-Jun Zha, Yan-Tao Zheng, Meng Wang, Xiangdong Zhou, and Tat-Seng Chua. 2011. Utilizing related samples to enhance interactive concept-based video search. IEEE Transactions on Multimedia 13, 6 (2011), 1343–1355.