Cross-Modal and Hierarchical Modeling of Video and Text
COMPUTER VISION - ECCV 2018, PT XIII(2018)
Abstract
Visual data and text data are composed of information at multiple granularities. A video can describe a complex scene that is composed of multiple clips or shots, each of which depicts a semantically coherent event or action. Similarly, a paragraph may contain sentences with different topics, which collectively convey a coherent message or story. In this paper, we investigate modeling techniques for such hierarchical sequential data where there are correspondences across multiple modalities. Specifically, we introduce hierarchical sequence embedding (HSE), a generic model for embedding sequential data of different modalities into hierarchically semantic spaces, with either explicit or implicit correspondence information. We perform empirical studies on large-scale video and paragraph retrieval datasets and demonstrate superior performance by the proposed methods. Furthermore, we examine the effectiveness of our learned embeddings when applied to downstream tasks. We show their utility in zero-shot action recognition and video captioning.
Keywords
Hierarchical sequence embedding, Video text retrieval, Video description generation, Action recognition, Zero-shot transfer