Unifying Latent and Lexicon Representations for Effective Video-Text Retrieval
CoRR (2024)
Abstract
In video-text retrieval, most existing methods adopt the dual-encoder
architecture for fast retrieval, which employs two individual encoders to
extract global latent representations for videos and texts. However, they face
challenges in capturing fine-grained semantic concepts. In this work, we
propose the UNIFY framework, which learns lexicon representations to capture
fine-grained semantics and combines the strengths of latent and lexicon
representations for video-text retrieval. Specifically, we map videos and texts
into a pre-defined lexicon space, where each dimension corresponds to a
semantic concept. A two-stage semantics grounding approach is proposed to
activate semantically relevant dimensions and suppress irrelevant dimensions.
The learned lexicon representations can thus reflect fine-grained semantics of
videos and texts. Furthermore, to leverage the complementarity between latent
and lexicon representations, we propose a unified learning scheme to facilitate
mutual learning via structure sharing and self-distillation. Experimental
results show our UNIFY framework largely outperforms previous video-text
retrieval methods, with 4.8% and 8.2% Recall@1 improvements on MSR-VTT and
DiDeMo respectively.
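The abstract describes mapping videos and texts into a lexicon space where each dimension is a semantic concept, suppressing irrelevant dimensions, and then combining lexicon similarity with latent similarity. The sketch below is an illustrative NumPy mock-up of that idea, not the paper's implementation: the pooling, the log-saturating activation, and the fusion weight `alpha` are all assumptions for demonstration.

```python
import numpy as np

def lexicon_repr(token_logits):
    """Hypothetical lexicon mapping: per-token logits over a vocabulary-sized
    lexicon are max-pooled across tokens (or frames), then passed through a
    saturating log(1 + ReLU(.)) so relevant dimensions stay activated and
    irrelevant (negative-logit) dimensions are suppressed to zero."""
    pooled = token_logits.max(axis=0)          # (vocab,): strongest evidence per concept
    return np.log1p(np.maximum(pooled, 0.0))   # zero out irrelevant dims, dampen large ones

def fused_similarity(v_lat, t_lat, v_lex, t_lex, alpha=0.5):
    """Combine latent cosine similarity with lexicon dot-product similarity.
    The equal-weight fusion is an assumption, not the paper's scheme."""
    cos = float(v_lat @ t_lat) / (np.linalg.norm(v_lat) * np.linalg.norm(t_lat))
    lex = float(v_lex @ t_lex)
    return alpha * cos + (1 - alpha) * lex
```

For example, two tokens with logits `[1.0, -2.0]` and `[0.5, 3.0]` over a two-concept lexicon pool to `[1.0, 3.0]`, giving the sparse representation `[log 2, log 4]`; the second concept dominates while any dimension with only negative evidence would be exactly zero.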