COBRA: Contrastive Bi-Modal Representation Learning

Semantic Scholar (2020)

Abstract
A wide range of applications involve multi-modal data, such as cross-modal retrieval, visual question answering, and image captioning. These applications depend primarily on aligned distributions across the constituent modalities. Existing approaches generate latent embeddings for each modality jointly by representing them in a common manifold. However, these joint embedding spaces fail to sufficiently reduce the modality gap, which degrades performance on downstream tasks. We hypothesize that such embeddings retain the intra-class relationships but fail to preserve the inter-class dynamics. In this paper, we present COBRA, a novel framework that trains two modalities (image and text) jointly, inspired by the Contrastive Predictive Coding (CPC) and Noise Contrastive Estimation (NCE) paradigms, so as to preserve both inter-class and intra-class relationships. We conduct extensive experiments on two downstream tasks spanning three benchmark cross-modal datasets. The results show that our framework achieves state-of-the-art performance and outperforms existing work by generating a robust, task-agnostic joint embedding space.
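The abstract credits the CPC and NCE paradigms for the contrastive objective but does not spell it out. The sketch below shows a generic InfoNCE-style loss over paired image/text embeddings, the standard construction behind both paradigms; it is an illustration of the general technique, not COBRA's actual objective, and the encoders, batch size, dimensionality, and temperature are all hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(image_emb, text_emb, temperature=0.07):
    """InfoNCE-style contrastive loss for a batch of paired
    image/text embeddings. Matched pairs act as positives; every
    other pairing in the batch serves as a negative."""
    # L2-normalize so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Pairwise similarity matrix: logits[i, j] = sim(image_i, text_j).
    logits = image_emb @ text_emb.t() / temperature
    # The positive for image i is text i (and vice versa).
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: image-to-text plus text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Usage with stand-in encoder outputs (hypothetical shapes).
batch, dim = 32, 256
image_emb = torch.randn(batch, dim)  # stand-in for an image encoder
text_emb = torch.randn(batch, dim)   # stand-in for a text encoder
loss = info_nce_loss(image_emb, text_emb)
```

Pulling matched pairs together while pushing apart all in-batch mismatches is what lets a contrastive objective of this kind shape both intra-class and inter-class structure in the shared embedding space.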