TriCoLo: Trimodal Contrastive Loss for Text to Shape Retrieval

IEEE/CVF Winter Conference on Applications of Computer Vision (2024)

Abstract
Text-to-shape retrieval is an increasingly relevant problem with the growth of 3D shape data. Recent work on contrastive losses for learning joint embeddings over multimodal data [45] has been successful at tasks such as retrieval and classification. Thus far, work on joint representation learning for 3D shapes and text has focused on improving embeddings by modeling complex attention between representations [53] or through multi-task learning [25]. We propose a trimodal learning scheme over text, multi-view images, and 3D shape voxels, and show that with large-batch contrastive learning we achieve good performance on text-to-shape retrieval without complex attention mechanisms or losses. Our experiments serve as a foundation for follow-up work on building trimodal embeddings for text-image-shape.
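The trimodal scheme described above can be sketched as a sum of pairwise contrastive (InfoNCE-style) losses over the three modality embeddings. The sketch below is a minimal NumPy illustration, not the paper's implementation: the temperature value, the equal weighting of the three pairs, and the function names are assumptions for illustration only.

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of embeddings.

    Matching rows of `a` and `b` (same batch index) are positives;
    all other rows in the batch act as negatives.
    """
    # L2-normalize so dot products are cosine similarities
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature  # (batch, batch) similarity matrix

    def xent(l):
        # cross-entropy with the diagonal (matching index) as the target
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # average over both retrieval directions (a->b and b->a)
    return 0.5 * (xent(logits) + xent(logits.T))

def trimodal_loss(text_emb, image_emb, voxel_emb):
    """Hypothetical trimodal objective: sum of the three pairwise losses
    (text-image, text-voxel, image-voxel), equally weighted."""
    return (info_nce(text_emb, image_emb)
            + info_nce(text_emb, voxel_emb)
            + info_nce(image_emb, voxel_emb))
```

With large batches, each example sees many in-batch negatives across all three pairs, which is what lets this simple objective work without attention mechanisms.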
Keywords
Algorithms, Vision + language and/or other modalities, 3D computer vision