Joint representation learning for text and 3D point cloud

Pattern Recognition (2024)

Abstract
Recent advancements in vision-language pre-training (e.g., CLIP) have enabled 2D vision models to benefit from language supervision. However, joint representation learning of 3D point clouds with text remains under-explored due to the difficulty of acquiring 3D-text data pairs. Prior works project point clouds into 2D depth maps and apply CLIP directly, but this sacrifices 3D structural information and limits applicability. In this paper, we put forward Text4Point, a novel framework for constructing language-guided 3D models for dense prediction tasks. Text4Point uses 2D images as a bridge between the point cloud and language modalities, and follows a pre-training and fine-tuning paradigm. During pre-training, we leverage dense contrastive learning to align image and point cloud representations using readily available RGB-D data. Together with the well-aligned image and text features provided by CLIP, the point cloud features are thereby implicitly aligned with the text embeddings. Further, we propose a Text Querying Module that integrates language information into 3D representation learning by querying text embeddings with point cloud features. During fine-tuning, the model learns 3D representations under informative language guidance without requiring 2D images. Extensive experiments demonstrate consistent improvements with Text4Point on various dense prediction tasks.
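To make the pre-training step concrete, below is a minimal sketch of dense contrastive alignment between image and point cloud features, assuming pixel-to-point correspondences have already been obtained from the RGB-D camera geometry. The function and variable names (`dense_contrastive_loss`, `img_feats`, `pc_feats`) are illustrative, not the paper's actual implementation.

```python
# A minimal sketch of dense (per-point) contrastive alignment, assuming
# precomputed pixel-to-point correspondences from RGB-D data. All names
# here are illustrative, not the paper's API.
import torch
import torch.nn.functional as F

def dense_contrastive_loss(img_feats, pc_feats, temperature=0.07):
    """Symmetric InfoNCE over matched pixel/point feature pairs.

    img_feats: (N, D) image features sampled at the pixels that project
               to the N matched points.
    pc_feats:  (N, D) point cloud features for the same N points.
    """
    img_feats = F.normalize(img_feats, dim=-1)
    pc_feats = F.normalize(pc_feats, dim=-1)
    # Similarity of every point feature to every pixel feature.
    logits = pc_feats @ img_feats.t() / temperature  # (N, N)
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs sit on the diagonal; all other pairs act as negatives.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```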
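The Text Querying Module is described only at a high level in the abstract; the sketch below assumes it can be read as cross-attention in which point cloud features form the queries and frozen CLIP text embeddings form the keys and values. The dimensions, module names, and the residual fusion are assumptions for illustration.

```python
# A hypothetical reading of the Text Querying Module as cross-attention:
# point features query frozen CLIP text embeddings. Dimensions and the
# residual fusion are assumptions, not the paper's specification.
import torch
import torch.nn as nn

class TextQueryingModule(nn.Module):
    def __init__(self, point_dim=256, text_dim=512, num_heads=8):
        super().__init__()
        self.to_q = nn.Linear(point_dim, text_dim)
        self.attn = nn.MultiheadAttention(text_dim, num_heads,
                                          batch_first=True)
        self.proj = nn.Linear(text_dim, point_dim)

    def forward(self, point_feats, text_embeds):
        """point_feats: (B, N, point_dim) per-point features.
        text_embeds: (B, K, text_dim) CLIP text embeddings, e.g. one per
        class-name prompt, kept frozen during training.
        """
        q = self.to_q(point_feats)                        # (B, N, text_dim)
        ctx, _ = self.attn(q, text_embeds, text_embeds)   # attend over text
        # Fuse the queried language context back into the point features.
        return point_feats + self.proj(ctx)

# Usage sketch with dummy tensors: enrich point features with language cues.
tqm = TextQueryingModule()
pts = torch.randn(2, 1024, 256)   # dummy point features
txt = torch.randn(2, 20, 512)     # dummy CLIP text embeddings
out = tqm(pts, txt)               # (2, 1024, 256)
```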
Keywords
Point cloud, Multi-modal learning, Representation learning