Weakly Supervised 3D Open-vocabulary Segmentation

NeurIPS (2023)

Abstract
Open-vocabulary segmentation of 3D scenes is a fundamental function of human perception and thus a crucial objective in computer vision research. However, this task is heavily impeded by the lack of large-scale and diverse 3D open-vocabulary segmentation datasets for training robust and generalizable models. Distilling knowledge from pre-trained 2D open-vocabulary segmentation models helps, but it compromises the open-vocabulary property because the 2D models are mostly finetuned on closed-vocabulary datasets. We tackle the challenges in 3D open-vocabulary segmentation by exploiting the pre-trained foundation models CLIP and DINO in a weakly supervised manner. Specifically, given only the open-vocabulary text descriptions of the objects in a scene, we distill the open-vocabulary multimodal knowledge and object reasoning capability of CLIP and DINO into a neural radiance field (NeRF), which effectively lifts 2D features into view-consistent 3D segmentation. A notable aspect of our approach is that it does not require any manual segmentation annotations for either the foundation models or the distillation process. Extensive experiments show that our method even outperforms fully supervised models trained with segmentation annotations in certain scenes, suggesting that 3D open-vocabulary segmentation can be effectively learned from 2D images and text-image pairs. Code is available at .
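
The idea of segmenting from text descriptions alone can be illustrated with a generic sketch, which is not the authors' implementation. It assumes each pixel already has a feature vector living in the same embedding space as the text descriptions (as CLIP-style models provide), and labels each pixel with the most similar description by cosine similarity. The function name `assign_labels`, the array shapes, and the toy data are hypothetical.

```python
import numpy as np

def assign_labels(pixel_features, text_embeddings, eps=1e-8):
    """Label each pixel with the index of its most similar text description.

    pixel_features:  (H, W, D) array, hypothetical per-pixel features.
    text_embeddings: (C, D) array, one embedding per class description.
    Returns an (H, W) integer array of class indices.
    """
    # Normalize both sides so the dot product equals cosine similarity.
    f = pixel_features / (np.linalg.norm(pixel_features, axis=-1, keepdims=True) + eps)
    t = text_embeddings / (np.linalg.norm(text_embeddings, axis=-1, keepdims=True) + eps)
    similarity = f @ t.T  # shape (H, W, C)
    # Pick the best-matching description for every pixel.
    return similarity.argmax(axis=-1)

# Toy data: three class embeddings and a 1x2 "image" of features.
toy_text = np.eye(3)
toy_pixels = np.array([[[2.0, 0.1, 0.0], [0.0, 0.2, 3.0]]])
toy_labels = assign_labels(toy_pixels, toy_text)
```

Because the class set is given only as text embeddings, changing the vocabulary means swapping `text_embeddings`, with no retraining of a fixed classifier head in this sketch.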
Keywords
weakly supervised 3D, open-vocabulary