Multi-modal fusion architecture search for camera-based semantic scene completion

EXPERT SYSTEMS WITH APPLICATIONS (2024)

Abstract
Camera-based semantic scene completion (SSC) aims to infer the 3D volumetric occupancy and semantic categories of a scene simultaneously from a single RGB image. The main challenge of camera-based SSC is the lack of geometric information compared with RGB-D SSC. Although depth estimated from the RGB image helps SSC to some extent, the quality of depth prediction falls far short of what SSC demands. To address this problem, we propose a NAS-based multi-modal fusion method that incorporates semantic and geometric information from other intermediate representations (predicted depth and predicted 2D segmentation) to form a more robust 2D feature representation. A key idea of this design is that explicit 2D semantic information can alleviate the misleading effect of 3D distortions introduced by the estimated depth. Specifically, we propose the Confidence-Block to automatically learn an optimal architecture for routing and obtaining the depth prediction confidence. We propose a two-level fusion search space by decomposing the fusion search space into a fusion stage search space and a fusion operation search space. Moreover, we propose a confidence-aware 2D-3D projection module to alleviate 3D projection errors. Extensive experiments show that our method outperforms state-of-the-art methods by a large margin using a single RGB image on the NYU and NYUCAD datasets.
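To illustrate the confidence-aware 2D-3D projection idea mentioned in the abstract, below is a minimal sketch (not the authors' code) of scattering 2D image features into a voxel grid while down-weighting pixels whose predicted depth has low confidence. It assumes a PyTorch-style implementation; all names (e.g. `project_features`, `voxel_origin`, `grid_dims`) are illustrative.

```python
# Minimal sketch of a confidence-weighted 2D-to-3D feature projection,
# assuming a per-pixel depth-confidence map in [0, 1]. Illustrative only;
# the paper's actual module may differ.
import torch

def project_features(feat2d, depth, conf, K, voxel_origin, voxel_size, grid_dims):
    """Scatter 2D features into a 3D voxel grid, weighted by depth confidence.

    feat2d: (C, H, W) 2D feature map
    depth:  (H, W) predicted depth in meters
    conf:   (H, W) depth-prediction confidence in [0, 1]
    K:      (3, 3) camera intrinsics
    voxel_origin: (3,) world coordinate of the grid corner
    voxel_size:   scalar voxel edge length in meters
    grid_dims:    (X, Y, Z) voxel grid resolution
    """
    C, H, W = feat2d.shape
    X, Y, Z = grid_dims

    # Back-project every pixel to a 3D camera-frame point using predicted depth.
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    z = depth
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts = torch.stack([x, y, z], dim=-1).reshape(-1, 3)            # (H*W, 3)

    # Convert points to voxel indices and keep only those inside the grid.
    idx = ((pts - voxel_origin) / voxel_size).long()               # (H*W, 3)
    inside = ((idx >= 0) & (idx < torch.tensor([X, Y, Z]))).all(dim=1)

    flat_idx = idx[inside, 0] * Y * Z + idx[inside, 1] * Z + idx[inside, 2]
    w = conf.reshape(-1)[inside]                                   # per-pixel weight
    f = feat2d.reshape(C, -1)[:, inside] * w                       # confidence-weighted features

    # Accumulate weighted features and weights, then normalize per voxel,
    # so pixels with unreliable depth contribute less to each voxel.
    vol = torch.zeros(C, X * Y * Z)
    wsum = torch.zeros(X * Y * Z)
    vol.index_add_(1, flat_idx, f)
    wsum.index_add_(0, flat_idx, w)
    vol = vol / wsum.clamp(min=1e-6)
    return vol.reshape(C, X, Y, Z)
```

In this sketch, normalizing each voxel by its accumulated confidence means voxels populated mainly by low-confidence depth receive attenuated features, which is one way a projection step can reduce the influence of depth-estimation errors on the 3D representation.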
Keywords
Semantic scene completion, NAS (neural architecture search), Multi-modal fusion