Multi-modal fusion architecture search for camera-based semantic scene completion

EXPERT SYSTEMS WITH APPLICATIONS (2024)

Abstract
Camera-based semantic scene completion (SSC) aims to infer the 3D volumetric occupancy and semantic categories of a scene simultaneously from a single RGB image. The main challenge of camera-based SSC is the lack of geometric information compared with RGB-D SSC. Although depth estimated from the RGB image helps SSC to some extent, the quality of depth prediction falls far short of what SSC demands. To address this problem, we propose a NAS-based multi-modal fusion method that incorporates semantic and geometric information from other intermediate representations (predicted depth and predicted 2D segmentation) to form a more robust 2D feature representation. A key idea of this design is that explicit 2D semantic information can alleviate the misleading effect of 3D distortions introduced by the estimated depth. Specifically, we propose the Confidence-Block to automatically learn an optimal architecture for routing and obtaining the depth prediction confidence. We propose a two-level fusion search space by decomposing the fusion search space into a fusion stage search space and a fusion operation search space. Moreover, we propose a confidence-aware 2D-3D projection module to alleviate 3D projection errors. Extensive experiments show that our method outperforms state-of-the-art methods by a large margin using a single RGB image on the NYU and NYUCAD datasets.
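To illustrate the confidence-aware 2D-3D projection idea mentioned in the abstract, below is a minimal sketch (not the authors' code) of scattering 2D image features into a voxel grid while down-weighting pixels whose predicted depth has low confidence. It assumes a PyTorch-style implementation; all names (e.g. `project_features`, `voxel_origin`, `grid_dims`) are illustrative.

```python
# Minimal sketch of a confidence-weighted 2D-to-3D feature projection,
# assuming a per-pixel depth-confidence map in [0, 1]. Illustrative only;
# the paper's actual module may differ.
import torch

def project_features(feat2d, depth, conf, K, voxel_origin, voxel_size, grid_dims):
    """Scatter 2D features into a 3D voxel grid, weighted by depth confidence.

    feat2d: (C, H, W) 2D feature map
    depth:  (H, W) predicted depth in meters
    conf:   (H, W) depth-prediction confidence in [0, 1]
    K:      (3, 3) camera intrinsics
    voxel_origin: (3,) world coordinate of the grid corner
    voxel_size:   scalar voxel edge length in meters
    grid_dims:    (X, Y, Z) voxel grid resolution
    """
    C, H, W = feat2d.shape
    X, Y, Z = grid_dims

    # Back-project every pixel to a 3D camera-frame point using predicted depth.
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    z = depth
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts = torch.stack([x, y, z], dim=-1).reshape(-1, 3)            # (H*W, 3)

    # Convert points to voxel indices and keep only those inside the grid.
    idx = ((pts - voxel_origin) / voxel_size).long()               # (H*W, 3)
    inside = ((idx >= 0) & (idx < torch.tensor([X, Y, Z]))).all(dim=1)

    flat_idx = idx[inside, 0] * Y * Z + idx[inside, 1] * Z + idx[inside, 2]
    w = conf.reshape(-1)[inside]                                   # per-pixel weight
    f = feat2d.reshape(C, -1)[:, inside] * w                       # confidence-weighted features

    # Accumulate weighted features and weights, then normalize per voxel,
    # so pixels with unreliable depth contribute less to each voxel.
    vol = torch.zeros(C, X * Y * Z)
    wsum = torch.zeros(X * Y * Z)
    vol.index_add_(1, flat_idx, f)
    wsum.index_add_(0, flat_idx, w)
    vol = vol / wsum.clamp(min=1e-6)
    return vol.reshape(C, X, Y, Z)
```

In this sketch, normalizing each voxel by its accumulated confidence means voxels populated mainly by low-confidence depth receive attenuated features, which is one way a projection step can reduce the influence of depth-estimation errors on the 3D representation.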
Keywords
Semantic scene completion, NAS (neural architecture search), Multi-modal fusion