Semantic MapNet: Building Allocentric Semantic Maps and Representations from Egocentric Views

Keywords: Replica scenes, Semantic MapNet, living room, semantic mapping, question answering

Abstract:

We study the task of semantic mapping - specifically, an embodied agent (a robot or an egocentric AI assistant) is given a tour of a new environment and asked to build an allocentric top-down semantic map ("what is where?") from egocentric observations of an RGB-D camera with known pose (via localization sensors). Towards this goal, we ...

Introduction
  • Imagine yourself receiving a tour of a new environment. Maybe you visit a friend’s new house and they show you around (‘This is the living room, and down here is the study’).
  • [Figure: (a) Agent Trajectory, (b) Egocentric Observations, (c) Spatial Memory, (d) Top-down Segmentation.] The goal is a representation the agent can reuse to perform new tasks in these spaces.
  • At one end of this spectrum are Segment-then-Project (Seg.→Proj.) approaches (Sengupta et al. 2012; Sunderhauf et al. 2016; Maturana et al. 2018a) that first perform egocentric semantic segmentation and then use the known camera pose and per-pixel depth to project those labels onto an allocentric map.
Highlights
  • Imagine yourself receiving a tour of a new environment
  • Our experiments focus on top-down semantic segmentation, i.e. each pixel in the top-down map is assigned to a single class label
  • At one end of this spectrum are Segment-then-Project (Seg.→Proj.) approaches (Sengupta et al. 2012; Sunderhauf et al. 2016; Maturana et al. 2018a) that first perform egocentric semantic segmentation and then use the known camera pose and per-pixel depth to project those labels onto an allocentric map (see the sketch after this list). We find that this results in 'label splatter': any mistakes the egocentric segmentation makes at the depth boundaries of objects get splattered on the map around the object
  • We demonstrate via extensive experiments how representations built by Semantic MapNet (SMNet) from a single tour of an environment can be reused for ObjectNav and Embodied Question Answering (Das et al. 2018)
  • Project→Segment performs poorly compared to the approaches that operate on egocentric images prior to projection. This suggests that details lost in the top-down view are important for disambiguating objects; e.g., the chairs at the table in Fig. 4 are difficult to see in the top-down RGB and are completely lost by this approach
  • As our approach reasons over a spatial memory tensor, it can reason over multiple observations of the same point, achieving an mBF1 of 37.02 and an mIoU of 36.77
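To make the Seg.→Proj. end of this spectrum concrete, here is a minimal sketch (not the authors' code): each egocentric pixel label is lifted to 3D with the depth map and camera intrinsics, transformed by the known camera pose, and written into a top-down grid. The helper names, the 2 cm cell size, and the last-write-wins rule are illustrative assumptions.

```python
import numpy as np

def unproject_to_world(depth, K, T_world_cam):
    """Back-project an HxW depth map to world-frame 3D points, shape (H*W, 3)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))   # pixel coordinates
    z = depth.reshape(-1)
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]      # pinhole back-projection
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)        # homogeneous
    return (T_world_cam @ pts_cam.T).T[:, :3]

def splat_labels(labels, depth, K, T_world_cam, top_map, origin, cell=0.02):
    """Write per-pixel class labels into a top-down grid (last write wins)."""
    pts = unproject_to_world(depth, K, T_world_cam)
    # assuming y is the gravity axis: world x -> map column, world z -> map row
    cols = ((pts[:, 0] - origin[0]) / cell).astype(int)
    rows = ((pts[:, 2] - origin[1]) / cell).astype(int)
    ok = (rows >= 0) & (rows < top_map.shape[0]) & \
         (cols >= 0) & (cols < top_map.shape[1]) & (depth.reshape(-1) > 0)
    top_map[rows[ok], cols[ok]] = labels.reshape(-1)[ok]
    return top_map
```

Because mislabeled pixels near depth boundaries land in scattered map cells, this projection step is where the 'label splatter' described above comes from.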
Results
  • This is accomplished by projecting egocentric features to the appropriate locations in an allocentric spatial memory and using this memory to decode a top-down semantic segmentation (a feature-projection sketch follows this list).
  • These works focus on spatial memories as part of an end-to-end agent for a downstream task and do not evaluate the quality of the generated maps in terms of environment semantics directly.
  • Like Semantic MapNet, these approaches project intermediate features into a spatial memory and decode segmentations from that structure.
  • – A Map Decoder that uses the accumulated memory tensor to produce top-down semantic segmentations.
  • The authors choose the Matterport3D scans (Chang et al. 2017) with the Habitat simulator (Savva et al. 2019) over video segmentation datasets for several reasons: Matterport3D provides semantic annotations in 3D; the spaces are large enough to allow multi-room traversal by the agent (in contrast to Dai et al. 2017 and Silberman et al. 2012); and the use of a 3D simulator (in contrast to Geiger, Lenz, and Urtasun 2012 and Cordts et al. 2016) allows them to render RGB-D from any viewpoint, create top-down semantic annotations, and study embodied AI applications in the same environments.
  • As depicted in Fig. 2, there exists a spectrum of methodologies for the task based on what is being projected from egocentric observations to the top-down map – pixels, features, or labels.
  • As agents traverse the scene, the observed RGB pixels are projected to the top-down map using the mapper architecture – resulting in a top-down RGB image of the environment.
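The feature-projection alternative referenced above can be sketched as follows (a simplification, not SMNet's exact architecture): egocentric CNN features are scattered into a flattened allocentric memory tensor at the map cell each pixel projects to. The elementwise-max update and the precomputed `cell_index` (obtained from depth and pose as in the earlier projection sketch) are assumptions for brevity; the per-cell update SMNet actually uses is learned.

```python
import torch

def update_memory(memory, feat, cell_index, valid):
    """
    memory:     (M, C) flattened allocentric memory (M map cells, C channels)
    feat:       (C, H, W) egocentric feature map for the current frame
    cell_index: (H*W,) long tensor mapping each pixel to a flattened map cell
    valid:      (H*W,) bool mask for pixels with valid depth / in-bounds projection
    """
    C = feat.shape[0]
    f = feat.reshape(C, -1).t()[valid]                  # (N, C) features to write
    idx = cell_index[valid].unsqueeze(1).repeat(1, C)   # (N, C) target cells
    # keep, per cell and channel, the strongest feature observed so far
    memory.scatter_reduce_(0, idx, f, reduce="amax", include_self=True)
    return memory
```

Accumulating features rather than labels lets the map decoder reconcile multiple observations of the same point before committing to a class.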
Conclusion
  • Agents perform semantic segmentation on each egocentric frame and project the resulting labels using the mapper architecture to create the top-down segmentation.
  • These results are of the same order of magnitude as the state-of-the-art methods submitted to the Habitat Challenge, suggesting that the memory tensor contains useful spatial and semantic information in this pre-exploration setting.
  • In the pre-exploration setting, the agent first navigates the environment on a fixed trajectory to generate the spatial memory tensor, and then plans over the resulting map (see the planning sketch below).
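To make this pre-exploration ObjectNav setup concrete, below is a minimal sketch of grid planning over the pre-built map: pick a goal cell (e.g., the nearest cell of the goal category in the decoded semantic map) and run A* over the free-space grid. The 4-connected neighborhood, unit step costs, and Manhattan heuristic are assumptions; the authors' planner may differ in its details.

```python
import heapq

def astar(free, start, goal):
    """4-connected A* over a boolean free-space grid (NumPy array); returns a cell path or None."""
    H, W = free.shape
    heur = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])   # Manhattan distance
    frontier = [(heur(start), 0, start, None)]                   # (f, g, cell, parent)
    parents, best_g = {}, {start: 0}
    while frontier:
        _, g, cur, parent = heapq.heappop(frontier)
        if cur in parents:                                       # already expanded
            continue
        parents[cur] = parent
        if cur == goal:
            break
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cur[0] + dr, cur[1] + dc)
            if 0 <= nxt[0] < H and 0 <= nxt[1] < W and free[nxt] \
                    and g + 1 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g + 1
                heapq.heappush(frontier, (g + 1 + heur(nxt), g + 1, nxt, cur))
    if goal not in parents:
        return None                                              # goal unreachable
    path, node = [], goal
    while node is not None:                                      # walk parents back to start
        path.append(node)
        node = parents[node]
    return path[::-1]
```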
Summary
  • Imagine yourself receiving a tour of a new environment. Maybe you visit a friend’s new house and they show you around (‘This is the living room, and down here is the study’).
  • [Figure: (a) Agent Trajectory, (b) Egocentric Observations, (c) Spatial Memory, (d) Top-down Segmentation.] The goal is a representation the agent can reuse to perform new tasks in these spaces.
  • At one end of this spectrum are Segment-then-Project (Seg.→Proj.) approaches (Sengupta et al. 2012; Sunderhauf et al. 2016; Maturana et al. 2018a) that first perform egocentric semantic segmentation and then use the known camera pose and per-pixel depth to project those labels onto an allocentric map.
  • This is accomplished by projecting egocentric features to appropriate locations in an allocentric spatial memory, and using this memory to decode a top-down semantic segmentation.
  • These works focus on spatial memories as part of an end-to-end agent for a downstream task and do not evaluate the quality of the generated maps in terms of environment semantics directly.
  • Like Semantic MapNet, these approaches project intermediate features into a spatial memory and decode segmentations from that structure.
  • – A Map Decoder that uses the accumulated memory tensor to produce top-down semantic segmentations (a minimal decoder sketch follows this list).
  • The authors choose the Matterport3D scans (Chang et al. 2017) with the Habitat simulator (Savva et al. 2019) over video segmentation datasets for several reasons: Matterport3D provides semantic annotations in 3D; the spaces are large enough to allow multi-room traversal by the agent (in contrast to Dai et al. 2017 and Silberman et al. 2012); and the use of a 3D simulator (in contrast to Geiger, Lenz, and Urtasun 2012 and Cordts et al. 2016) allows them to render RGB-D from any viewpoint, create top-down semantic annotations, and study embodied AI applications in the same environments.
  • As depicted in Fig. 2, there exists a spectrum of methodologies for the task based on what is being projected from egocentric observations to the top-down map – pixels, features, or labels.
  • As agents traverse the scene, the observed RGB pixels are projected to the top-down map using the mapper architecture – resulting in a top-down RGB image of the environment.
  • Agents perform semantic segmentation on each egocentric frame and project the resulting labels using the mapper architecture to create the top-down segmentation.
  • These results are of the same order of magnitude as the state-of-the-art methods submitted to the Habitat Challenge, suggesting that the memory tensor contains useful spatial and semantic information in this pre-exploration setting.
  • In the pre-exploration setting, the agent first navigates the environment on a fixed trajectory to generate the spatial memory tensor.
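As a sketch of the Map Decoder mentioned in the list above: a small fully-convolutional head reads the accumulated memory tensor and predicts per-cell class logits. Layer sizes, channel counts, and the number of classes here are placeholder assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MapDecoder(nn.Module):
    """Decode an allocentric memory tensor into per-cell semantic class logits."""
    def __init__(self, mem_channels=256, num_classes=12):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(mem_channels, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, num_classes, kernel_size=1),   # per-cell class logits
        )

    def forward(self, memory):                           # memory: (B, C, H, W)
        return self.net(memory)                          # logits: (B, K, H, W)
```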
Tables
  • Table1: Results on top-down semantic segmentation on the Matterport3D and Replica datasets. Models have not been trained on Replica; those results are purely transfer experiments. SMNet outperforms the baselines on mIoU and BF1 for Matterport3D and on mIoU for Replica. Results are reported with bootstrapped standard error (see supplement for category-level breakdowns). SMNet performs best on the mIoU metric at 43.12. These results demonstrate that an approach which interleaves projective geometry and learning can provide more robust allocentric semantic representations (a short sketch of the mIoU computation follows this list)
  • Table2: Train/val/test environments for Matterport3D scenes (Chang et al. 2017) in our dataset
  • Table3: Test environments for Replica scenes (Straub et al. 2019)
  • Table4: Category-level performances of SMNet and baseline approaches in the Matterport3D test set
  • Table5: Category-level performances of SMNet and baseline approaches in the Replica dataset. Note that the fireplace category is not present in the Replica dataset
  • Table6: Results of SMNet on top-down semantic segmentation on the Matterport3D dataset under different settings. Here we experiment with different egocentric features extracted at different stages in RedNet. We also vary the number of channels in the memory tensor
  • Table7: Results of the Seg. → Proj. baseline on top-down semantic segmentation on the Matterport3D dataset under different settings
  • Table8: ObjectNav results comparison with an A* planner using the ground-truth free-space maps
  • Table9: ObjectNav results on a subset of episodes (771) from the validation set of the ObjectNav Habitat challenge (hab 2020). This table compares the performance of our A* planner when ground-truth free-space maps and ground-truth semantic maps are provided
  • Table10: Per house breakdown results on the validation set of the ObjectNav Habitat challenge (hab 2020)
  • Table11: Per house breakdown results on the validation set of the ObjectNav Habitat challenge (hab 2020) with ground-truth free space maps
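For reference, the mIoU figures reported in these tables can be computed from a per-class confusion matrix over map cells, as sketched below (treating all listed classes uniformly and ignoring any void class is an assumption of this sketch).

```python
import numpy as np

def mean_iou(conf):
    """conf[i, j] = number of map cells with ground-truth class i predicted as class j."""
    tp = np.diag(conf).astype(float)          # true positives per class
    fp = conf.sum(axis=0) - tp                # predicted as the class but wrong
    fn = conf.sum(axis=1) - tp                # missed cells of the class
    iou = tp / np.maximum(tp + fp + fn, 1.0)  # guard against empty classes
    return float(iou.mean())
```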
Related work
  • Spatial Episodic Memories for Embodied Agents. Building and dynamically updating a spatial memory is a powerful inductive bias that has been studied in many embodied settings. Most SLAM systems perform localization by registration to sets of localized keypoint features (Mur-Artal and Tardos 2017). Many recent works in embodied AI have developed agents for navigation (Anderson et al. 2019; Beeching et al. 2020; Gupta et al. 2017; Georgakis, Li, and Kosecka 2019; Blukis et al. 2018) and localization (Henriques and Vedaldi 2018; Parisotto and Salakhutdinov 2017; Zhang et al. 2017) that build 2.5D spatial memories containing deep features from egocentric observations. Like our approach, these all involve some variation of egocentric feature extraction, pin-hole camera projection, and map update mechanisms. However, these works focus on spatial memories as part of an end-to-end agent for a downstream task; they do not evaluate the quality of the generated maps in terms of environment semantics directly, nor do they study how segmentation quality affects downstream tasks.
Funding
  • We compare our approach to a 'prior' baseline that answers with the most frequent answer in the training set. Our approach outperforms this baseline across the board: 27.78% vs. 20.83% accuracy, 13.19% vs. 9.18% class-balanced accuracy, and 5.35 vs. 6.98 RMSE (lower is better; a sketch of class-balanced accuracy follows).
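A sketch of the class-balanced accuracy used above, assuming the usual definition of per-answer-class accuracy averaged uniformly over classes (so frequent answers do not dominate the score):

```python
import numpy as np

def class_balanced_accuracy(preds, gts):
    """Average of per-class accuracies over the answer classes present in gts."""
    classes = np.unique(gts)
    per_class = [(preds[gts == c] == c).mean() for c in classes]
    return float(np.mean(per_class))
```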
Reference
  • 2020. Habitat Challenge 2020 @ Embodied AI Workshop. CVPR 2020. https://aihabitat.org/challenge/2020/.
  • Anderson, P.; Shrivastava, A.; Parikh, D.; Batra, D.; and Lee, S. 2019. Chasing Ghosts: Instruction Following as Bayesian State Tracking. In Advances in Neural Information Processing Systems (NeurIPS), 369–379.
  • Anderson, P.; Wu, Q.; Teney, D.; Bruce, J.; Johnson, M.; Sunderhauf, N.; Reid, I.; Gould, S.; and van den Hengel, A. 2018. Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Armeni, I.; Sener, O.; Zamir, A. R.; Jiang, H.; Brilakis, I.; Fischer, M.; and Savarese, S. 2016. 3d semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1534–1543.
  • Beeching, E.; Dibangoye, J.; Simonin, O.; and Wolf, C. 2020. EgoMap: Projective mapping and structured egocentric memory for Deep RL. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD).
  • Blukis, V.; Misra, D.; Knepper, R. A.; and Artzi, Y. 2018. Mapping Navigation Instructions to Continuous Control Actions with Position-Visitation Prediction. In Conference on Robot Learning, 505–518.
  • Chang, A.; Dai, A.; Funkhouser, T.; Halber, M.; Niessner, M.; Savva, M.; Song, S.; Zeng, A.; and Zhang, Y. 2017. Matterport3D: Learning from RGB-D data in indoor environments. International Conference on 3D Vision (3DV). MatterPort3D dataset license available at: http://kaldir.vc.in.tum.de/matterport/MP TOS.pdf.
  • Chaplot, D. S.; Gandhi, D.; Gupta, A.; and Salakhutdinov, R. 2020a. Object Goal Navigation using Goal-Oriented Semantic Exploration. arXiv preprint arXiv:2007.00643.
  • Chaplot, D. S.; Gandhi, D.; Gupta, S.; Gupta, A.; and Salakhutdinov, R. 2020b. Learning To Explore Using Active Neural SLAM. In International Conference on Learning Representations (ICLR).
  • Chaplot, D. S.; Jiang, H.; Gupta, S.; and Gupta, A. 2020c. Semantic Curiosity for Active Visual Learning. In ECCV.
  • Cheng, R.; Wang, Z.; and Fragkiadaki, K. 2018. Geometryaware recurrent neural networks for active visual recognition. In Advances in Neural Information Processing Systems, 5081–5091.
  • Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; and Schiele, B. 2016. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Csurka, G.; Larlus, D.; and Perronnin, F. 2013. What is a good evaluation measure for semantic segmentation? In BMVC, volume 27.
  • Dai, A.; Chang, A. X.; Savva, M.; Halber, M.; Funkhouser, T.; and Nießner, M. 2017. ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Das, A.; Datta, S.; Gkioxari, G.; Lee, S.; Parikh, D.; and Batra, D. 2018. Embodied question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2054–2063.
  • Epstein, R. A.; Patai, E. Z.; Julian, J. B.; and Spiers, H. J. 2017. The cognitive map in humans: spatial navigation and beyond. Nature Neuroscience 20(11): 1504–1513. doi:10. 1038/nn.4656. URL https://doi.org/10.1038/nn.4656.
  • Fraundorfer, F.; Engels, C.; and Nister, D. 2007. Topological mapping, localization and navigation using image collections. In IEEE/RSJ International Conference on Intelligent Robots and Systems.
  • Geiger, A.; Lenz, P.; and Urtasun, R. 2012. Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Georgakis, G.; Li, Y.; and Kosecka, J. 2019. Simultaneous Mapping and Target Driven Navigation. arXiv preprint arXiv:1911.07980.
  • Gordon, D.; Kembhavi, A.; Rastegari, M.; Redmon, J.; Fox, D.; and Farhadi, A. 2018. IQA: Visual question answering in interactive environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 4089–4098.
  • Grinvald, M.; Furrer, F.; Novkovic, T.; Chung, J. J.; Cadena, C.; Siegwart, R.; and Nieto, J. 2019. Volumetric instanceaware semantic mapping and 3D object discovery. IEEE Robotics and Automation Letters 4(3): 3037–3044.
  • Gupta, S.; Davidson, J.; Levine, S.; Sukthankar, R.; and Malik, J. 2017. Cognitive mapping and planning for visual navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2616–2625.
  • He, K.; Gkioxari, G.; Dollar, P.; and Girshick, R. 2017. Mask R-CNN. In Proc. of the IEEE International Conference on Computer Vision (ICCV), 2961–2969.
  • Henriques, J. F.; and Vedaldi, A. 2018. Mapnet: An allocentric spatial memory for mapping environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 8476–8484.
  • Jiang, J.; Zheng, L.; Luo, F.; and Zhang, Z. 2018. Rednet: Residual encoder-decoder network for indoor rgb-d semantic segmentation. arXiv preprint arXiv:1806.01054.
  • Kadian, A.; Truong, J.; Gokaslan, A.; Clegg, A.; Wijmans, E.; Lee, S.; Savva, M.; Chernova, S.; and Batra, D. 2019. Are We Making Real Progress in Simulated Environments? Measuring the Sim2Real Gap in Embodied Visual Navigation. arXiv preprint arXiv:1912.06321.
  • Kolve, E.; Mottaghi, R.; Han, W.; VanderBilt, E.; Weihs, L.; Herrasti, A.; Gordon, D.; Zhu, Y.; Gupta, A.; and Farhadi, A. 2017. Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:1712.05474.
  • Maturana, D.; Chou, P.-W.; Uenoyama, M.; and Scherer, S. 2018a. Real-Time Semantic Mapping for Autonomous OffRoad Navigation. In Hutter, M.; and Siegwart, R., eds., Field and Service Robotics, 335–350. Springer International Publishing. ISBN 978-3-319-67361-5.
  • Maturana, D.; Chou, P.-W.; Uenoyama, M.; and Scherer, S. 2018b. Real-time semantic mapping for autonomous offroad navigation. In Field and Service Robotics, 335–350. Springer.
  • McCormac, J.; Handa, A.; Davison, A.; and Leutenegger, S. 2017. Semanticfusion: Dense 3d semantic mapping with convolutional neural networks. In 2017 IEEE International Conference on Robotics and automation (ICRA), 4628–4635. IEEE.
  • Mur-Artal, R.; and Tardos, J. D. 2017. Orb-slam2: An opensource slam system for monocular, stereo, and rgb-d cameras. IEEE Transactions on Robotics 33(5): 1255–1262.
  • Mattyus, G.; Wang, S.; Fidler, S.; and Urtasun, R. 2015. Enhancing Road Maps by Parsing Aerial Images Around the World. In Proc. of the IEEE International Conference on Computer Vision (ICCV).
  • Nagarajan, T.; Li, Y.; Feichtenhofer, C.; and Grauman, K. 2020. EGO-TOPO: Environment Affordances from Egocentric Video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Silberman, N.; Hoiem, D.; Kohli, P.; and Fergus, R. 2012. Indoor Segmentation and Support Inference from RGBD Images. In Proceedings of the European Conference on Computer Vision (ECCV).
  • O’keefe, J.; and Nadel, L. 1978. The hippocampus as a cognitive map. Oxford: Clarendon Press.
  • Pan, B.; Sun, J.; Leung, H. Y. T.; Andonian, A.; and Zhou, B. 2020. Cross-view semantic segmentation for sensing surroundings. IEEE Robotics and Automation Letters 5(3): 4867–4873.
  • Parisotto, E.; and Salakhutdinov, R. 2017. Neural map: Structured memory for deep reinforcement learning. arXiv preprint arXiv:1702.08360.
  • Rosinol, A.; Abate, M.; Chang, Y.; and Carlone, L. 2019. Kimera: an Open-Source Library for Real-Time MetricSemantic Localization and Mapping. arXiv preprint arXiv:1910.02490.
  • Savva, M.; Kadian, A.; Maksymets, O.; Zhao, Y.; Wijmans, E.; Jain, B.; Straub, J.; Liu, J.; Koltun, V.; Malik, J.; Parikh, D.; and Batra, D. 2019. Habitat: A Platform for Embodied AI Research. In Proc. of the IEEE International Conference on Computer Vision (ICCV).
  • Sengupta, S.; Sturgess, P.; Ladicky, L.; and Torr, P. H. S. 2012. Automatic dense visual semantic mapping from street-level imagery. In IEEE/RSJ International Conference on Intelligent Robots and Systems.
  • Singh, S.; Batra, A.; Pang, G.; Torresani, L.; Basu, S.; Paluri, M.; and Jawahar, C. V. 2018. Self-supervised Feature Learning for Semantic Segmentation of Overhead Imagery. In Proceedings of the British Machine Vision Conference (BMVC).
  • Song, S.; Lichtenberg, S. P.; and Xiao, J. 2015. SUN RGBD: A RGB-D scene understanding benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 567–576.
  • Straub, J.; Whelan, T.; Ma, L.; Chen, Y.; Wijmans, E.; Green, S.; Engel, J. J.; Mur-Artal, R.; Ren, C.; Verma, S.; et al. 2019. The Replica dataset: A digital replica of indoor spaces. arXiv preprint arXiv:1906.05797.
  • Sunderhauf, N.; Dayoub, F.; McMahon, S.; Talbot, B.; Schulz, R.; Corke, P.; Wyeth, G.; Upcroft, B.; and Milford, M. 2016. Place categorization and semantic mapping on a mobile robot. In IEEE International Conference on Robotics and Automation (ICRA).
  • Tung, H.-Y. F.; Cheng, R.; and Fragkiadaki, K. 2019. Learning spatial common sense with geometry-aware recurrent networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2595–2603.
  • Wijmans, E.; Datta, S.; Maksymets, O.; Das, A.; Gkioxari, G.; Lee, S.; Essa, I.; Parikh, D.; and Batra, D. 2019. Embodied question answering in photorealistic environments with point cloud perception. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 6659–6668.
  • Wijmans, E.; Kadian, A.; Morcos, A.; Lee, S.; Essa, I.; Parikh, D.; Savva, M.; and Batra, D. 2020. Decentralized Distributed PPO: Solving PointGoal Navigation. International Conference on Learning Representations (ICLR).
  • Xia, F.; R. Zamir, A.; He, Z.-Y.; Sax, A.; Malik, J.; and Savarese, S. 2018. Gibson env: real-world perception for embodied agents. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE.
  • Yang, J.; Ren, Z.; Xu, M.; Chen, X.; Crandall, D. J.; Parikh, D.; and Batra, D. 2019. Embodied Amodal Recognition: Learning to Move to Perceive Objects. In Proc. of the IEEE International Conference on Computer Vision (ICCV), 2040–2050.
  • Zhang, J.; Tai, L.; Boedecker, J.; Burgard, W.; and Liu, M. 2017. Neural SLAM: Learning to explore with external memory. arXiv preprint arXiv:1706.09520.