Transformer-Based Self-Supervised Monocular Depth and Visual Odometry

IEEE Sensors Journal (2023)

Abstract
Self-supervised monocular depth estimation and visual odometry (VO) are often cast as coupled tasks: accurate depth contributes to precise pose estimation, and vice versa. Existing architectures typically stack convolutional layers and long short-term memory (LSTM) units to capture long-range dependencies, but the intrinsic locality of these operations prevents the models from achieving the expected performance gains. In this article, we propose a Transformer-based architecture, named Transformer-based self-supervised monocular depth and VO (TSSM-VO), to tackle these problems. It comprises two main components: 1) a depth generator that leverages the powerful capability of multihead self-attention (MHSA) to model long-range spatial dependencies and 2) a pose estimator built upon a Transformer to learn long-range temporal correlations across image sequences. Moreover, a new data augmentation loss based on structural similarity (SSIM) is introduced to further constrain the structural similarity between the augmented depth and the augmented predicted depth. Rigorous ablation studies and extensive performance comparisons on the KITTI and Make3D datasets demonstrate the superiority of TSSM-VO over other self-supervised methods. We expect TSSM-VO to enhance the ability of intelligent agents to understand their surrounding environments.
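The abstract does not spell out the exact form of the SSIM-based data augmentation loss. The following is a minimal sketch of one plausible instantiation, assuming the augmentation is a horizontal flip and that the loss penalizes structural dissimilarity between the augmented predicted depth and the depth predicted from the augmented image; the names `ssim`, `aug_consistency_loss`, and `depth_net` are illustrative, not from the paper.

```python
# Hedged sketch of an SSIM-based augmentation-consistency loss (not the authors' code).
import torch
import torch.nn.functional as F

def ssim(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
    """Per-pixel SSIM map computed over a 3x3 average-pooling window."""
    mu_x = F.avg_pool2d(x, 3, 1, padding=1)
    mu_y = F.avg_pool2d(y, 3, 1, padding=1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, padding=1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, padding=1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, padding=1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2)
    return num / den

def aug_consistency_loss(depth_net, image):
    """Compare the augmented predicted depth with the depth of the augmented
    image (horizontal flip assumed as the augmentation)."""
    depth = depth_net(image)                                   # D(I)
    depth_of_aug = depth_net(torch.flip(image, dims=[3]))      # D(aug(I))
    aug_of_depth = torch.flip(depth, dims=[3])                 # aug(D(I))
    dissim = torch.clamp((1.0 - ssim(aug_of_depth, depth_of_aug)) / 2.0, 0, 1)
    return dissim.mean()
```

In practice such a term would be added, with a weighting factor, to the usual self-supervised photometric and smoothness losses; the exact weighting and augmentation set used by TSSM-VO are not stated in the abstract.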
Keywords
Data augmentation loss, long-range dependencies, monocular depth estimation, multihead self-attention (MHSA), visual odometry (VO)