Geometric Consistency-Guaranteed Spatio-Temporal Transformer for Unsupervised Multi-View 3D Pose Estimation
IEEE Transactions on Instrumentation and Measurement (2024)
IoT Perception Mine Research Center
Abstract
Unsupervised 3D pose estimation has gained prominence due to the challenges of acquiring labeled 3D data for training. Despite promising progress, unsupervised approaches still lag behind supervised methods in performance. Two factors impede the progress of unsupervised approaches: incomplete geometric constraints and inadequate interaction among spatial, temporal, and multi-view features. This paper introduces an unsupervised pipeline that uses calibrated camera parameters as geometric constraints across views and coordinate spaces, optimizing the model by minimizing inconsistencies between the 2D input pose and the re-projection of the predicted 3D pose. The pipeline utilizes a novel Hierarchical Cross Transformer (HCT) to encode higher-level information by enabling interactions among hierarchical features containing different levels of temporal, spatial, and cross-view information. By minimizing reliance on human-specific parts, the HCT shows potential for adapting to various pose estimation tasks. To validate this adaptability, we build a connection between human pose estimation and scene pose estimation, introducing the Dynamic-Keypoints-3D (DK-3D) dataset tailored for 3D scene pose estimation in robotic manipulation. Experiments on two 3D human pose estimation datasets demonstrate our method's new state-of-the-art performance among weakly supervised and unsupervised approaches. The adaptability of our method is confirmed through experiments on DK-3D, setting the initial benchmark for unsupervised 2D-to-3D scene pose lifting.
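The core geometric constraint described above can be sketched as a re-projection consistency loss: the predicted 3D pose is projected back into each calibrated view and compared against the observed 2D pose. The following is a minimal illustrative sketch, not the paper's implementation; the function names and the simple pinhole model are assumptions for exposition.

```python
import numpy as np

def project(points_3d, K, R, t):
    """Project Nx3 world-space 3D joints to Nx2 pixel coordinates
    with a pinhole camera (K: 3x3 intrinsics, R: 3x3 rotation,
    t: length-3 translation)."""
    cam = points_3d @ R.T + t          # world -> camera coordinates
    uv = cam @ K.T                     # apply intrinsics
    return uv[:, :2] / uv[:, 2:3]      # perspective divide

def reprojection_loss(pred_3d, poses_2d, cameras):
    """Mean L2 distance between observed 2D joints and the
    re-projection of the predicted 3D joints, averaged over
    all calibrated views."""
    errs = [np.linalg.norm(project(pred_3d, *cam) - p2d, axis=-1).mean()
            for cam, p2d in zip(cameras, poses_2d)]
    return float(np.mean(errs))
```

Minimizing this loss over the training set is what lets the pipeline learn 3D structure without 3D labels: the calibrated multi-view geometry, rather than ground-truth 3D poses, supplies the supervisory signal.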
Key words
Pose Estimation, Multi-view, Transformer