Multimodal Transformers for Real-Time Surgical Activity Prediction
arXiv (2024)
Abstract
Real-time recognition and prediction of surgical activities are fundamental
to advancing safety and autonomy in robot-assisted surgery. This paper presents
a multimodal transformer architecture for real-time recognition and prediction
of surgical gestures and trajectories based on short segments of kinematic and
video data. We conduct an ablation study to evaluate the impact of fusing
different input modalities and their representations on gesture recognition and
prediction performance. We perform an end-to-end assessment of the proposed
architecture using the JHU-ISI Gesture and Skill Assessment Working Set
(JIGSAWS) dataset. Our model outperforms the state-of-the-art (SOTA) with
89.5% accuracy for gesture prediction through effective fusion of kinematic
features with spatial and contextual video features. Relying on a computationally
efficient design, it achieves real-time performance, processing a 1-second input
window in 1.1-1.3 ms.
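
The abstract does not specify the architecture's internals, so the sketch below is only one illustrative reading of "fusing kinematic features with video features in a transformer": per-frame kinematic and video feature vectors are projected into a shared embedding space, fused by summation, and passed through a transformer encoder that classifies the (next) gesture. All names and dimensions (kin_dim, vid_dim, the 30-frame window, additive fusion, the gesture count) are assumptions for illustration, not the paper's reported configuration.

```python
import torch
import torch.nn as nn

class MultimodalGestureTransformer(nn.Module):
    """Minimal sketch of a multimodal fusion transformer for surgical
    gesture recognition/prediction. Layer sizes and the fusion scheme
    are illustrative assumptions, not the paper's configuration."""

    def __init__(self, kin_dim=26, vid_dim=512, d_model=128,
                 n_heads=4, n_layers=2, n_gestures=15):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.kin_proj = nn.Linear(kin_dim, d_model)
        self.vid_proj = nn.Linear(vid_dim, d_model)
        # Learned positional embedding over the input window
        # (assumed here: 1 second sampled at 30 Hz -> 30 frames).
        self.pos_emb = nn.Parameter(torch.zeros(1, 30, d_model))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_gestures)

    def forward(self, kin, vid):
        # kin: (batch, T, kin_dim) robot kinematics;
        # vid: (batch, T, vid_dim) per-frame features from a video backbone.
        x = self.kin_proj(kin) + self.vid_proj(vid) \
            + self.pos_emb[:, :kin.size(1)]
        x = self.encoder(x)
        # Pool over time and classify the gesture for the window.
        return self.head(x.mean(dim=1))

# Usage with random tensors standing in for a 1-second window at 30 Hz.
model = MultimodalGestureTransformer()
kin = torch.randn(2, 30, 26)   # e.g. a subset of JIGSAWS kinematic variables
vid = torch.randn(2, 30, 512)  # e.g. pretrained-CNN feature vectors per frame
logits = model(kin, vid)       # shape: (2, n_gestures)
```

Summation after per-modality projection is only the simplest fusion choice; the paper's ablation over modality representations suggests alternatives such as token concatenation or cross-attention, which would slot into the same skeleton.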