Beyond MOT: Semantic Multi-Object Tracking
arXiv (2024)
Abstract
Current multi-object tracking (MOT) aims to predict trajectories of targets
(i.e.,"where") in videos. Yet, knowing merely "where" is insufficient in many
crucial applications. In comparison, semantic understanding, such as
fine-grained behaviors, interactions, and overall summarized captions (i.e.,
"what") from videos, associated with "where", is highly desired for
comprehensive video analysis. Thus motivated, we introduce Semantic
Multi-Object Tracking (SMOT), which aims to estimate object trajectories and
meanwhile understand semantic details of associated trajectories including
instance captions, instance interactions, and overall video captions,
integrating "where" and "what" for tracking. In order to foster the exploration
of SMOT, we propose BenSMOT, a large-scale Benchmark for Semantic MOT.
Specifically, BenSMOT comprises 3,292 videos with 151K frames, covering various
scenarios for semantic tracking of humans. BenSMOT provides annotations for the
trajectories of targets, along with associated instance captions in natural
language, instance interactions, and an overall caption for each video sequence.
To the best of our knowledge, BenSMOT is the first publicly available benchmark
for SMOT. In addition, to encourage future research, we present a novel tracker
named SMOTer, which is specially designed and end-to-end trained for SMOT, showing
promising performance. By releasing BenSMOT, we expect to go beyond
conventional MOT by predicting "where" and "what" for SMOT, opening up a new
direction in tracking for video understanding. Our BenSMOT and SMOTer will be
released.