Sparse Action Tube Detection.

IEEE Transactions on Image Processing: a publication of the IEEE Signal Processing Society (2024)

Abstract
Action tube detection is a challenging task: it requires not only locating action instances in each frame, but also linking them over time. Existing action tube detection methods often employ multi-stage pipelines with complex designs and time-consuming linking procedures. In this paper, we present a simple end-to-end action tube detection method, termed Sparse Tube Detector (STDet). Unlike dense action detectors, our core idea is to use a set of learnable tube queries and directly decode them from the video content into action tubes (i.e., sets of tracked boxes, each with an action label). This sparse detection paradigm offers several advantages. First, the large number of hand-crafted anchor candidates in dense action detectors is reduced to a small number of learnable tubes, which results in a more efficient detection framework. Second, our learnable tube queries attend directly to the whole video content, which gives our method the capacity to capture long-range information for action detection. Finally, our action detector performs end-to-end tube detection without a linking procedure: it directly and explicitly predicts the action boundary instead of depending on a linking strategy. Extensive experiments show that STDet outperforms previous state-of-the-art methods on two challenging untrimmed video action detection datasets, UCF101-24 and MultiSports. We hope our method will serve as a simple end-to-end tube detection baseline and inspire new ideas in this direction.
Keywords
Spatio-temporal Action Detection,Sparse Action Detector,Action Recognition
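
The sparse decoding idea described in the abstract can be illustrated with a toy sketch. This is not the authors' STDet implementation. It assumes a single cross-attention step, randomly initialized weights standing in for learned ones, boxes given as normalized (cx, cy, w, h), and a per-frame presence flag as one possible way to express the action boundary. All names (`decode_tubes`, `heads`, `rand_vec`) and all sizes are hypothetical.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def rand_vec(n):
    # Random values stand in for learned parameters (assumption).
    return [random.gauss(0.0, 1.0) for _ in range(n)]

def decode_tubes(queries, video_tokens, num_frames, num_classes, heads):
    """Decode each tube query into a label, per-frame boxes, and presence flags."""
    dim = len(queries[0])
    tubes = []
    for q in queries:
        # Cross-attention: the query attends to every token of the whole clip.
        scores = [dot(q, tok) / math.sqrt(dim) for tok in video_tokens]
        weights = softmax(scores)
        ctx = [sum(w * tok[i] for w, tok in zip(weights, video_tokens))
               for i in range(dim)]
        # Classification head: one action label per tube.
        class_probs = softmax([dot(w, ctx) for w in heads["cls"]])
        label = max(range(num_classes), key=lambda c: class_probs[c])
        # Box head: one normalized (cx, cy, w, h) box per frame.
        boxes = [[sigmoid(dot(w, ctx)) for w in heads["box"][t]]
                 for t in range(num_frames)]
        # Presence head: frames flagged True lie inside the action boundary,
        # so no separate frame-to-frame linking step is needed.
        present = [sigmoid(dot(heads["pres"][t], ctx)) > 0.5
                   for t in range(num_frames)]
        tubes.append({"label": label, "boxes": boxes, "present": present})
    return tubes

random.seed(0)
NUM_FRAMES, TOKENS_PER_FRAME, DIM = 4, 3, 8   # illustrative sizes only
NUM_QUERIES, NUM_CLASSES = 5, 3

video_tokens = [rand_vec(DIM) for _ in range(NUM_FRAMES * TOKENS_PER_FRAME)]
queries = [rand_vec(DIM) for _ in range(NUM_QUERIES)]
heads = {
    "cls": [rand_vec(DIM) for _ in range(NUM_CLASSES)],
    "box": [[rand_vec(DIM) for _ in range(4)] for _ in range(NUM_FRAMES)],
    "pres": [rand_vec(DIM) for _ in range(NUM_FRAMES)],
}

tubes = decode_tubes(queries, video_tokens, NUM_FRAMES, NUM_CLASSES, heads)
for i, tube in enumerate(tubes):
    print(i, tube["label"], tube["present"])
```

The number of candidate tubes equals the number of queries (five here), regardless of how many frames or spatial positions the clip has, which is the efficiency argument the abstract makes against dense anchor sets.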