Context-aware and Scale-insensitive Temporal Repetition Counting

CVPR, pp. 667-675, 2020.

Keywords:
fine cycle, time scale, state-of-the-art method, different action, periodic motion detection (16 more)

Abstract:

Temporal repetition counting aims to estimate the number of cycles of a given repetitive action. Existing deep learning methods assume repetitive actions are performed at a fixed time-scale, which is invalid for the complex repetitive actions in real life. In this paper, we tailor a context-aware and scale-insensitive framework to tackle...

Introduction
  • Human activities commonly involve repetitive actions. Temporal repetition counting aims to count the number of repetitive actions in a video [7, 14, 21, 26].
  • Repetition analysis has also been explored as an auxiliary cue for other video analysis applications, such as cardiac and respiratory signal recovery [16], pedestrian detection [22], 3D reconstruction [15, 24], and camera calibration [11]
  • This is a challenging problem as repetitive actions exhibit inherently different action patterns.
  • The difficulty in detecting these two repetitions is that their cycle lengths vary greatly, so it is invalid to make restrictive assumptions about the time-scale of cycle lengths across actions
Highlights
  • Human activities commonly involve repetitive actions
  • Temporal repetition counting is a problem that aims to count the number of repetitive actions in a video [7, 14, 21, 26]
  • Repetition analysis has also been explored as an auxiliary cue for other video analysis applications, such as cardiac and respiratory signal recovery [16], pedestrian detection [22], 3D reconstruction [15, 24], and camera calibration [11]
  • We present a novel context-aware and scale-insensitive framework for temporal repetition counting
  • To tackle the challenges posed by the diverse cycle lengths between videos and within repetitions, we propose a coarse-to-fine cycle refinement scheme (see the sketch after this list)
  • We further propose a context-aware regression network to learn contextual features for recognizing previous and future repetitions
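This summary does not spell out the refinement loop, so the following is a minimal Python sketch of one plausible reading: `regress` is a hypothetical stand-in for the learned context-aware regression network, and the choice to double the number of sampled positions at each stage is an illustrative assumption, not the authors' exact schedule.

```python
import numpy as np

def coarse_to_fine_count(regress, num_frames, num_stages=5):
    """regress(position, scale_hint) -> cycle length in frames.
    In the paper this role is played by the learned context-aware
    regression network; here it is an injected callable (assumption)."""
    # Stage 1: a single sampled position, searched over a wide range,
    # so every frame initially shares one coarse estimate.
    positions = np.array([num_frames // 2])
    lengths = np.array([regress(positions[0], None)], dtype=float)

    for stage in range(1, num_stages):
        # Illustrative schedule: each stage doubles the sampling density
        # and warm-starts every new position from its nearest old estimate.
        new_positions = np.linspace(0, num_frames - 1, 2 ** stage + 1).astype(int)
        hints = lengths[np.abs(new_positions[:, None] - positions[None, :]).argmin(1)]
        lengths = np.array([regress(p, h) for p, h in zip(new_positions, hints)])
        positions = new_positions

    # Each frame inherits the cycle length of its closest sampled position;
    # the count accumulates the instantaneous frequency 1 / length.
    frames = np.arange(num_frames)
    per_frame = lengths[np.abs(frames[:, None] - positions[None, :]).argmin(1)]
    return float((1.0 / per_frame).sum())

# Toy usage: a constant 30-frame cycle over a 300-frame video -> 10 repetitions.
print(coarse_to_fine_count(lambda p, hint: 30.0, num_frames=300))
```

In the toy usage, a constant 30-frame cycle over a 300-frame video yields a count of 10, since the count accumulates 1 / cycle_length over all frames.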
Results
  • To show the process of coarse-to-fine refinement, the authors visualize the predictions of the 1st, 3rd, and 5th stages over a video from the QUVA Repetition dataset in Figure 5.
  • The authors set each repetition prediction equal to the rounded mean value of the cycle length from the closest sampled position (see the sketch after this list).
  • The estimation is identical at all positions in stage 1, since only a single position is sampled at that stage
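As a concrete reading of the two bullets above, here is a small hypothetical helper that assigns every frame the rounded cycle length of its closest sampled position; with a single sampled position, as in stage 1, every frame necessarily receives the same estimate.

```python
import numpy as np

def per_frame_cycle_lengths(positions, lengths, num_frames):
    """Assign every frame the rounded cycle length of its closest
    sampled position (hypothetical helper mirroring the visualization)."""
    positions = np.asarray(positions)
    lengths = np.asarray(lengths, dtype=float)
    frames = np.arange(num_frames)
    nearest = np.abs(frames[:, None] - positions[None, :]).argmin(axis=1)
    return np.rint(lengths[nearest]).astype(int)

# Stage 1 samples a single position, so all 300 frames share one estimate.
print(np.unique(per_frame_cycle_lengths([150], [32.4], 300)))  # -> [32]
```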
Conclusion
  • The authors present a novel context-aware and scale-insensitive framework for temporal repetition counting.
  • Instead of detecting repetitions at fixed time-scales, the authors first search the time-scale over a wide range locally and then refine the scales for each temporal location in a coarse-to-fine manner.
  • The proposed network is designed to extract context-aware features from two consecutive repetitions, and an anchor-based backend is tailored for detecting double- or half-cycle errors (see the sketch after this list).
  • The proposed temporal repetition counting framework is evaluated and compared with state-of-the-art methods, achieving better results on the existing benchmarks as well as the newly proposed dataset
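The summary only names the anchor-based backend, so the sketch below is a heavily hedged illustration: we assume the backend scores candidate scales {0.5, 1.0, 2.0} relative to the regressed cycle length and keeps the best-scoring one, which is one simple way to detect and undo double- or half-cycle errors.

```python
import numpy as np

def resolve_scale_ambiguity(base_length, anchor_scores):
    """Pick among half-, single-, and double-length anchors.
    anchor_scores are assumed classifier confidences for the candidate
    scales {0.5, 1.0, 2.0} applied to the regressed cycle length."""
    anchors = np.array([0.5, 1.0, 2.0])
    return base_length * anchors[int(np.argmax(anchor_scores))]

# If the half-length anchor scores highest, a regressed 40-frame cycle
# is corrected to 20 frames, undoing a double-length (half-count) error.
print(resolve_scale_ambiguity(40.0, [0.7, 0.2, 0.1]))  # -> 20.0
```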
Tables
  • Table 1: Dataset statistics of YTsegments [14], QUVA Repetition [25], and the proposed UCFRep. Our dataset is larger than the previous datasets in terms of the number of videos, total duration, and number of annotations. The wide range of cycle lengths between videos and the large variation within each video also indicate that our benchmark is more challenging. The cycle variation is the average, over videos, of the absolute difference between the minimum and maximum cycle length divided by the average cycle length (see the sketch after this list)
  • Table 2: Comparison with existing methods on YTsegments, QUVA Repetition, and UCFRep for temporal repetition counting. The method marked with ∗ is our re-implementation trained on the UCFRep benchmark
  • Table 3: Ablation study of the proposed coarse-to-fine refinement method on the UCFRep benchmark validation set
  • Table 4: Ablation study of the proposed context-aware estimation network on the UCFRep benchmark validation set
  • Table 5: Performance variations with respect to different action classes on the UCFRep benchmark validation set
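The cycle-variation metric in the Table 1 caption is simple enough to state in code; the sketch below assumes each video is given as its list of annotated cycle lengths in frames.

```python
import numpy as np

def cycle_variation(videos_cycle_lengths):
    """Table 1 metric: per video, |max - min| cycle length divided by the
    mean cycle length, averaged over all videos in the dataset."""
    ratios = [
        (max(c) - min(c)) / (sum(c) / len(c))
        for c in videos_cycle_lengths
    ]
    return float(np.mean(ratios))

# Two toy videos: one nearly stationary, one with large variation.
print(cycle_variation([[30, 31, 30], [20, 40, 60]]))  # ~0.52
```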
Related work
  • A typical solution for temporal repetition counting is to convert the motion field into a one-dimensional signal and then recover the repetition structure from the signal's period [1, 13, 19, 20, 30]. The mainstream of these methods obtains the repetition frequency with Fourier analysis [2, 3, 7, 21]; others detect the cycle by filtering [4], peak detection [29], classification [8], or singular value decomposition [6]. These methods assume the repetition being estimated is periodic, so they cannot handle non-stationary repetitions. A recent work [26] addresses this limitation and proposes a novel inference scheme to detect non-stationary actions. However, it only adopts the motion field to extract features for analysis, ignoring context dependency in the semantic domain. A minimal sketch of the classic Fourier baseline follows.
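To make the stationarity limitation concrete, here is a minimal sketch of the Fourier-analysis baseline referenced in [2, 3, 7, 21]: it reduces motion to a one-dimensional signal and reads a single dominant period off the spectrum, which is exactly why it breaks down when the cycle length drifts within a video.

```python
import numpy as np

def dominant_period(signal, fps):
    """Estimate one global repetition period from a 1-D motion signal
    via the dominant peak of its Fourier spectrum."""
    signal = np.asarray(signal, dtype=float)
    signal = signal - signal.mean()            # drop the DC component
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fps)
    peak = freqs[np.argmax(spectrum[1:]) + 1]  # skip the zero-frequency bin
    return 1.0 / peak                          # period in seconds

# Toy example: a 2 Hz oscillation sampled at 30 fps -> 0.5 s period.
t = np.arange(0, 10, 1 / 30)
print(round(dominant_period(np.sin(2 * np.pi * 2 * t), fps=30), 2))
```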
Funding
  • The work is supported by NSFC (Grant Nos. 61772206, U1611461, 61472145, 61702194, 61972162), Guangdong R&D Key Project of China (Grant Nos. 2018B010107003, 2020B010165004, 2020B010166003), Guangdong High-level Personnel Program (Grant No. 2016TQ03X319), Guangdong NSF (Grant No. 2017A030311027), Guangzhou Key Project in Industrial Technology (Grant Nos. 201802010027, 201802010036), and the CCF-Tencent Open Research Fund (CCF-Tencent RAGR20190112)
References
  • [1] A Branzan Albu, Robert Bergevin, and Sebastien Quirion. Generic temporal segmentation of cyclic human motion. Pattern Recognition, 41(1):6–21, 2008.
  • [2] Ousman Azy and Narendra Ahuja. Segmentation of periodically moving objects. In ICPR, pages 1–4, 2008.
  • [3] Alexia Briassouli and Narendra Ahuja. Extraction and analysis of multiple periodic motions in video sequences. IEEE TPAMI, 29(7):1244–1261, 2007.
  • [4] Gertjan J Burghouts and J-M Geusebroek. Quasi-periodic spatiotemporal filtering. IEEE TIP, 15(6):1572–1582, 2006.
  • [5] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. ActivityNet: A large-scale video benchmark for human activity understanding. In CVPR, pages 961–970, 2015.
  • [6] Dmitry Chetverikov and Sandor Fazekas. On motion periodicity of dynamic textures. In BMVC, pages 167–176, 2006.
  • [7] Ross Cutler and Larry S. Davis. Robust real-time periodic motion detection, analysis, and applications. IEEE TPAMI, 22(8):781–796, 2000.
  • [8] James Davis, Aaron Bobick, and Whitman Richards. Categorical representation and recognition of oscillatory motion patterns. In CVPR, pages 628–635, 2000.
  • [9] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pages 580–587, 2014.
  • [10] Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In CVPR, pages 6546–6555, 2018.
  • [11] Shiyao Huang, Xianghua Ying, Jiangpeng Rong, Zeyu Shang, and Hongbin Zha. Camera calibration from periodic motion of a pedestrian. In CVPR, pages 3025–3033, 2016.
  • [12] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • [13] Ivan Laptev, Serge J Belongie, Patrick Perez, and Josh Wills. Periodic motion detection and segmentation via approximate sequence alignment. In ICCV, pages 816–823, 2005.
  • [14] Ofir Levy and Lior Wolf. Live repetition counting. In ICCV, pages 3020–3028, 2015.
  • [15] Xiu Li, Hongdong Li, Hanbyul Joo, Yebin Liu, and Yaser Sheikh. Structure from recurrent motion: From rigidity to recurrency. In CVPR, pages 3032–3040, 2018.
  • [16] Xiaoxiao Li, Vivek Singh, Yifan Wu, Klaus Kirchberg, James Duncan, and Ankur Kapoor. Repetitive motion estimation network: Recover cardiac and respiratory signal from thoracic imaging. arXiv preprint arXiv:1811.03343, 2018.
  • [17] Tianwei Lin, Xu Zhao, Haisheng Su, Chongjing Wang, and Ming Yang. BSN: Boundary sensitive network for temporal action proposal generation. In ECCV, pages 3–19, 2018.
  • [18] Fuchen Long, Ting Yao, Zhaofan Qiu, Xinmei Tian, Jiebo Luo, and Tao Mei. Gaussian temporal awareness networks for action localization. In CVPR, 2019.
  • [19] ChunMei Lu and Nicola J Ferrier. Repetitive motion analysis: Segmentation and event classification. IEEE TPAMI, 26(2):258–263, 2004.
  • [20] Costas Panagiotakis, Giorgos Karvounas, and Antonis Argyros. Unsupervised detection of periodic segments in videos. In ICIP, pages 923–927, 2018.
  • [21] Erik Pogalin, Arnold WM Smeulders, and Andrew HC Thean. Visual quasi-periodicity. In CVPR, pages 1–8, 2008.
  • [22] Yang Ran, Isaac Weiss, Qinfen Zheng, and Larry S Davis. Pedestrian detection via periodic motion analysis. IJCV, 71(2):143–160, 2007.
  • [23] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS, pages 91–99, 2015.
  • [24] Evan Ribnick and Nikolaos Papanikolopoulos. 3D reconstruction of periodic motion from a single view. IJCV, 90(1):28–44, 2010.
  • [25] Tom FH Runia, Cees GM Snoek, and Arnold WM Smeulders. Real-world repetition estimation by div, grad and curl. In CVPR, pages 9009–9017, 2018.
  • [26] Tom FH Runia, Cees GM Snoek, and Arnold WM Smeulders. Repetition estimation. IJCV, 127(9):1361–1383, 2019.
  • [27] Zheng Shou, Dongang Wang, and Shih-Fu Chang. Temporal action localization in untrimmed videos via multi-stage CNNs. In CVPR, pages 1049–1058, 2016.
  • [28] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human action classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
  • [29] Ashwin Thangali and Stan Sclaroff. Periodic motion detection and estimation via space-time sampling. In WACV, pages 176–182, 2005.
  • [30] Christopher J Tralie and Jose A Perea. (Quasi) periodicity quantification in video data, using topology. SIAM Journal on Imaging Sciences, 11(2):1049–1077, 2018.
  • [31] Saining Xie, Ross Girshick, Piotr Dollar, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In CVPR, pages 1492–1500, 2017.