Intra- and Inter-Action Understanding via Temporal Action Parsing

CVPR, pp. 727-736, 2020.

Keywords:
sub-action, temporal segmental network, new dataset, action recognition, different action

Abstract:

Current methods for action recognition primarily rely on deep convolutional networks to derive feature embeddings of visual and motion features. While these methods have demonstrated remarkable performance on standard benchmarks, we are still in need of a better understanding as to how the videos, in particular their internal structures...

Introduction
  • Action understanding is a central topic in computer vision, which benefits a number of real-world applications, including video captioning [54], video retrieval [17, 41] and vision-based robotics [32].
  • Action labels provided by humans are sometimes ambiguous and inconsistent, e.g. open/close fridge are treated as the same action while pour milk/oil belong to different actions.
  • Such an issue could become more severe when the authors deal with sub-actions, since sub-actions differ from each other in subtler ways than actions do.
  • In more general cases, ensuring a consistent labeling scheme across sub-actions may be infeasible, considering the scale of a dataset.
Highlights
  • Action understanding is a central topic in computer vision, which benefits a number of real-world applications, including video captioning [54], video retrieval [17, 41] and vision-based robotics [32]
  • We further develop an improved framework for temporal action parsing on TAPOS (Temporal Action Parsing of Olympics Sports), inspired by the recently proposed Transformer [50]
  • To decompose an action instance into a set of sub-actions without knowing the possible sub-action categories, we develop a data-driven way to discover the distinct patterns of different sub-actions, as shown in Fig. 5
  • In this paper we propose a new dataset called TAPOS, which digs into the internal structures of action instances, to encourage the exploration of the hierarchical nature of human actions
  • In TAPOS, we provide each instance with a class label and a high-quality temporal parsing annotation at the granularity of sub-actions, which is found to be beneficial for sophisticated action understanding
  • With the help of automatically identified patterns, TransParser successfully reveals the internal structure of action instances, as well as the connections among different action categories
Methods
  • Due to the connections between temporal action parsing and other tasks, such as temporal action segmentation [6, 18, 26] and action detection [31], the authors select representative methods from these tasks and adapt them, with several modifications, to temporal action parsing for comparison.
  • A sub-action is detected once its output exceeds a certain threshold θc, e.g. 0.5.
  • Temporal action segmentation aims at labeling each frame of an action instance with a set of pre-defined sub-actions.
  • The authors select two representative methods, namely Iterative Soft Boundary Assignment (ISBA) [6] and Connectionist Temporal Modeling (CTM) [18].
  • The loss is changed to the sum of log-likelihoods over all possible labelings, since all k distinct, randomly sampled sub-actions could constitute a possible solution.
  • The authors use simple best-path decoding, i.e. concatenating the most active outputs at every timestamp; both this decoding and the thresholded detection above are sketched below.
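The two readouts described in the bullets above can be illustrated with a minimal sketch; this is not the authors' implementation, and the array shapes, toy probabilities, and optional blank class are assumptions for illustration only.

```python
# Minimal sketch (not the authors' code) of the two baseline readouts:
# (a) threshold-based detection, (b) CTC-style best-path decoding.
# `probs` is assumed to be a (T, K) array of per-frame sub-action probabilities.
import numpy as np

def detect_by_threshold(probs, theta_c=0.5):
    """Action-detection-style readout: report any sub-action class whose
    per-frame output exceeds the threshold theta_c at some timestamp."""
    detected = np.unique(np.where(probs > theta_c)[1])
    return detected.tolist()

def best_path_decode(probs, blank=None):
    """Best-path decoding: take the most active output at every timestamp,
    then collapse consecutive repeats (and drop an optional blank class)."""
    path = probs.argmax(axis=1)
    decoded = []
    for k in path:
        if k == blank:
            continue
        if not decoded or decoded[-1] != k:
            decoded.append(int(k))
    return decoded

# Toy usage: T=6 frames, K=3 hypothetical sub-action classes.
probs = np.array([[0.8, 0.1, 0.1],
                  [0.7, 0.2, 0.1],
                  [0.2, 0.6, 0.2],
                  [0.1, 0.7, 0.2],
                  [0.1, 0.2, 0.7],
                  [0.1, 0.1, 0.8]])
print(detect_by_threshold(probs))   # [0, 1, 2]
print(best_path_decode(probs))      # [0, 1, 2]
```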
Results
  • To decompose an action instance into a set of sub-actions without knowing the possible sub-action categories, the authors develop a data-driven way to discover the distinct patterns of different sub-actions, as shown in Fig. 5.
  • Each feature ft is refined by a Soft-Pattern-Strengthen (SPS) unit.
  • The SPS unit maintains a parametric pattern miner φ to learn distinct characteristics of sub-actions, which can be used to regularize the input feature, amplifying its discriminative patterns; a simplified sketch follows below.
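As a hedged illustration of how such a pattern miner could strengthen per-frame features, the sketch below assumes the miner φ is a bank of learned pattern vectors, that features are matched to it by dot product, and that the soft-matched pattern mixture is added back to the feature. This is an assumption-laden sketch, not the actual TransParser/SPS implementation.

```python
# Numpy sketch of a Soft-Pattern-Strengthen-style refinement (illustrative only).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sps_refine(features, patterns, alpha=0.5):
    """features: (T, D) per-frame embeddings f_t.
    patterns: (P, D) learned pattern bank (playing the role of the miner phi).
    Each f_t is strengthened by the patterns it matches most strongly,
    using a soft assignment over the pattern bank."""
    sim = features @ patterns.T          # (T, P) matching scores
    weights = softmax(sim, axis=1)       # soft assignment to patterns
    matched = weights @ patterns         # (T, D) pattern mixture per frame
    return features + alpha * matched    # amplify the matched patterns

# Toy usage with random data: 8 frames, 16-d features, 4 patterns.
rng = np.random.default_rng(0)
f = rng.normal(size=(8, 16))
phi = rng.normal(size=(4, 16))
print(sps_refine(f, phi).shape)  # (8, 16)
```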
Conclusion
  • In this paper the authors propose a new dataset called TAPOS, which digs into the internal structures of action instances, to encourage the exploration of the hierarchical nature of human actions.
  • In TAPOS, the authors provide each instance with a class label, and a high-quality temporal parsing annotation at the granularity of sub-actions, which is found to be beneficial for sophisticated action understanding.
  • The authors propose an improved method, TransParser, for action parsing, which is capable of identifying underlying patterns of sub-actions without knowing the categorical labels.
  • With the help of automatically identified patterns, TransParser successfully reveals the internal structure of action instances, as well as the connections among different action categories.
Tables
  • Table1: Comparison of performance on action classification using different sampling schemes for TSN. Both the top-1 and overall accuracies are reported
  • Table2: Temporal action parsing results on the proposed TAPOS dataset measured by average F1-score
  • Table3: Temporal action parsing results of TransParser under different settings. The average F1, recall and precision are calculated across d ∈ {5, 10, · · · , 50} (see the metric sketch after this list)
  • Table4: Performance of TSN [53] on action classification using different sampling schemes
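For reference, the boundary-level precision, recall and F1 reported in Tables 2 and 3 can be computed roughly as in the sketch below, assuming d is a temporal distance threshold and that predicted boundaries are greedily matched one-to-one to ground-truth boundaries; the paper's exact matching protocol may differ.

```python
# Hedged sketch of boundary precision/recall/F1 at a distance threshold d.
import numpy as np

def boundary_f1(pred, gt, d):
    """pred, gt: lists of boundary positions (e.g. frame indices).
    A prediction counts as a true positive if it lies within d of a
    still-unmatched ground-truth boundary (each matched at most once)."""
    gt = list(gt)
    matched = [False] * len(gt)
    tp = 0
    for p in pred:
        dists = [abs(p - g) if not m else float("inf")
                 for g, m in zip(gt, matched)]
        if dists and min(dists) <= d:
            matched[int(np.argmin(dists))] = True
            tp += 1
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gt) if gt else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Average over the thresholds reported in the tables, d in {5, 10, ..., 50}.
pred, gt = [12, 48, 90], [10, 50, 95, 130]
scores = [boundary_f1(pred, gt, d) for d in range(5, 55, 5)]
print(np.mean(scores, axis=0))  # mean (precision, recall, F1) across d
```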
Related work
  • Datasets. Being an important task in computer vision, various datasets have been collected for action understanding, which can be roughly divided into three categories. Datasets in the first category provide only class labels, including early attempts (e.g. KTH [25], Weizmann [3], UCFSports [39], Olympic [34]) of limited scale and diversity, and succeeding benchmarks (e.g. UCF101 [45], HMDB51 [24], Sports1M [20], and Kinetics [21]) that better fit the needs of deep learning methods. However, despite the increasing number of action instances covered, these datasets do not provide more sophisticated annotations. In the second category, datasets provide boundaries of actions in untrimmed videos. Specifically, videos in THUMOS’15 [15] contain action instances of 20 sport classes, and daily activities are included in ActivityNet [7] and Charades [43]. Other datasets in this category include HACS [57] and AVA [16]. Although these datasets are all annotated with temporal boundaries, they focus on the location of an action in an untrimmed video. Instead, we intend to provide boundaries inside action instances, revealing their internal structures.
Funding
  • This work is partially supported by the SenseTime Collaborative Grant on Large-scale Multi-modality Analysis and the General Research Fund (GRF) of Hong Kong (No. 14203518 and No. 14205719).
Reference
  • Jean-Baptiste Alayrac, Ivan Laptev, Josef Sivic, and Simon Lacoste-Julien. Joint discovery of object states and manipulation actions. In ICCV, pages 2127–2136, 2017. 2
  • Evlampios Apostolidis and Vasileios Mezaris. Fast shot segmentation combining global and local visual descriptors. In ICASSP, pages 6583–6587. IEEE, 2014. 3
  • Moshe Blank, Lena Gorelick, Eli Shechtman, Michal Irani, and Ronen Basri. Actions as space-time shapes. In ICCV, pages 1395–1402. IEEE, 2005. 2
  • Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, pages 6299–6308, 2017. 2
  • Bo Dai, Yuqi Zhang, and Dahua Lin. Detecting visual relationships with deep relational networks. In CVPR, pages 3076–3086, 2017. 2
  • Li Ding and Chenliang Xu. Weakly-supervised action segmentation with iterative soft boundary assignment. In CVPR, pages 6508–6516, 2018. 1, 6
  • Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. ActivityNet: A large-scale video benchmark for human activity understanding. In CVPR, pages 961–970, 2015. 2
  • Hao-Shu Fang, Jinkun Cao, Yu-Wing Tai, and Cewu Lu. Pairwise body-part attention for recognizing human-object interactions. In European Conference on Computer Vision, pages 51–67, 2018.
  • Yazan Abu Farha and Jurgen Gall. MS-TCN: Multi-stage temporal convolutional network for action segmentation. In CVPR, June 2019.
  • Alireza Fathi, Xiaofeng Ren, and James M Rehg. Learning to recognize objects in egocentric activities. In CVPR, pages 3281–3288. IEEE, 2011. 1, 2
  • Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for video recognition. arXiv preprint arXiv:1812.03982, 2018. 2
  • Christoph Feichtenhofer, Axel Pinz, Richard P Wildes, and Andrew Zisserman. What have we learned from deep representations for action recognition? In CVPR, pages 7844–7853, 2018. 4
  • Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. Convolutional two-stream network fusion for video action recognition. In CVPR, pages 1933–1941, 2016. 1, 2
  • Yixin Gao, S Swaroop Vedula, Carol E Reiley, Narges Ahmidi, Balakrishnan Varadarajan, Henry C Lin, Lingling Tao, Luca Zappella, Benjamín Béjar, David D Yuh, et al. JHU-ISI gesture and skill assessment working set (JIGSAWS): A surgical activity dataset for human motion modeling. In MICCAI Workshop: M2CAI, volume 3, page 3, 2014.
  • A. Gorban, H. Idrees, Y.-G. Jiang, A. Roshan Zamir, I. Laptev, M. Shah, and R. Sukthankar. THUMOS challenge: Action recognition with a large number of classes. http://www.thumos.info/, 2015. 2
  • Chunhui Gu, Chen Sun, David A Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, et al. Ava: A video dataset of spatio-temporally localized atomic visual actions. In CVPR, pages 6047–6056, 2018. 2
  • Weiming Hu, Nianhua Xie, Li Li, Xianglin Zeng, and Stephen Maybank. A survey on visual content-based video indexing and retrieval. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 41(6):797–819, 2011. 1
  • De-An Huang, Li Fei-Fei, and Juan Carlos Niebles. Connectionist temporal modeling for weakly supervised action labeling. In European Conference on Computer Vision, pages 137–153. Springer, 2016. 6
  • Sergey Ioffe and Christian Szegedy. Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML, pages 448–456. JMLR, 2015. 5
  • Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, pages 1725–1732, 2014. 1, 2
  • Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017. 1, 2
  • Hilde Kuehne, Ali Arslan, and Thomas Serre. The language of actions: Recovering the syntax and semantics of goaldirected human activities. In CVPR, pages 780–787, 2014. 1, 2
  • Hilde Kuehne, Juergen Gall, and Thomas Serre. An end-to-end generative framework for video segmentation and recognition. IEEE Winter Conference on Applications of Computer Vision, Mar 2016. 1
  • Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre. HMDB: a large video database for human motion recognition. In International Conference on Computer Vision, pages 2556–2563. IEEE, 2011. 1, 2
  • Christian Schüldt, Ivan Laptev, and Barbara Caputo. Recognizing human actions: a local SVM approach. In International Conference on Pattern Recognition, pages 32–36. IEEE, 2004. 2
  • Colin Lea, Michael D Flynn, Rene Vidal, Austin Reiter, and Gregory D Hager. Temporal convolutional networks for action segmentation and detection. In CVPR, pages 156–165, 2017. 2, 6
  • Colin Lea, Austin Reiter, Rene Vidal, and Gregory D Hager. Segmental spatiotemporal cnns for fine-grained action segmentation. In European Conference on Computer Vision, pages 36–52. Springer, 2016. 1, 2
  • Peng Lei and Sinisa Todorovic. Temporal deformable residual networks for action segmentation in videos. In CVPR, pages 6742–6751, 2018. 2
  • Jun Li, Peng Lei, and Sinisa Todorovic. Weakly supervised energy-based learning for action segmentation. In ICCV, pages 6243–6251, 2019. 2
  • Yong-Lu Li, Siyuan Zhou, Xijie Huang, Liang Xu, Ze Ma, Hao-Shu Fang, Yanfeng Wang, and Cewu Lu. Transferable interactiveness knowledge for human-object interaction detection. In CVPR, 2019. 2
  • Tianwei Lin, Xu Zhao, Haisheng Su, Chongjing Wang, and Ming Yang. Bsn: Boundary sensitive network for temporal action proposal generation. In European Conference on Computer Vision, pages 3–19, 2018. 1, 2, 6
  • Maja J Mataric. Sensory-motor primitives as a basis for imitation: Linking perception to action and biology to robotics. In Imitation in animals and artifacts, pages 391–422. MIT Press, 2002. 1
  • Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. arXiv preprint arXiv:1906.03327, 2019. 1
  • Juan Carlos Niebles, Chih-Wei Chen, and Li Fei-Fei. Modeling temporal structure of decomposable motion segments for activity classification. In European Conference on Computer Vision, pages 392–405. Springer, 2010. 2
  • Dan Oneata, Jakob Verbeek, and Cordelia Schmid. Action and event recognition with fisher vectors on a compact feature set. In ICCV, pages 1817–1824, 2013. 2
  • Hamed Pirsiavash and Deva Ramanan. Parsing videos of actions with segmental grammars. In CVPR, pages 612–619, 2014. 2
  • Alexander Richard, Hilde Kuehne, and Juergen Gall. Weakly supervised action learning with rnn based fine-to-coarse modeling. In CVPR, pages 754–763, 2017. 2
  • Alexander Richard, Hilde Kuehne, and Juergen Gall. Action sets: Weakly supervised action segmentation without ordering constraints. In CVPR, pages 5987–5996, 2018. 2
  • Mikel D Rodriguez, Javed Ahmed, and Mubarak Shah. Action MACH: a spatio-temporal maximum average correlation height filter for action recognition. In CVPR, pages 1–8. IEEE, 2008. 2
  • Marcus Rohrbach, Sikandar Amin, Mykhaylo Andriluka, and Bernt Schiele. A database for fine grained activity detection of cooking activities. In CVPR, pages 1194–1201. IEEE, 2012. 1, 2
  • Dian Shao, Yu Xiong, Yue Zhao, Qingqiu Huang, Yu Qiao, and Dahua Lin. Find and focus: Retrieve and localize video events with natural language queries. In European Conference on Computer Vision, pages 200–216, 2018. 1
  • Dian Shao, Yue Zhao, Bo Dai, and Dahua Lin. Finegym: A hierarchical video dataset for fine-grained action understanding. In CVPR, 2020. 2
  • Gunnar A Sigurdsson, Gul Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. In European Conference on Computer Vision, pages 510–526. Springer, 2016. 2
  • Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in neural information processing systems, pages 568–576, 2014. 2
  • Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012. 1, 2
  • Sebastian Stein and Stephen J McKenna. Combining embedded accelerometers with computer vision for recognizing food preparation activities. In ACM international joint conference on Pervasive and ubiquitous computing, pages 729–738. ACM, 2013. 1, 2, 8
  • Yansong Tang, Dajun Ding, Yongming Rao, Yu Zheng, Danyang Zhang, Lili Zhao, Jiwen Lu, and Jie Zhou. Coin: A large-scale dataset for comprehensive instructional video analysis. In CVPR, pages 1207–1216, 2019. 1
  • Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In ICCV, pages 4489–4497, 2015. 1, 2
  • Gul Varol, Ivan Laptev, and Cordelia Schmid. Long-term temporal convolutions for action recognition. IEEE transactions on pattern analysis and machine intelligence, 40(6):1510–1517, 2018. 2
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017. 2, 5, 6
  • Heng Wang and Cordelia Schmid. Action recognition with improved trajectories. In ICCV, pages 3551–3558, 2013. 1, 2
  • Limin Wang, Yu Qiao, and Xiaoou Tang. Latent hierarchical model of temporal structure for complex activity classification. IEEE Transactions on Image Processing, 23(2):810–822, 2013. 2
  • Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, pages 20–36. Springer, 2016. 1, 4, 7
  • Yilei Xiong, Bo Dai, and Dahua Lin. Move forward and tell: A progressive generator of video descriptions. In European Conference on Computer Vision, pages 468–483, 2018. 1
  • Huijuan Xu, Abir Das, and Kate Saenko. R-c3d: Region convolutional 3d network for temporal activity detection. In ICCV, pages 5783–5792, 2017. 2
  • Ceyuan Yang, Yinghao Xu, Jianping Shi, Bo Dai, and Bolei Zhou. Temporal pyramid network for action recognition. In CVPR, 2020. 1, 2
  • Hang Zhao, Zhicheng Yan, Lorenzo Torresani, and Antonio Torralba. Hacs: Human action clips and segments dataset for recognition and temporal localization. arXiv preprint arXiv:1712.09374, 2019. 2
  • Yue Zhao, Yuanjun Xiong, Limin Wang, Zhirong Wu, Xiaoou Tang, and Dahua Lin. Temporal action detection with structured segment networks. In ICCV, pages 2914–2923, 2017. 1, 2