
Real-Time Hand Gesture Detection And Classification Using Convolutional Neural Networks

2019 14th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2019), pages 407–414.

Cited by: 26 | Views: 160

Abstract

Real-time recognition of dynamic hand gestures from video streams is a challenging task since (i) there is no indication when a gesture starts and ends in the video, (ii) performed gestures should only be recognized once, and (iii) the entire architecture should be designed considering the memory and power budget. In this work, we address...

Introduction
  • Computers and computing devices are becoming an essential part of our lives day by day.
  • The increasing demand for such computing devices has increased the need for easy and practical computer interfaces.
  • For this reason, systems using vision-based interaction and control are becoming more common, and as a result gesture recognition is becoming increasingly popular in the research community due to its many application possibilities in human-machine interaction.
  • Gesture recognition approaches operate either on raw video data or on extracted hand keypoints; the second approach requires the extra step of hand-keypoint extraction, which brings additional computational cost.
Highlights
  • Computers and computing devices are becoming an essential part of our lives day by day
  • In order to provide a practical solution, we have developed a vision-based gesture recognition approach using deep convolutional neural networks (CNNs) on raw video data
  • This paper presents a novel two-model hierarchical architecture for real-time hand gesture recognition systems
  • The proposed architecture provides resource efficiency, early detections and single-time activations, which are critical for real-time gesture recognition applications
  • The proposed approach is evaluated on two dynamic hand gesture datasets, and achieves similar results for both of them
  • For real-time evaluation, we have proposed to use a new metric, Levenshtein accuracy, which we believe is a suitable evaluation metric since it can measure misclassifications, multiple detections and missing detections at the same time
Methods
  • The authors elaborate on the two-model hierarchical architecture that enables state-of-the-art CNN models to be used in real-time gesture recognition applications as efficiently as possible (a minimal sketch follows this list).
  • The authors give a detailed explanation of the post-processing strategies that allow them to have a single-time activation per gesture in real time.
  • With the availability of large datasets, CNN based models have proven their ability in action/gesture recognition tasks.
  • There is no clear description of how to use these models in a real-time dynamic system.
  • The authors aim to fill this research gap.
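To make the two-model pipeline concrete, here is a minimal sketch of how a lightweight detector and a heavy classifier could be chained over a video stream. It assumes PyTorch and 3D-CNN models that take (batch, channels, time, height, width) clips; the window sizes, the detection threshold, and the name `run_online` are illustrative placeholders, not values from the paper.

```python
from collections import deque

import torch
import torch.nn as nn


def run_online(frames, detector: nn.Module, classifier: nn.Module,
               det_window: int = 8, clf_window: int = 32,
               det_threshold: float = 0.5):
    """Run the cheap detector on every sliding window; invoke the
    expensive classifier only while a gesture is flagged as present."""
    det_buf = deque(maxlen=det_window)   # short frame buffer for the detector
    clf_buf = deque(maxlen=clf_window)   # longer frame buffer for the classifier
    for frame in frames:                 # frame: (3, H, W) float tensor
        det_buf.append(frame)
        clf_buf.append(frame)
        if len(det_buf) < det_window:
            continue                     # not enough frames buffered yet
        clip = torch.stack(tuple(det_buf), dim=1).unsqueeze(0)   # (1, 3, T, H, W)
        p_gesture = torch.softmax(detector(clip), dim=1)[0, 1]   # P(gesture present)
        if p_gesture > det_threshold and len(clf_buf) == clf_window:
            clip = torch.stack(tuple(clf_buf), dim=1).unsqueeze(0)
            yield torch.softmax(classifier(clip), dim=1)[0]      # class probabilities
```

The hierarchy shows up directly in the control flow: the lightweight detector runs at every step, while the heavy classifier is evaluated only when the detector signals activity, which is where the resource efficiency comes from.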
Results

  • Considering these scenarios, the authors propose to use the Levenshtein distance as the evaluation metric for online experiments.
  • True: [1, 2, 3, 4, 5, 6, 7, 8, 9]; Predicted: [1, 2, 7, 4, 5, 6, 6, 7, 8, 9]
  • For this example, the Levenshtein distance is 2: the deletion of one of the "6"s, which is detected twice, and the substitution of "7" with "3".
  • The authors average this distance over the number of true target classes (a sketch of the computation follows this list).
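A plain dynamic-programming implementation of the metric is short enough to show in full. This is a minimal sketch (the helper name and the percentage formatting are ours); it reproduces the example above, where the distance is 2 and the resulting Levenshtein accuracy is (1 - 2/9) ≈ 77.8%.

```python
def levenshtein(pred, true):
    """Minimum number of insertions, deletions and substitutions
    turning the predicted gesture sequence into the true one."""
    m, n = len(pred), len(true)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                       # delete all remaining predictions
    for j in range(n + 1):
        d[0][j] = j                       # insert all remaining targets
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == true[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution (or match)
    return d[m][n]


true = [1, 2, 3, 4, 5, 6, 7, 8, 9]
pred = [1, 2, 7, 4, 5, 6, 6, 7, 8, 9]
dist = levenshtein(pred, true)               # -> 2
accuracy = (1 - dist / len(true)) * 100      # Levenshtein accuracy
print(dist, f"{accuracy:.1f}%")              # 2 77.8%
```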
Conclusion
  • This paper presents a novel two-model hierarchical architecture for real-time hand gesture recognition systems.
  • The proposed architecture provides resource efficiency, early detections and single time activations, which are critical for real-time gesture recognition applications.
  • The authors have applied weighted averaging on the class probabilities over time, which improves the overall performance and allows early detection of gestures at the same time.
  • The authors achieved a single-time activation per gesture by using the difference between the two highest average class probabilities as a confidence measure (sketched below).
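As a rough illustration of these two post-processing steps together, the sketch below weight-averages incoming class probabilities and fires an activation only when the gap between the two highest averaged probabilities exceeds a threshold. The linearly increasing weights, the value of `tau`, and the reset after each activation are our assumptions; the paper's exact weighting function is not reproduced here.

```python
import numpy as np


def single_time_activations(prob_stream, tau: float = 0.15):
    """Weight-average class probabilities over time and emit each gesture
    at most once, when the two highest averaged probabilities differ
    by more than the confidence threshold tau."""
    history = []
    for probs in prob_stream:                      # probs: (num_classes,) array
        history.append(np.asarray(probs, dtype=float))
        weights = np.arange(1, len(history) + 1)   # later steps weigh more (assumed)
        avg = np.average(history, axis=0, weights=weights)
        top2 = np.sort(avg)[-2:]                   # two highest averaged probabilities
        if top2[1] - top2[0] > tau:                # confident enough: activate once
            yield int(np.argmax(avg))
            history.clear()                        # prevents repeated activations
```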
Tables
  • Table 1: Detector (ResNet-10) and classifier (ResNeXt-101) architectures. For ResNet-10, max pooling is not applied when an 8-frame input is used
  • Table 2: Detection results of the 8-frame ResNet-10 architecture on the test set of the EgoGesture dataset
  • Table 3: Comparison with the state of the art on the test set of the EgoGesture dataset
  • Table 4: Detector's binary classification accuracy scores on the test set of the EgoGesture dataset
  • Table 5: Detection results of the 8-frame ResNet-10 architecture on the test set of the nvGesture dataset
  • Table 6: Comparison with the state of the art on the test set of the nvGesture dataset
  • Table 7: Classifier's classification accuracy scores on the test set of the EgoGesture dataset
  • Table 8: Classifier's classification accuracy scores on the test set of the nvGesture dataset
Related work
  • The success of CNNs in object detection and classification tasks [10], [5] has created a growing trend to apply them in other areas of computer vision as well. For video analysis tasks, CNNs were initially extended to video action and activity recognition, where they have achieved state-of-the-art performance [18], [4].

    There have been various approaches using CNNs to extract spatio-temporal information from video data. Due to the success of 2D CNNs on static images, video analysis approaches initially applied 2D CNNs. In [18], [8], video frames are treated as multi-channel inputs to 2D CNNs. Temporal Segment Network (TSN) [22] divides a video into several segments, extracts information from color and optical-flow modalities for each segment using 2D CNNs, and then applies spatio-temporal modeling for action recognition. A convolutional long short-term memory (LSTM) architecture is proposed in [4], where the authors first extract features from video frames with a 2D CNN and then apply an LSTM for global temporal modeling (a minimal sketch of this pattern follows below). The strength of all these approaches comes from the fact that there are plenty of very successful 2D CNN architectures that can be pretrained on the very large-scale ImageNet dataset [3].
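As an illustration of the 2D-CNN-plus-LSTM pattern just described, a minimal PyTorch sketch might look as follows; the ResNet-18 backbone, the hidden size, and the last-step readout are illustrative choices, not the configuration used in [4].

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18


class CNNLSTM(nn.Module):
    """Per-frame 2D CNN features followed by an LSTM for temporal modeling."""

    def __init__(self, num_classes: int, hidden: int = 256):
        super().__init__()
        backbone = resnet18(weights=None)
        backbone.fc = nn.Identity()        # keep the 512-d frame embedding
        self.backbone = backbone
        self.lstm = nn.LSTM(512, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, time, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.backbone(clip.flatten(0, 1))  # (batch*time, 512)
        out, _ = self.lstm(feats.view(b, t, -1))   # temporal modeling
        return self.head(out[:, -1])               # logits from the last step
```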
Funding
  • We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research
References
  • [1] https://www.twentybn.com/datasets/jester/v1.
  • [2] K. S. Abhishek, L. C. F. Qubeley, and D. Ho. Glove-based hand gesture recognition sign language translator using capacitive touch sensor. In 2016 IEEE International Conference on Electron Devices and Solid-State Circuits (EDSSC), pages 334–337, Aug. 2016.
  • [3] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009.
  • [4] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2625–2634, 2015.
  • [5] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 580–587, 2014.
  • [6] K. Hara, H. Kataoka, and Y. Satoh. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6546–6555, 2018.
  • [7] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, Dec. 2015.
  • [8] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1725–1732, 2014.
  • [9] O. Kopuklu, N. Kose, and G. Rigoll. Motion fused frames: Data level fusion strategy for hand gesture recognition. arXiv preprint arXiv:1804.07187, 2018.
  • [10] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
  • [11] D. McNeill and E. Levy. Conceptual representations in language activity and gesture. ERIC Clearinghouse Columbus, 1980.
  • [12] P. Molchanov, S. Gupta, K. Kim, and K. Pulli. Multi-sensor system for driver's hand-gesture recognition. In 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), volume 1, pages 1–8, 2015.
  • [13] P. Molchanov, X. Yang, S. Gupta, K. Kim, S. Tyree, and J. Kautz. Online detection and classification of dynamic hand gestures with recurrent 3D convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4207–4215, 2016.
  • [14] P. Narayana, J. R. Beveridge, and B. A. Draper. Gesture recognition: Focus on the hands. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5235–5244, 2018.
  • [15] E. Ohn-Bar and M. M. Trivedi. Hand gesture recognition in real time for automotive interfaces: A multimodal vision-based approach and evaluations. IEEE Transactions on Intelligent Transportation Systems, 15(6):2368–2377, 2014.
  • [16] V. I. Pavlovic, R. Sharma, and T. S. Huang. Visual interpretation of hand gestures for human-computer interaction: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):677–695, 1997.
  • [17] P. Y. Simard, D. Steinkraus, J. C. Platt, et al. Best practices for convolutional neural networks applied to visual document analysis. In ICDAR, volume 3, pages 958–962, 2003.
  • [18] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pages 568–576, 2014.
  • [19] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. arXiv preprint arXiv:1412.0767, Dec. 2014.
  • [20] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In IEEE International Conference on Computer Vision (ICCV), pages 4489–4497, 2015.
  • [21] D. Tran, J. Ray, Z. Shou, S.-F. Chang, and M. Paluri. ConvNet architecture search for spatiotemporal feature learning. arXiv preprint arXiv:1708.05038, 2017.
  • [22] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In European Conference on Computer Vision (ECCV), pages 20–36, 2016.
  • [23] R. Wen, L. Yang, C.-K. Chui, K.-B. Lim, and S. Chang. Intraoperative visual guidance and control interface for augmented reality robotic surgery. In 2010 8th IEEE International Conference on Control and Automation (ICCA), pages 947–952, July 2010.
  • [24] Y. Zhang, C. Cao, J. Cheng, and H. Lu. EgoGesture: A new dataset and benchmark for egocentric hand gesture recognition. IEEE Transactions on Multimedia, 20(5):1038–1050, May 2018.
  • [25] G. Zhu, L. Zhang, P. Shen, and J. Song. Multimodal gesture recognition using 3-D convolution and convolutional LSTM. IEEE Access, 5:4517–4524, 2017.
Author
Okan Köpüklü