Open-Vocabulary Skeleton Action Recognition with Diffusion Graph Convolutional Network and Pre-Trained Vision-Language Models

ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024

This study explores unsupervised open-vocabulary skeleton action recognition and aims to address the inaccurate spatial matching and poor interpretability of existing GCN models. We present Skeleton-DGCFA, an approach that performs feature alignment (FA) between the skeleton and image modalities, using a large pre-trained vision and language (VL) model together with our new diffusion graph convolutional (DGC) skeleton encoder. The DGC encoder comprises spatial and temporal convolutional modules that allow different graph semantic features to diffuse. Skeleton-DGCFA harnesses recent large-scale VL models and extends their zero-shot capabilities to the skeleton modality by capitalizing on its natural pairing with images. The open-vocabulary zero-shot capabilities improve with the strength of both the pre-trained VL model and our DGC skeleton encoder. We establish a new state of the art in zero-shot skeleton action recognition, surpassing the vanilla zero-shot method by 27.0% and 19.7% on NTU-60 and NTU-120, respectively.
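
As a rough illustration of how aligned embeddings make open-vocabulary zero-shot prediction possible, the minimal sketch below assumes that a skeleton clip and each candidate action label have already been mapped into one shared embedding space. The function names, the temperature value, and the toy vectors are hypothetical and are not taken from the paper; in a real pipeline the skeleton embedding would come from a skeleton encoder and the label embeddings from the text encoder of a VL model.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def zero_shot_classify(skeleton_embedding, label_embeddings, temperature=0.07):
    """Pick the label whose embedding is most similar to the skeleton embedding.

    label_embeddings maps a label name to its embedding (hypothetical inputs).
    temperature is an assumed softmax scaling factor, not a value from the paper.
    """
    s = l2_normalize(skeleton_embedding)
    names = list(label_embeddings)
    logits = []
    for name in names:
        t = l2_normalize(label_embeddings[name])
        cosine = sum(a * b for a, b in zip(s, t))
        logits.append(cosine / temperature)

    # Numerically stable softmax over the scaled similarities.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    probabilities = {name: e / total for name, e in zip(names, exps)}
    best = max(probabilities, key=probabilities.get)
    return best, probabilities


# Toy embeddings, made up for illustration only.
label_embeddings = {
    "waving": [1.0, 0.0, 0.0],
    "sitting down": [0.0, 1.0, 0.0],
    "kicking": [0.0, 0.0, 1.0],
}
skeleton_embedding = [0.9, 0.1, 0.2]

predicted, probabilities = zero_shot_classify(skeleton_embedding, label_embeddings)
print(predicted)
```

Because the label set is just a dictionary of text embeddings, new action names can be added at inference time without retraining, which is the sense in which such a classifier is open-vocabulary.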
Keywords
action recognition, diffusion graph convolution, vision and language model, open-vocabulary zero-shot learning