Let's Play Music: Audio-Driven Performance Video Generation

Hao Zhu, Yi Li, Feixia Zhu, Aihua Zheng, Ran He

2020 25th International Conference on Pattern Recognition (ICPR) (2020)

Abstract

We propose a new task, Audio-driven Performance Video Generation (APVG), which aims to synthesize a video of a person playing a given instrument, guided by a music audio clip. Generating high-dimensional, temporally consistent video from the low-dimensional audio modality is challenging. In this paper, we propose a multi-stage framework that generates realistic, synchronized performance video from the given music. First, we provide both global appearance and local spatial information by generating, from the given music, coarse videos and keypoints of the body and hands, respectively. Then, we transform the generated keypoints into heatmaps via a differentiable space transformer, since heatmaps provide more spatial information but are harder to generate directly from audio. Finally, we propose a Structured Temporal UNet (STU) to extract both intra-frame structured information and inter-frame temporal consistency, which are obtained via a graph-based structure module and a CNN-GRU-based high-level temporal module, respectively, for final video generation. Comprehensive experiments validate the effectiveness of the proposed framework.
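
The keypoint-to-heatmap step can be illustrated with a minimal sketch. The abstract does not specify how the transformation is implemented, so the Gaussian rendering below is an assumption (a common way to turn coordinates into heatmaps), and the function name `keypoints_to_heatmaps` and the parameter `sigma` are hypothetical. The output is a smooth function of the keypoint coordinates, which is what would make such a step differentiable in an autodiff framework; NumPy is used here only to show the computation.

```python
import numpy as np

def keypoints_to_heatmaps(keypoints, height, width, sigma=2.0):
    """Render one Gaussian heatmap per (x, y) keypoint.

    Assumed illustration, not the paper's implementation.
    keypoints: array-like of shape (K, 2) in pixel coordinates (x, y).
    Returns an array of shape (K, height, width) with values in [0, 1].
    """
    keypoints = np.asarray(keypoints, dtype=np.float64)
    ys = np.arange(height, dtype=np.float64)[None, :, None]  # (1, H, 1)
    xs = np.arange(width, dtype=np.float64)[None, None, :]   # (1, 1, W)
    kx = keypoints[:, 0][:, None, None]                      # (K, 1, 1)
    ky = keypoints[:, 1][:, None, None]                      # (K, 1, 1)
    sq_dist = (xs - kx) ** 2 + (ys - ky) ** 2                # (K, H, W)
    # Each heatmap peaks at its keypoint and decays smoothly with distance.
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

# Two hypothetical keypoints rendered on a 64x64 grid.
heatmaps = keypoints_to_heatmaps([[10, 20], [40, 5]], height=64, width=64)
print(heatmaps.shape)
```

Each keypoint becomes a full spatial map, which is one way to read the abstract's remark that heatmaps carry more spatial information than raw coordinates.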

Keywords

intra-frame structured information, interframe temporal consistency, final video generation, audio-driven performance video generation, given music audio clip, high-dimensional temporal consistent videos, low-dimensional audio modality, realistic performance video, synchronized performance video, global appearance, local spatial information, coarse videos, generated keypoints
