Audio-Driven Talking Video Frame Restoration

IEEE TRANSACTIONS ON MULTIMEDIA (2024)

Citations: 2 | Views: 55
Abstract
Talking video frames occasionally drop during streaming due to issues such as network errors, which greatly hurts online team collaboration and user experience. Directly generating the dropped frames from the remaining ones is unfavorable, since a person's lip motion is usually non-linear and thus hard to restore when consecutive frames are missing. Nevertheless, the audio content provides strong signals for lip motion and is less likely to drop during transmission. Inspired by this, as an initial attempt, we present the task of audio-driven talking video frame restoration in this paper, i.e., restoring dropped video frames by jointly leveraging the audio and the remaining video frames. Toward high-quality frame generation, we devise a cross-modal frame restoration network, which aligns the complete audio content with the video frames, precisely identifies the dropped frames, and sequentially generates them. To evaluate our model, we construct a new dataset, Talking Video Frames Drop (TVFD for short), consisting of 2.5K videos and 144K frames in total. We conduct extensive experiments on TVFD and another publicly available dataset, VoxCeleb2. Our model achieves significantly improved performance compared to other state-of-the-art competitors.
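The abstract and keywords mention aligning the complete audio content with the remaining video frames via dynamic programming. The paper's network does this with learned cross-modal features; as a hedged illustration only, the alignment step can be sketched with classic dynamic time warping (DTW) over hypothetical per-frame feature vectors (`audio_feats`, `video_feats` are stand-ins, not the paper's representations):

```python
import numpy as np

def dtw_align(audio_feats, video_feats):
    """Align two feature sequences by dynamic programming (classic DTW).

    Hypothetical sketch: Euclidean distance between plain vectors stands in
    for the paper's learned cross-modal similarity. Returns the total
    alignment cost and the warping path as (audio_idx, video_idx) pairs.
    """
    n, m = len(audio_feats), len(video_feats)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(audio_feats[i - 1] - video_feats[j - 1])
            cost[i, j] = d + min(cost[i - 1, j - 1],  # match both
                                 cost[i - 1, j],      # advance audio only
                                 cost[i, j - 1])      # advance video only
    # Backtrack from the end to recover the optimal alignment path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return cost[n, m], path[::-1]
```

In the frame-drop setting, video positions that the path skips (audio steps with no matched video frame) would mark the dropped frames to be generated.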
Keywords
Streaming media, Faces, Lips, Task analysis, Image restoration, Visualization, Synchronization, Frame Restoration, Frame-Dropped Video, Cross-Modal Learning, Dynamic Programming, Generative Adversarial Network