AVoiD-DF: Audio-Visual Joint Learning for Detecting Deepfake.

IEEE Trans. Inf. Forensics Secur. (2023)

Abstract
Recently, deepfakes have raised severe concerns about the authenticity of online media. Prior work on deepfake detection has focused largely on capturing intra-modal artifacts. However, deepfake videos in real-world scenarios often combine audio and visual content. In this paper, we propose Audio-Visual Joint Learning for Detecting Deepfake (AVoiD-DF), which exploits audio-visual inconsistency for multi-modal forgery detection. Specifically, AVoiD-DF begins by embedding temporal-spatial information in a Temporal-Spatial Encoder. A Multi-Modal Joint-Decoder is then designed to fuse multi-modal features and jointly learn their inherent relationships. Afterward, a Cross-Modal Classifier is devised to detect manipulation from inter-modal and intra-modal disharmony. Since existing datasets for deepfake detection mainly focus on one modality and cover only a few forgery methods, we build a novel benchmark, DefakeAVMiT, for multi-modal deepfake detection. DefakeAVMiT contains sufficient visuals with corresponding audio, where either modality may be maliciously modified by multiple deepfake methods. Experimental results on DefakeAVMiT, FakeAVCeleb, and DFDC demonstrate that AVoiD-DF outperforms many state-of-the-art methods in deepfake detection. Our proposed method also shows superior generalization across various forgery techniques.
Keywords
Deepfake detection, multi-modal, audio-visual, joint learning
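
The abstract describes detecting forgeries through audio-visual inconsistency. The toy sketch below illustrates only that general idea. It is not the authors' Temporal-Spatial Encoder, Multi-Modal Joint-Decoder, or Cross-Modal Classifier, which are learned networks whose details the abstract does not give. It assumes each modality has already been turned into a sequence of equal-length feature vectors, and it uses parameter-free cross-attention plus cosine similarity to score how much the two modalities disagree. All function names here (`cross_attend`, `inconsistency_score`, and the helpers) are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Each query step takes a softmax-weighted average of the context steps."""
    dim = len(queries[0])
    attended = []
    for q in queries:
        weights = softmax([dot(q, c) / math.sqrt(dim) for c in context])
        attended.append(
            [sum(w * c[i] for w, c in zip(weights, context)) for i in range(dim)]
        )
    return attended


def mean_pool(sequence):
    steps = len(sequence)
    return [sum(step[i] for step in sequence) / steps
            for i in range(len(sequence[0]))]


def cosine(a, b):
    norm_a, norm_b = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot(a, b) / (norm_a * norm_b)


def inconsistency_score(audio_seq, visual_seq):
    """Return a value in [0, 1]; higher means the modalities disagree more.

    Assumption: both inputs are lists of feature vectors of the same length,
    produced by some upstream (hypothetical) per-modality encoder.
    """
    audio_to_visual = mean_pool(cross_attend(audio_seq, visual_seq))
    visual_to_audio = mean_pool(cross_attend(visual_seq, audio_seq))
    agreement = 0.5 * (
        cosine(mean_pool(audio_seq), audio_to_visual)
        + cosine(mean_pool(visual_seq), visual_to_audio)
    )
    # Map agreement from [-1, 1] to a disagreement score, clamped to [0, 1].
    return min(1.0, max(0.0, 0.5 * (1.0 - agreement)))


# Toy inputs: matching features versus deliberately opposed features.
audio = [[1.0, 0.0], [0.0, 1.0]]
matched_visual = [[1.0, 0.0], [0.0, 1.0]]
opposed_visual = [[-1.0, 0.0], [0.0, -1.0]]

matched_score = inconsistency_score(audio, matched_visual)
opposed_score = inconsistency_score(audio, opposed_visual)
```

On these toy inputs the matched pair scores lower than the opposed pair. A real detector would learn its encoders and classifier from labeled data rather than rely on a fixed similarity measure.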