AVT2-DWF: Improving Deepfake Detection with Audio-Visual Fusion and Dynamic Weighting Strategies
CoRR(2024)
Abstract
With the continuous improvements of deepfake methods, forgery messages have
transitioned from single-modality to multi-modal fusion, posing new challenges
for existing forgery detection algorithms. In this paper, we propose AVT2-DWF,
the Audio-Visual dual Transformers grounded in Dynamic Weight Fusion, which
aims to amplify both intra- and cross-modal forgery cues, thereby enhancing
detection capabilities. AVT2-DWF adopts a dual-stage approach to capture both
spatial characteristics and temporal dynamics of facial expressions. This is
achieved through a face transformer encoder with an n-frame-wise tokenization
strategy and an audio transformer encoder. Subsequently, it uses multi-modal
conversion with dynamic weight fusion to address the challenge of heterogeneous
information fusion between the audio and visual modalities. Experiments on the
DeepfakeTIMIT, FakeAVCeleb, and DFDC datasets indicate that AVT2-DWF achieves
state-of-the-art performance in both intra- and cross-dataset deepfake
detection. Code is available at https://github.com/raining-dev/AVT2-DWF.
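The abstract does not specify how the dynamic weights are computed. A minimal NumPy sketch of one plausible dynamic-weight fusion scheme, in which each modality embedding is scored by a learned projection and the scores are softmaxed into per-sample fusion weights (the function name, the scoring projections `w_v`/`w_a`, and the weighted-sum form are all assumptions, not the paper's actual method):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def dynamic_weight_fusion(video_feat, audio_feat, w_v, w_a):
    """Hypothetical sketch of dynamic weight fusion.

    Each modality embedding is reduced to a scalar score via a learned
    projection vector; the two scores are softmaxed into fusion weights
    that depend on the current sample, and the fused representation is
    the weighted sum of the two embeddings.
    """
    scores = np.array([video_feat @ w_v, audio_feat @ w_a])
    weights = softmax(scores)  # dynamic, per-sample modality weights
    fused = weights[0] * video_feat + weights[1] * audio_feat
    return fused, weights
```

Because the weights are recomputed from the features of each input, the fusion can lean on whichever modality carries stronger forgery cues for that sample, which is the intuition the abstract attributes to dynamic weight fusion.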