Multi-space channel representation learning for mono-to-binaural conversion based audio deepfake detection

Rui Liu, Jinhua Zhang, Guanglai Gao

Information Fusion (2024)

Abstract
Audio deepfake detection (ADD) is an emerging topic that aims to detect fake audio generated by text-to-speech (TTS), voice conversion (VC), and similar techniques. Traditional approaches read the mono signal and analyze its artifacts directly. Recently, the mono-to-binaural conversion based ADD approach has attracted increasing attention, since binaural audio signals provide a unique and comprehensive perspective on speech perception. Such methods first convert the mono audio into binaural audio, then process the left and right channels separately to discover authenticity cues. However, the acoustic information from the two channels exhibits both differences and similarities, which have not been thoroughly explored in previous research. To address this issue, we propose a new mono-to-binaural conversion based ADD framework that considers multi-space channel representation learning, termed "MSCR-ADD". Specifically, (1) the feature representations of the respective channels are learned by the channel-specific encoder and stored in the channel-specific space; (2) the feature representations capturing the difference between the two channels are learned by the channel-differential encoder and stored in the channel-differential space; (3) the channel-invariant encoder then learns representations of the commonality between the channels in the channel-invariant space. Note that we propose orthogonal and mutual information maximization losses to constrain the channel-specific and channel-invariant encoders. Finally, the three representations from the different spaces are fused to make the final deepfake detection decision. It is worth noting that the feature representations in the channel-differential and channel-invariant spaces unveil the differences and similarities between the two channels in binaural audio, enabling us to effectively detect artifacts in fake audio. The experimental results on four benchmark datasets demonstrate that our MSCR-ADD is superior to existing state-of-the-art approaches.
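As a rough illustration of the three representation spaces described in the abstract, the following minimal sketch encodes a left/right feature pair into channel-specific, channel-differential, and channel-invariant vectors and fuses them. It is not the authors' implementation: the linear encoders, the use of the left-minus-right difference as the differential input, the averaging in the shared encoder, the concatenation used for fusion, and the squared-Frobenius form of the orthogonality penalty are all assumptions made for illustration, and every function name is hypothetical. The mutual information maximization loss is not sketched.

```python
import random


def random_weights(rows, cols, rng):
    """Hypothetical stand-in for a learned encoder: a random weight matrix."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def linear(weights, x):
    """Apply a weight matrix (list of rows, no bias) to a feature vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def orthogonality_loss(batch_a, batch_b):
    """Squared Frobenius norm of A^T B for two batches of row vectors.

    This is zero when every feature dimension of A has a zero inner
    product, taken over the batch, with every feature dimension of B.
    Assumed form; the abstract does not specify the loss.
    """
    dim_a, dim_b = len(batch_a[0]), len(batch_b[0])
    total = 0.0
    for i in range(dim_a):
        for j in range(dim_b):
            dot = sum(a[i] * b[j] for a, b in zip(batch_a, batch_b))
            total += dot * dot
    return total


def encode_spaces(left, right, w_left, w_right, w_diff, w_shared):
    """Map one left/right feature pair into the three spaces and fuse them."""
    # Channel-specific space: one encoder per channel.
    specific = linear(w_left, left) + linear(w_right, right)
    # Channel-differential space: assumed here to encode left minus right.
    differential = linear(w_diff, [l - r for l, r in zip(left, right)])
    # Channel-invariant space: one shared encoder applied to both channels,
    # averaged (an assumption).
    inv_l, inv_r = linear(w_shared, left), linear(w_shared, right)
    invariant = [(a + b) / 2 for a, b in zip(inv_l, inv_r)]
    # Fusion by concatenation (an assumption; the abstract only says "mixed").
    fused = specific + differential + invariant
    return {
        "specific": specific,
        "differential": differential,
        "invariant": invariant,
        "fused": fused,
    }


rng = random.Random(0)
in_dim, out_dim = 4, 3
w_left, w_right, w_diff, w_shared = (
    random_weights(out_dim, in_dim, rng) for _ in range(4)
)
left = [rng.uniform(-1.0, 1.0) for _ in range(in_dim)]
right = [rng.uniform(-1.0, 1.0) for _ in range(in_dim)]
spaces = encode_spaces(left, right, w_left, w_right, w_diff, w_shared)
```

In a trained system the orthogonality penalty would be added to the classification loss so that the channel-specific and channel-invariant features carry non-overlapping information; here it is only a standalone function.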
Keywords
Audio deepfake detection (ADD), Mono-to-binaural conversion, Multi-space channel representation (MSCR) learning