Synthetic Speech Detection through Audio Folding

MAD '23: Proceedings of the 2nd ACM International Workshop on Multimedia AI against Disinformation (2023)

Abstract
Recent advances in deep learning and speech synthesis have made it possible to create highly realistic fake speech tracks that are difficult to distinguish from real ones. Since malicious use of such data can have dangerous consequences, the audio forensics community has focused on developing synthetic speech detectors to determine the authenticity of speech tracks. In this work we focus on the wide class of detectors that analyze audio streams on a frame-by-frame basis. We propose a technique to reduce the inference time of these detectors, relying on the fact that multiple audio frames can be mixed into a single one (in the same way a mono track is obtained from a stereo one). We test the proposed audio folding technique on speech tracks obtained from the ASVspoof 2019 dataset. The technique proves effective on both entirely and partially fake speech tracks and shows remarkable results, reducing processing time down to 25%.
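The mixing step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `fold_audio`, the choice to mix consecutive frames, the zero-padding of the tail, and the use of a plain average (by analogy with a stereo-to-mono downmix) are all assumptions made for illustration.

```python
import numpy as np

def fold_audio(signal, frame_len, fold_factor):
    """Mix groups of `fold_factor` consecutive frames into single frames.

    Hypothetical helper: the framing, padding, and averaging choices are
    assumptions, not details taken from the paper.
    """
    signal = np.asarray(signal, dtype=np.float64)
    group_len = frame_len * fold_factor

    # Zero-pad the tail so the signal splits evenly into groups of frames.
    n_groups = int(np.ceil(len(signal) / group_len))
    padded = np.zeros(n_groups * group_len)
    padded[:len(signal)] = signal

    # Shape (n_groups, fold_factor, frame_len): each group holds the frames
    # that will be mixed together.
    frames = padded.reshape(n_groups, fold_factor, frame_len)

    # Average across the frames in each group, as in a stereo-to-mono downmix.
    return frames.mean(axis=1)
```

Under these assumptions, a frame-based detector would be run on each row of the returned array instead of on every original frame, so a fold factor of 4 would give it a quarter as many frames to process.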
Keywords
Audio Forensics, Synthetic Speech, Digital Signal Processing, Audio Folding