Transformer Ensemble for Synthesized Speech Detection.

Asilomar Conference on Signals, Systems and Computers (2023)

Abstract
As voice synthesis systems and deep learning tools continue to improve, so does the possibility that synthesized speech can be used for nefarious purposes. Methods that determine whether audio signals contain synthesized or authentic speech are needed. In this paper, we investigate three transformers to detect synthesized speech: Compact Convolutional Transformer (CCT), Patchout faSt Spectrogram Transformer (PaSST), and Self-Supervised Audio Spectrogram Transformer (SSAST). We show that each transformer independently detects synthesized speech well. Then, we propose an ensemble of transformers that can provide even better performance. Finally, we explore how much of an audio signal is needed to detect synthesized speech with high performance. Evaluating on the ASVspoof2019 dataset, we demonstrate that our transformer ensemble detects synthesized speech from shorter segments of audio signals, even on a highly imbalanced dataset.
Keywords
deep learning, audio forensics, synthesized speech detection, transformers, mel spectrograms
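
The abstract does not say how the outputs of the three transformers are combined. As a minimal sketch of one common option, the example below assumes score-level fusion: each detector emits a probability that a clip is synthesized, and the ensemble takes a weighted average of those probabilities. The function names `ensemble_scores` and `classify`, the 0.5 threshold, and the example scores are all hypothetical and are not taken from the paper.

```python
import numpy as np

def ensemble_scores(per_model_probs, weights=None):
    """Fuse per-model probabilities that each clip is synthesized.

    per_model_probs: array-like of shape (n_models, n_clips).
    weights: optional per-model weights; equal weighting is assumed if omitted.
    """
    probs = np.asarray(per_model_probs, dtype=float)
    if weights is None:
        weights = np.ones(probs.shape[0])
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalize so the weights sum to 1
    return weights @ probs             # weighted average over models

def classify(per_model_probs, threshold=0.5):
    """Label a clip as synthesized when the fused score reaches the threshold."""
    return ensemble_scores(per_model_probs) >= threshold

# Made-up scores for three hypothetical detectors on four clips.
scores = [
    [0.9, 0.2, 0.6, 0.4],  # detector A (illustrative)
    [0.8, 0.1, 0.4, 0.7],  # detector B (illustrative)
    [0.7, 0.3, 0.8, 0.1],  # detector C (illustrative)
]
fused = ensemble_scores(scores)
labels = classify(scores)
```

Averaging is only one possible fusion rule. On a highly imbalanced dataset, the decision threshold would normally be tuned on held-out data rather than fixed at 0.5.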