Extracting Efficient Spectrograms From MP3 Compressed Speech Signals for Synthetic Speech Detection.

IH&MMSec(2023)

Abstract
Many speech signals are compressed with MP3 to reduce the data rate. Many synthetic speech detection methods use the spectrogram of the speech signal, which usually requires the speech signal to be fully decompressed. We show that the design of MP3 compression allows one to approximate the spectrogram of MP3 compressed speech efficiently without fully decoding it. We call the spectrograms obtained with our proposed approach Efficient Spectrograms (E-Specs). E-Spec can reduce the complexity of spectrogram computation by ~77.60 percentage points (p.p.) and save ~37.87 p.p. of MP3 decoding time. E-Spec bypasses the reconstruction artifacts introduced by the MP3 synthesis filterbank, which makes it useful in speech forensics tasks. We tested E-Spec on synthetic speech detection, where a detector is asked to determine whether a speech signal is synthesized or recorded from a human. We examined 4 different neural network architectures to evaluate the performance of E-Spec compared to speech features extracted from the fully decoded speech signal. E-Spec achieved the best synthetic speech detection performance for 3 architectures; it also achieved the best overall detection performance across architectures. The computation of E-Spec is an approximation to the Short-Time Fourier Transform (STFT). E-Spec can be extended to other audio compression methods.
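For orientation, the quantity that E-Spec is said to approximate is the STFT magnitude spectrogram, conventionally computed on the fully decoded waveform. The sketch below shows only that conventional baseline, not the E-Spec method itself, whose details are not given in the abstract. The function name `stft_spectrogram`, the Hann window, and the frame and hop sizes are illustrative assumptions, not values from the paper.

```python
import numpy as np

def stft_spectrogram(signal, frame_len=512, hop=256):
    """Baseline STFT magnitude spectrogram of a decoded waveform.

    Illustrative only: window type and sizes are assumptions,
    and this is NOT the E-Spec computation described in the paper.
    """
    signal = np.asarray(signal, dtype=np.float64)
    if signal.size < frame_len:
        raise ValueError("signal is shorter than one frame")

    window = np.hanning(frame_len)
    n_frames = 1 + (signal.size - frame_len) // hop

    # Slice the waveform into overlapping, windowed frames.
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])

    # Real FFT per frame; transpose to (frequency bins, time frames).
    return np.abs(np.fft.rfft(frames, axis=1)).T


if __name__ == "__main__":
    sr = 16000                              # assumed sample rate in Hz
    t = np.arange(sr) / sr                  # one second of audio
    tone = np.sin(2 * np.pi * 1000 * t)     # 1 kHz test tone
    spec = stft_spectrogram(tone)
    print(spec.shape)
```

Each column of the result is one time frame and each row one frequency bin, which is the layout typically fed to the neural network detectors the abstract mentions.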
Keywords
synthetic speech detection, deep learning, signal processing, MP3 compression, audio compression, ASVspoof19