Multimodal Voice Conversion under Adverse Environment Using a Deep Convolutional Neural Network

IEEE Access (2019)

Abstract
This paper presents a voice conversion (VC) technique for noisy environments. Typical VC methods rely only on audio information and assume a noiseless environment, so existing conversion methods do not always achieve satisfactory results under adverse acoustic conditions. To address this problem, we propose a multimodal voice conversion model based on a deep convolutional neural network (MDCNN), built by combining two convolutional neural networks (CNNs) and a deep neural network (DNN), for VC in noisy environments. In the MDCNN, both acoustic and visual information are incorporated into the voice conversion to improve its robustness under adverse acoustic conditions. The two CNNs extract acoustic and visual features, respectively, and the DNN captures the nonlinear mapping between source speech and target speech. Experimental results indicate that the proposed MDCNN outperforms two existing approaches in noisy environments.
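The pipeline described above (two convolutional feature extractors whose outputs are fused and fed to a feedforward mapping network) can be sketched as follows. This is a minimal illustrative skeleton, not the authors' implementation: the kernels, layer sizes, and random weights are placeholders, and real acoustic/visual inputs would be MFCC frames and lip-region images rather than short 1-D lists.

```python
# Hypothetical sketch of the MDCNN pipeline from the abstract:
# two convolutional branches (acoustic, visual) feed a DNN that maps
# source-speaker features toward target-speaker features.
# All weights and dimensions are illustrative placeholders.
import random

random.seed(0)

def conv1d(signal, kernel):
    """Valid-mode 1-D convolution used as a toy feature extractor."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def relu(xs):
    return [max(0.0, x) for x in xs]

def dense(xs, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, xs)) + b
            for row, b in zip(weights, bias)]

def mdcnn_forward(acoustic, visual, n_out=4):
    # CNN branch 1: acoustic features (MFCCs in the paper)
    a_feat = relu(conv1d(acoustic, [0.25, 0.5, 0.25]))
    # CNN branch 2: visual features (lip-region frames in the paper)
    v_feat = relu(conv1d(visual, [0.5, 0.5]))
    # Fusion: concatenate the two modalities
    fused = a_feat + v_feat
    # DNN: nonlinear mapping toward target-speech features
    weights = [[random.uniform(-0.1, 0.1) for _ in fused]
               for _ in range(n_out)]
    bias = [0.0] * n_out
    return dense(fused, weights, bias)

out = mdcnn_forward(acoustic=[0.1, 0.4, -0.2, 0.3, 0.0],
                    visual=[1.0, 0.5, 0.25, 0.125])
print(len(out))  # dimensionality of the target-feature estimate
```

Because the visual branch is fused before the mapping network, the DNN can lean on lip-shape cues when the acoustic branch is corrupted by noise, which is the intuition behind the model's noise robustness.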
Keywords
Feature extraction, Visualization, Acoustics, Noise measurement, Lips, Convolutional neural nets, Audio and video feature fusion, convolutional neural network, deep learning, mel-frequency cepstral coefficients, multilayer feedforward neural networks, multimodal voice conversion, noise robustness