On Training Targets for Supervised Speech Separation

IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 12 (2014): 1849–1858


Abstract

Formulation of speech separation as a supervised learning problem has shown considerable promise. In its simplest form, a supervised learning algorithm, typically a deep neural network, is trained to learn a mapping from noisy features to a time-frequency representation of the target of interest. Traditionally, the ideal binary mask (IBM)…

Introduction
  • Speech separation, which is the task of separating speech from a noisy mixture, has major applications, such as robust automatic speech recognition (ASR), hearing aid design, and mobile speech communication.
  • Monaural speech separation is perhaps most desirable from the application standpoint.
Highlights
  • Traditionally, the ideal binary mask (IBM) has been used as the training target for supervised speech separation
  • The compared targets fall into three categories: binary masking based (the IBM and the target binary mask (TBM)), ratio masking based (the ideal ratio mask (IRM) and the FFT-MASK), and spectral envelope based (the FFT magnitude (FFT-MAG) and the Gammatone filterbank power spectrum (GF-POW)); the masking-based targets are defined in the sketch after this list
  • We found that binary masking leads to slightly worse objective intelligibility results than ratio masking
  • An unexpected finding of this study is that direct prediction of spectral envelopes produces the worst results, as best illustrated by the substantial performance gap between FFT-MAG and FFT-MASK, even though the two targets are essentially alternative views of the same underlying goal: the clean speech magnitude
  • Aside from the analysis presented in Section V-B, which points to the issue of nonlinear compression, we believe that masking has several advantages over spectral envelope estimation
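For concreteness, here is a sketch of the masking-based targets as they are conventionally defined over time-frequency (T-F) units (t, f); the local SNR criterion LC and the IRM exponent β = 0.5 are the usual choices, and the notation is illustrative rather than quoted verbatim from the paper:

```latex
% Ideal binary mask: 1 where the local SNR exceeds a criterion LC, 0 otherwise
\mathrm{IBM}(t,f) =
  \begin{cases}
    1, & \text{if } \mathrm{SNR}(t,f) > \mathit{LC} \\
    0, & \text{otherwise}
  \end{cases}

% Ideal ratio mask: a soft speech-presence weight per T-F unit, where
% S^2(t,f) and N^2(t,f) denote speech and noise energies (beta = 0.5 is typical)
\mathrm{IRM}(t,f) = \left( \frac{S^2(t,f)}{S^2(t,f) + N^2(t,f)} \right)^{\beta}

% FFT-MASK: the ratio of clean to noisy STFT magnitudes
\mathrm{FFT\text{-}MASK}(t,f) = \frac{|S(t,f)|}{|Y(t,f)|}
```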
Results
  • Since the log magnitude is not bounded, the authors use linear output units in the DNNs when predicting it; alternatively, they apply percent normalization, which linearly scales the data to the range [0, 1]
  • The authors also normalize the magnitudes by performing log compression followed by percent normalization, and use sigmoidal output units
  • The authors believe log + percent normalization performs better because it preserves spectral detail while simultaneously making the target bounded
  • The authors use this normalization scheme when predicting spectral magnitude/energy-based targets in the remaining experiments (both normalization schemes are sketched below)
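A minimal sketch of the two normalization schemes described above, assuming NumPy arrays of spectral magnitudes; the function names, the epsilon guard, and the use of global min/max statistics are illustrative assumptions, not details from the paper:

```python
import numpy as np

def percent_normalize(x):
    """Percent normalization: linearly scale the data to the range [0, 1]."""
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo)

def log_percent_normalize(mag, eps=1e-8):
    """Log compression followed by percent normalization.

    The log step preserves low-energy spectral detail, and the min-max
    scaling makes the target bounded, so sigmoidal output units apply.
    """
    return percent_normalize(np.log(mag + eps))  # eps avoids log(0)
```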
Conclusion
  • Choosing a suitable training target is critical for supervised learning, as it is directly related to the underlying computational goal.
  • The authors found that binary masking leads to slightly worse objective intelligibility results than ratio masking
  • This is likely because predicting ratio targets is less sensitive to estimation errors than predicting binary targets.
  • An unexpected finding of this study is that direct prediction of spectral envelopes produces the worst results, as best illustrated by the substantial performance gap between FFT-MAG and FFT-MASK, even though the two targets are essentially alternative views of the same underlying goal: the clean speech magnitude (a worked comparison follows this list)
  • Ideal masks are likely easier to learn than spectral envelopes, as their spectrotemporal patterns are more stable with respect to speaker variations
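To make the "two views of the same goal" point concrete: FFT-MAG asks the network to output the clean magnitude directly, while FFT-MASK asks it to output a ratio that rescales the known noisy magnitude. In the sketch below, g_θ denotes the learned DNN mapping from noisy features y; this notation is mine, not the paper's:

```latex
% Direct spectral mapping (FFT-MAG): predict the clean magnitude itself
\widehat{|S|}(t,f) = g_{\theta}(\mathbf{y})

% Masking (FFT-MASK): predict the mask, then rescale the noisy magnitude
\widehat{|S|}(t,f) = \widehat{M}(t,f)\,|Y(t,f)|,
\qquad \text{where } M(t,f) = \frac{|S(t,f)|}{|Y(t,f)|}
```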
Tables
  • Table 1: PERFORMANCE ON FACTORY1 WHEN THE CLEAN MAGNITUDES ARE PREDICTED
  • Table 2: GENERALIZATION PERFORMANCE ON TWO UNSEEN NOISES AT -5 DB
  • Table 3: PERFORMANCE COMPARISONS BETWEEN VARIOUS TARGETS AND SYSTEMS ON -5 DB MIXTURES
  • Table 4: PERFORMANCE COMPARISONS BETWEEN VARIOUS TARGETS AND SYSTEMS ON 5 DB MIXTURES
  • Table 5: GENERALIZATION PERFORMANCE ON TWO UNSEEN NOISES AT 0 DB
  • Table 6: PERFORMANCE COMPARISONS BETWEEN VARIOUS TARGETS AND SYSTEMS ON 0 DB MIXTURES
  • Table 7: GENERALIZATION PERFORMANCE ON TWO UNSEEN NOISES AT 5 DB
Funding
  • This work was supported in part by the Air Force Office of Scientific Research (AFOSR) under Grant FA9550-12-1-0130, the National Institute on Deafness and Other Communication Disorders (NIDCD) under Grant R01 DC012048, a Small Business Technology Transfer (STTR) subcontract from Kuzer, and the Ohio Supercomputer Center.
Study subjects and analysis
Utterances from unseen speakers: 192
We use 2000 randomly chosen utterances from the TIMIT [7] training set as our training utterances. The TIMIT core test set, which consists of 192 utterances from unseen speakers of both genders, is used as the test set. We use speech-shaped noise (SSN) and four other noises from the NOISEX dataset [30] as our training and test noises.
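As context for how such training and test material is typically assembled, here is a hedged sketch of mixing a clean utterance with noise at a target SNR; the scaling formula is standard practice, and the function is illustrative since the paper's exact mixing procedure is not reproduced here:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the mixture has the requested SNR."""
    noise = noise[: len(speech)]              # trim noise to utterance length
    p_speech = np.mean(speech ** 2)           # speech power
    p_noise = np.mean(noise ** 2)             # noise power before scaling
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise             # e.g., snr_db in {-5, 0, 5}
```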

References
  • [1] M. Anzalone, L. Calandruccio, K. Doherty, and L. Carney, "Determination of the potential benefit of time-frequency gain manipulation," Ear Hear., vol. 27, no. 5, pp. 480–492, 2006.
  • [2] D. Brungart, P. Chang, B. Simpson, and D. Wang, "Isolating the energetic component of speech-on-speech masking with ideal time-frequency segregation," J. Acoust. Soc. Amer., vol. 120, pp. 4007–4018, 2006.
  • [3] C. Chen and J. Bilmes, "MVA processing of speech features," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 1, pp. 257–270, Jan. 2007.
  • [4] J. Chen, Y. Wang, and D. Wang, "A feature study for classification-based speech separation at very low signal-to-noise ratio," in Proc. ICASSP, 2014, pp. 7059–7063.
  • [5] J. Duchi, E. Hazan, and Y. Singer, "Adaptive subgradient methods for online learning and stochastic optimization," J. Mach. Learn. Res., vol. 12, pp. 2121–2159, 2011.
  • [6] J. Erkelens, R. Hendriks, R. Heusdens, and J. Jensen, "Minimum mean-square error estimation of discrete Fourier coefficients with generalized gamma priors," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 6, pp. 1741–1752, Aug. 2007.
  • [7] J. Garofolo, DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus. Gaithersburg, MD, USA: Nat. Inst. of Standards Technol., 1993.
  • [8] X. Glorot, A. Bordes, and Y. Bengio, "Deep sparse rectifier neural networks," in Proc. 14th Int. Conf. Artif. Intell. Statist. (AISTATS), 2011, pp. 315–323.
  • [9] C. Gulcehre and Y. Bengio, "Knowledge matters: Importance of prior information for optimization," in Proc. Int. Conf. Learn. Representat. (ICLR), 2013.
  • [10] K. Han and D. Wang, "A classification based approach to speech segregation," J. Acoust. Soc. Amer., vol. 132, pp. 3475–3483, 2012.
  • [11] K. Han, Y. Wang, and D. Wang, "Learning spectral mapping for speech dereverberation," in Proc. ICASSP, 2014, pp. 4648–4652.
  • [12] E. Healy, S. Yoho, Y. Wang, and D. Wang, "An algorithm to improve speech recognition in noise for hearing-impaired listeners," J. Acoust. Soc. Amer., pp. 3029–3038, 2013.
  • [13] R. Hendriks, R. Heusdens, and J. Jensen, "MMSE based noise PSD tracking with low complexity," in Proc. ICASSP, 2010, pp. 4266–4269.
  • [14] J. R. Hershey, S. J. Rennie, P. A. Olsen, and T. T. Kristjansson, "Super-human multi-talker speech recognition: A graphical modeling approach," Comput. Speech Lang., pp. 45–66, 2010.
  • [15] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," arXiv preprint arXiv:1207.0580, 2012.
  • [16] Z. Jin and D. Wang, "A supervised learning approach to monaural segregation of reverberant speech," IEEE Trans. Audio, Speech, Lang. Process., vol. 17, no. 4, pp. 625–638, May 2009.
  • [17] G. Kim, Y. Lu, Y. Hu, and P. Loizou, "An algorithm that improves speech intelligibility in noise for normal-hearing listeners," J. Acoust. Soc. Amer., pp. 1486–1494, 2009.
  • [18] U. Kjems, J. Boldt, M. Pedersen, T. Lunner, and D. Wang, "Role of mask pattern in intelligibility of ideal binary-masked noisy speech," J. Acoust. Soc. Amer., vol. 126, pp. 1415–1426, 2009.
  • [19] N. Li and P. Loizou, "Factors influencing intelligibility of ideal binary-masked speech: Implications for noise reduction," J. Acoust. Soc. Amer., vol. 123, no. 3, pp. 1673–1682, 2008.
  • [20] Y. Li and D. Wang, "On the optimality of ideal binary time-frequency masks," Speech Commun., pp. 230–239, 2009.
  • [21] P. C. Loizou, Speech Enhancement: Theory and Practice. Boca Raton, FL, USA: CRC, 2007.
  • [22] N. Mohammadiha, P. Smaragdis, and A. Leijon, "Supervised and unsupervised speech enhancement approaches using nonnegative matrix factorization," IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 10, pp. 2140–2151, Oct. 2013.
  • [23] A. Narayanan and D. Wang, "Ideal ratio mask estimation using deep neural networks for robust speech recognition," in Proc. ICASSP, 2013, pp. 7092–7096.
  • [24] A. Narayanan and D. Wang, "The role of binary mask patterns in automatic speech recognition in background noise," J. Acoust. Soc. Amer., pp. 3083–3093, 2013.
  • [25] R. Plomp, The Intelligent Ear: On the Nature of Sound Perception. Mahwah, NJ, USA: Lawrence Erlbaum Associates, 2002.
  • [26] A. M. Reddy and B. Raj, "Soft mask methods for single-channel speaker separation," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 6, pp. 1766–1776, Aug. 2007.
  • [27] A. Rix, J. Beerends, M. Hollier, and A. Hekstra, "Perceptual evaluation of speech quality (PESQ): A new method for speech quality assessment of telephone networks and codecs," in Proc. ICASSP, 2001, pp. 749–752.
  • [28] S. Srinivasan, N. Roman, and D. Wang, "Binary and ratio time-frequency masks for robust speech recognition," Speech Commun., vol. 48, no. 11, pp. 1486–1501, 2006.
  • [29] C. Taal, R. Hendriks, R. Heusdens, and J. Jensen, "An algorithm for intelligibility prediction of time-frequency weighted noisy speech," IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 7, pp. 2125–2136, Sep. 2011.
  • [30] A. Varga and H. Steeneken, "Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems," Speech Commun., vol. 12, pp. 247–251, 1993.
  • [31] T. Virtanen, J. Gemmeke, and B. Raj, "Active-set Newton algorithm for overcomplete non-negative representations of audio," IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 11, pp. 2277–2289, Nov. 2013.
  • [32] D. Wang, "On ideal binary mask as the computational goal of auditory scene analysis," in Speech Separation by Humans and Machines, P. Divenyi, Ed. Norwell, MA, USA: Kluwer, 2005, pp. 181–197.
  • [33] D. Wang, U. Kjems, M. Pedersen, J. Boldt, and T. Lunner, "Speech intelligibility in background noise with ideal binary time-frequency masking," J. Acoust. Soc. Amer., vol. 125, pp. 2336–2347, 2009.
  • [34] Y. Wang, K. Han, and D. Wang, "Exploring monaural features for classification-based speech segregation," IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 2, pp. 270–279, Feb. 2013.
  • [35] Y. Wang and D. Wang, "Cocktail party processing via structured prediction," in Proc. NIPS, 2012, pp. 224–232.
  • [36] Y. Wang and D. Wang, "Towards scaling up classification-based speech separation," IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 7, pp. 1381–1390, Jul. 2013.
  • [37] Y. Wang and D. Wang, "A structure-preserving training target for supervised speech separation," in Proc. ICASSP, 2014, pp. 6127–6131.
  • [38] J. Woodruff, "Integrating monaural and binaural cues for sound localization and segregation in reverberant environments," Ph.D. dissertation, The Ohio State Univ., Columbus, OH, USA, 2012.
  • [39] Y. Xu, J. Du, L. Dai, and C. Lee, "An experimental study on speech enhancement based on deep neural networks," IEEE Signal Process. Lett., vol. 21, no. 1, pp. 66–68, Jan. 2014.

Author
Yuxuan Wang received his B.E. degree in network engineering from Nanjing University of Posts and Telecommunications, Nanjing, China, in 2009. He is currently pursuing his Ph.D. degree at The Ohio State University. He is interested in speech separation, robust automatic speech recognition, and machine learning.