On Training Targets for Supervised Speech Separation.
IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 12 (2014): 1849-1858
Formulation of speech separation as a supervised learning problem has shown considerable promise. In its simplest form, a supervised learning algorithm, typically a deep neural network, is trained to learn a mapping from noisy features to a time-frequency representation of the target of interest. Traditionally, the ideal binary mask (IBM)...
- Speech separation, which is the task of separating speech from a noisy mixture, has major applications, such as robust automatic speech recognition (ASR), hearing aid design, and mobile speech communication.
- Monaural speech separation is perhaps most desirable from the application standpoint.
- Manuscript received February 16, 2014; revised May 30, 2014; accepted August 21, 2014.
- Date of publication August 28, 2014; date of current version September 13, 2014.
- The ideal binary mask (IBM) is used as the training target for supervised speech separation
- The compared targets can be categorized into binary masking based (IBM and target binary mask (TBM)), ratio masking based (IRM and FFT-MASK), and spectral envelope based (FFT-MAG and Gammatone filterbank power spectra (GF)-POW) targets
- We found that binary masking leads to slightly worse objective intelligibility results than ratio masking
- An unexpected finding of this study is that the direct prediction of spectral envelopes produces the worst results, as best illustrated by the substantial performance gap between FFT-MAG and FFT-MASK, where the two targets are essentially two alternative views of the same underlying goal, the clean speech magnitude
- Aside from the analysis presented in Section V-B, which points to the issue of nonlinear compression, we believe that masking has several advantages over spectral envelope estimation
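As a minimal NumPy sketch of the masking-based targets grouped above (not the paper's implementation; the variable names and the -5 dB local criterion are illustrative assumptions), the IBM, IRM, and FFT-MASK can each be computed from magnitude spectrograms of the clean speech `S`, the noise `N`, and the mixture `Y`:

```python
import numpy as np

def ibm(S, N, lc_db=-5.0):
    """Ideal binary mask: 1 where local SNR exceeds a criterion (LC), else 0."""
    snr_db = 10.0 * np.log10((S ** 2) / (N ** 2 + 1e-12) + 1e-12)
    return (snr_db > lc_db).astype(float)

def irm(S, N, beta=0.5):
    """Ideal ratio mask: soft gain in [0, 1] from the speech/noise energy ratio."""
    return ((S ** 2) / (S ** 2 + N ** 2 + 1e-12)) ** beta

def fft_mask(S, Y):
    """Spectral magnitude mask: clean-to-mixture magnitude ratio (may exceed 1)."""
    return S / (Y + 1e-12)
```

Note how the IRM and FFT-MASK stay bounded or near-bounded gains applied to the mixture, whereas a spectral-envelope target (FFT-MAG) would predict `S` itself.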
- Since log magnitude is not bounded, the authors use linear output units in the DNNs; they also try percent normalization, which linearly scales the data to the range [0, 1].
- The authors normalize the magnitudes by first performing a log compression followed by percent normalization, and use sigmoidal output units.
- The authors believe log + percent normalization performs better because it preserves spectral details while simultaneously making the target bounded
- The authors use this normalization scheme when predicting spectral magnitude/energy based targets in the remaining experiments
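The two normalization schemes discussed above can be sketched as follows (a hedged NumPy illustration; the function names and the small epsilon are my own, not the paper's code):

```python
import numpy as np

def percent_normalize(x):
    """Linearly rescale to [0, 1], suitable for sigmoidal output units."""
    return (x - x.min()) / (x.max() - x.min() + 1e-12)

def log_percent_normalize(mag):
    """Log-compress magnitudes, then percent-normalize.

    The compression preserves low-energy spectral detail; the rescale
    bounds the target, matching the scheme the authors found to work best.
    """
    return percent_normalize(np.log(mag + 1e-12))
```

Because log spacing is preserved before rescaling, a magnitude range such as [1, 10, 100] maps to evenly spaced values in [0, 1] rather than being crushed toward zero.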
- CONCLUDING REMARKS
- Choosing a suitable training target is critical for supervised learning, as it is directly related to the underlying computational goal.
- Binary masking leads to slightly worse objective intelligibility results than ratio masking, likely because predicting ratio targets is less sensitive to estimation errors than predicting binary targets.
- Direct prediction of spectral envelopes produces the worst results; ideal masks are likely easier to learn than spectral envelopes, as their spectrotemporal patterns are more stable with respect to speaker variations.
- Table1: PERFORMANCE ON FACTORY1 WHEN THE CLEAN MAGNITUDES ARE
- Table2: GENERALIZATION PERFORMANCE ON TWO UNSEEN NOISES AT dB
- Table3: PERFORMANCE COMPARISONS BETWEEN VARIOUS TARGETS AND SYSTEMS ON
- Table4: PERFORMANCE COMPARISONS BETWEEN VARIOUS TARGETS AND SYSTEMS ON 5 DB MIXTURES
- Table5: GENERALIZATION PERFORMANCE ON TWO UNSEEN NOISES AT 0 DB
- Table6: PERFORMANCE COMPARISONS BETWEEN VARIOUS TARGETS AND SYSTEMS ON 0 DB MIXTURES
- Table7: GENERALIZATION PERFORMANCE ON TWO UNSEEN NOISES AT 5 DB
- This work was supported in part by the Air Force Office of Scientific Research (AFOSR) under Grant FA9550-12-1-0130, the National Institute on Deafness and Other Communication Disorders (NIDCD) under Grant R01 DC012048, a Small Business Technology Transfer (STTR) subcontract from Kuzer, and the Ohio Supercomputer Center.
Study subjects and analysis
We use 2000 randomly chosen utterances from the TIMIT training set as our training utterances. The TIMIT core test set, which consists of 192 utterances from unseen speakers of both genders, is used as the test set. We use SSN and four other noises from the NOISEX dataset as our training and test noises.
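The 0 dB and 5 dB test conditions referenced in the tables require mixing each utterance with noise at a target SNR. A common way to do this (an assumed sketch, since the paper does not give mixing code) is to scale the noise so that the speech-to-noise power ratio matches the desired level:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech after scaling it to reach the target SNR in dB."""
    speech_pow = np.mean(speech ** 2)
    noise_pow = np.mean(noise ** 2)
    # Gain that makes speech_pow / (gain^2 * noise_pow) equal 10^(snr_db/10).
    gain = np.sqrt(speech_pow / (noise_pow * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise
```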
- M. Anzalone, L. Calandruccio, K. Doherty, and L. Carney, “Determination of the potential benefit of time-frequency gain manipulation,” Ear Hear., vol. 27, no. 5, pp. 480–492, 2006.
- D. Brungart, P. Chang, B. Simpson, and D. Wang, “Isolating the energetic component of speech-on-speech masking with ideal time-frequency segregation,” J. Acoust. Soc. Amer., vol. 120, pp. 4007–4018, 2006.
- C. Chen and J. Bilmes, “MVA processing of speech features,” IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 1, pp. 257–270, Jan. 2007.
- J. Chen, Y. Wang, and D. Wang, “A feature study for classification-based speech separation at very low signal-to-noise ratio,” in Proc. ICASSP, 2014, pp. 7059–7063.
- J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” J. Mach. Learn. Res., pp. 2121–2159, 2011.
- J. Erkelens, R. Hendriks, R. Heusdens, and J. Jensen, “Minimum mean-square error estimation of discrete Fourier coefficients with generalized gamma priors,” IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 6, pp. 1741–1752, Aug. 2007.
- J. Garofolo, DARPA TIMIT acoustic-phonetic continuous speech corpus. Gaithersburg, MD, USA: Nat. Inst. of Standards Technol., 1993.
- X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier networks,” in Proc. 14th Int. Conf. Artif. Intell. Statist. JMLR W&CP Volume, 2011, vol. 15, pp. 315–323.
- C. Gulcehre and Y. Bengio, “Knowledge matters: Importance of prior information for optimization,” in Proc. Int. Conf. Learn. Representat. (ICLR), 2013.
- K. Han and D. Wang, “A classification based approach to speech segregation,” J. Acoust. Soc. Amer., vol. 132, pp. 3475–3483, 2012.
- K. Han, Y. Wang, and D. Wang, “Learning spectral mapping for speech dereverberation,” in Proc. ICASSP, 2014, pp. 4648–4652.
- E. Healy, S. Yoho, Y. Wang, and D. Wang, “An algorithm to improve speech recognition in noise for hearing-impaired listeners,” J. Acoust. Soc. Amer., pp. 3029–3038, 2013.
- R. Hendriks, R. Heusdens, and J. Jensen, “MMSE based noise PSD tracking with low complexity,” in Proc. ICASSP, 2010, pp. 4266–4269.
- J. R. Hershey, S. J. Rennie, P. A. Olsen, and T. T. Kristjansson, “Super-human multi-talker speech recognition: A graphical modeling approach,” Comput. Speech Lang., pp. 45–66, 2010.
- G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” arXiv preprint arXiv:1207.0580, 2012.
- Z. Jin and D. Wang, “A supervised learning approach to monaural segregation of reverberant speech,” IEEE Trans. Audio, Speech, Lang. Process., vol. 17, no. 4, pp. 625–638, May 2009.
- G. Kim, Y. Lu, Y. Hu, and P. Loizou, “An algorithm that improves speech intelligibility in noise for normal-hearing listeners,” J. Acoust. Soc. Amer., pp. 1486–1494, 2009.
- U. Kjems, J. Boldt, M. Pedersen, T. Lunner, and D. Wang, “Role of mask pattern in intelligibility of ideal binary-masked noisy speech,” J. Acoust. Soc. Amer., vol. 126, pp. 1415–1426, 2009.
- N. Li and P. Loizou, “Factors influencing intelligibility of ideal binary-masked speech: Implications for noise reduction,” J. Acoust. Soc. Amer., vol. 123, no. 3, pp. 1673–1682, 2008.
- Y. Li and D. Wang, “On the optimality of ideal binary time–frequency masks,” Speech Commun., pp. 230–239, 2009.
- P. C. Loizou, Speech Enhancement: Theory and Practice. Boca Raton, FL, USA: CRC, 2007.
- N. Mohammadiha, P. Smaragdis, and A. Leijon, “Supervised and unsupervised speech enhancement approaches using nonnegative matrix factorization,” IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 10, pp. 2140–2151, Oct. 2013.
- A. Narayanan and D. Wang, “Ideal ratio mask estimation using deep neural networks for robust speech recognition,” in Proc. ICASSP, 2013, pp. 7092–7096.
- A. Narayanan and D. Wang, “The role of binary mask patterns in automatic speech recognition in background noise,” J. Acoust. Soc. Amer., pp. 3083–3093, 2013.
- R. Plomp, The Intelligent Ear: On the Nature of Sound Perception. Mahwah, NJ, USA: Lawrence Erlbaum Associates, 2002.
- A. M. Reddy and B. Raj, “Soft mask methods for single-channel speaker separation,” IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 6, pp. 1766–1776, Aug. 2007.
- A. Rix, J. Beerends, M. Hollier, and A. Hekstra, “Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs,” in Proc. ICASSP, 2001, pp. 749–752.
- S. Srinivasan, N. Roman, and D. Wang, “Binary and ratio time-frequency masks for robust speech recognition,” Speech Commun., vol. 48, no. 11, pp. 1486–1501, 2006.
- C. Taal, R. Hendriks, R. Heusdens, and J. Jensen, “An algorithm for intelligibility prediction of time-frequency weighted noisy speech,” IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 7, pp. 2125–2136, Sep. 2011.
- A. Varga and H. Steeneken, “Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems,” Speech Commun., vol. 12, pp. 247–251, 1993.
- T. Virtanen, J. Gemmeke, and B. Raj, “Active-set Newton algorithm for overcomplete non-negative representations of audio,” IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 11, pp. 2277–2289, Nov. 2013.
- D. Wang, “On ideal binary mask as the computational goal of auditory scene analysis,” in Speech Separation by Humans and Machines, P. Divenyi, Ed. Norwell, MA, USA: Kluwer, 2005, pp. 181–197.
- D. Wang, U. Kjems, M. Pedersen, J. Boldt, and T. Lunner, “Speech intelligibility in background noise with ideal binary time-frequency masking,” J. Acoust. Soc. Amer., vol. 125, pp. 2336–2347, 2009.
- Y. Wang, K. Han, and D. Wang, “Exploring monaural features for classification-based speech segregation,” IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 2, pp. 270–279, Feb. 2013.
- Y. Wang and D. Wang, “Cocktail party processing via structured prediction,” in Proc. NIPS, 2012, pp. 224–232.
- Y. Wang and D. Wang, “Towards scaling up classification-based speech separation,” IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 7, pp. 1381–1390, Jul. 2013.
- Y. Wang and D. Wang, “A structure-preserving training target for supervised speech separation,” in Proc. ICASSP, 2014, pp. 6127–6131.
- J. Woodruff, “Integrating monaural and binaural cues for sound localization and segregation in reverberant environments,” Ph.D. dissertation, The Ohio State Univ., Columbus, OH, USA, 2012.
- Y. Xu, J. Du, L. Dai, and C. Lee, “An experimental study on speech enhancement based on deep neural networks,” IEEE Signal Process. Lett., vol. 21, no. 1, pp. 66–68, Jan. 2014.

Yuxuan Wang received his B.E. degree in network engineering from Nanjing University of Posts and Telecommunications, Nanjing, China, in 2009. He is currently pursuing his Ph.D. degree at The Ohio State University. He is interested in speech separation, robust automatic speech recognition, and machine learning.