Latent Sequence Decompositions

    ICLR 2017. arXiv:1610.03035.

    Keywords:
    Connectionist Temporal Classification, Bidirectional LSTM, Word Error Rate, Wall Street Journal, recurrent neural networks

    Abstract:

    We present the Latent Sequence Decompositions (LSD) framework. LSD decomposes sequences with variable-length output units as a function of both the input sequence and the output sequence. We present a training algorithm which samples valid extensions and an approximate decoding algorithm. We experiment with the Wall Street Journal speech recognition task: the LSD model achieves 12.9% Word Error Rate compared to a character baseline of 14.8%, and 9.6% when a deep convolutional network is added to the encoder.

    Introduction
    Highlights
    • Sequence-to-sequence models (Sutskever et al, 2014; Cho et al, 2014) with attention (Bahdanau et al, 2015) have been successfully applied to many applications including machine translation (Luong et al, 2015; Jean et al, 2015), parsing (Vinyals et al, 2015a), image captioning (Vinyals et al, 2015b; Xu et al, 2015) and Automatic Speech Recognition (ASR) (Chan et al, 2016; Bahdanau et al, 2016a)
    • We present the Latent Sequence Decompositions (LSD) framework
    • We found very minor differences in Word Error Rate based on the vocabulary size; for our n = {2, 3} word piece experiments we used a vocabulary size of 256, while our n = {4, 5} word piece experiments used a vocabulary size of 512
    • We find the Latent Sequence Decompositions n = 4 word piece vocabulary model to perform best at 12.88% Word Error Rate, a 12.7% relative improvement over the baseline character model (a toy sketch of such word-piece decompositions follows this list)
    • Using a deep convolutional neural network on the encoder with Latent Sequence Decompositions, we achieve 9.6% Word Error Rate
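    The decompositions the framework searches over can be made concrete with a small, self-contained sketch (plain Python, not code from the paper): given a word-piece vocabulary that contains single characters plus some longer pieces, every way of segmenting the target string into pieces of at most n characters is one valid decomposition, and LSD treats the choice among them as latent. The vocabulary and target below are illustrative only.

      def decompositions(target, vocab, max_len=4):
          """Enumerate every segmentation of `target` into word pieces.

          A word piece is any substring of length <= max_len that appears in
          `vocab`; each full segmentation is one valid decomposition of the
          target sequence in the LSD sense."""
          if not target:
              return [[]]
          results = []
          for k in range(1, min(max_len, len(target)) + 1):
              piece, rest = target[:k], target[k:]
              if piece in vocab:
                  for tail in decompositions(rest, vocab, max_len):
                      results.append([piece] + tail)
          return results

      # Toy vocabulary: all single characters plus a few longer pieces.
      vocab = {"c", "a", "t", "s", " ", "ca", "at", "sat"}
      for d in decompositions("cat sat", vocab):
          print(d)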
    Methods
    • The authors experimented with the Wall Street Journal (WSJ) ASR task. The authors used the standard configuration of train si284 dataset for training, dev93 for validation and eval92 for test evaluation.
    • The EncodeRNN function is a 3 layer BLSTM with 256 LSTM units per-direction and a 4 = 2^2 time factor reduction.
    • The DecodeRNN is a 1 layer LSTM with 256 LSTM units.
    • The authors used Adam with the default hyperparameters described in Kingma & Ba (2015) and decayed the learning rate from 1e-3 to 1e-4 (a shape-level sketch of this setup follows the list).
    • The authors monitor the dev93 Word Error Rate (WER) until convergence and report the corresponding eval92 WER.
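    A shape-level sketch of the configuration listed above is given below. It is written in PyTorch purely for illustration (the authors cite TensorFlow in the reference list); the placement of the two time reductions inside the encoder and the 80-dimensional input features are assumptions, and attention is abstracted into a generic per-step context input on the decoder.

      import torch
      import torch.nn as nn

      class PyramidalEncoder(nn.Module):
          """3-layer BLSTM with 256 units per direction; each of the two upper
          layers concatenates adjacent frames first, giving a 4 = 2^2 overall
          time-factor reduction (the reduction placement is an assumption)."""
          def __init__(self, input_dim=80, hidden=256):
              super().__init__()
              dims = [input_dim, 4 * hidden, 4 * hidden]   # 1024 after frame concat
              self.layers = nn.ModuleList(
                  nn.LSTM(d, hidden, batch_first=True, bidirectional=True)
                  for d in dims)

          def forward(self, x):                            # x: (batch, time, input_dim)
              for i, lstm in enumerate(self.layers):
                  if i > 0:                                # halve the time axis
                      b, t, d = x.shape
                      x = x[:, :t - t % 2].reshape(b, t // 2, 2 * d)
                  x, _ = lstm(x)
              return x                                     # (batch, time // 4, 512)

      class Decoder(nn.Module):
          """1-layer LSTM with 256 units over word-piece tokens; vocab_size=512
          matches the n = {4, 5} word-piece experiments mentioned above."""
          def __init__(self, vocab_size=512, hidden=256, context_dim=512):
              super().__init__()
              self.embed = nn.Embedding(vocab_size, hidden)
              self.rnn = nn.LSTM(hidden + context_dim, hidden, batch_first=True)
              self.out = nn.Linear(hidden, vocab_size)

          def forward(self, tokens, context, state=None):
              x = torch.cat([self.embed(tokens), context], dim=-1)
              h, state = self.rnn(x, state)
              return self.out(h), state

      # Adam with default hyperparameters; the learning rate is decayed
      # from 1e-3 to 1e-4 over training (schedule details not specified here).
      encoder, decoder = PyramidalEncoder(), Decoder()
      params = list(encoder.parameters()) + list(decoder.parameters())
      optimizer = torch.optim.Adam(params, lr=1e-3)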
    Results
    • The authors' LSD model achieves 12.9% WER compared to a character baseline of 14.8% WER.
    • When combined with a convolutional network on the encoder, the authors achieve 9.6% WER.
    • The baseline model is the unigram or character model and achieves 14.76% WER.
    • Using a deep convolutional neural network on the encoder with LSD, the authors achieve 9.6% WER (the WER metric itself is sketched below)
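    All of the numbers above are Word Error Rates: the word-level edit distance between the hypothesis and the reference transcript, divided by the number of reference words. A minimal implementation of the metric (independent of the paper's evaluation scripts) looks like this:

      def word_error_rate(reference, hypothesis):
          """WER = (substitutions + insertions + deletions) / #reference words,
          computed with the standard Levenshtein dynamic program over words."""
          ref, hyp = reference.split(), hypothesis.split()
          d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
          for i in range(len(ref) + 1):
              d[i][0] = i
          for j in range(len(hyp) + 1):
              d[0][j] = j
          for i in range(1, len(ref) + 1):
              for j in range(1, len(hyp) + 1):
                  cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                  d[i][j] = min(d[i - 1][j] + 1,           # deletion
                                d[i][j - 1] + 1,           # insertion
                                d[i - 1][j - 1] + cost)    # substitution / match
          return d[len(ref)][len(hyp)] / len(ref)

      print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))  # ~0.167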
    Conclusion
    • The authors presented the Latent Sequence Decompositions (LSD) framework.
    • LSD allows them to learn decompositions of sequences that are a function of both the input and output sequence.
    • The authors presented a biased training algorithm based on sampling valid extensions with an ε-greedy strategy (sketched after this list), and an approximate decoding algorithm.
    • On the Wall Street Journal speech recognition task, the sequence-to-sequence character model baseline achieves 14.8% WER while the LSD model achieves 12.9%.
    • Using a deep convolutional neural network on the encoder with LSD, the authors achieve 9.6% WER
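    The ε-greedy sampling of valid extensions can be sketched as follows (plain Python). This is only an illustration of the idea described above, not the paper's training procedure: `model_probs` stands in for the model's next-word-piece distribution given what has been emitted so far, and the toy vocabulary includes all single characters so that a valid extension always exists.

      import random

      def sample_decomposition(target, vocab, model_probs, epsilon=0.1, max_len=4):
          """Decompose `target` left to right: with probability epsilon take a
          uniformly random valid extension (a vocabulary piece that is a prefix
          of the remaining target), otherwise take the valid extension the
          model currently scores highest."""
          pieces, rest = [], target
          while rest:
              valid = [rest[:k] for k in range(1, min(max_len, len(rest)) + 1)
                       if rest[:k] in vocab]
              if random.random() < epsilon:
                  choice = random.choice(valid)
              else:
                  probs = model_probs(pieces)              # placeholder callback
                  choice = max(valid, key=lambda w: probs.get(w, 0.0))
              pieces.append(choice)
              rest = rest[len(choice):]
          return pieces

      # Toy usage with an uninformative "model": every valid extension ties at
      # probability 0, so the greedy branch just takes the first (shortest) piece.
      vocab = {"c", "a", "t", "ca", "at", "cat"}
      print(sample_decomposition("cat", vocab, lambda prefix: {}))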
    Summary
    • Introduction:

      Sequence-to-sequence models (Sutskever et al, 2014; Cho et al, 2014) with attention (Bahdanau et al, 2015) have been successfully applied to many applications including machine translation (Luong et al, 2015; Jean et al, 2015), parsing (Vinyals et al, 2015a), image captioning (Vinyals et al, 2015b; Xu et al, 2015) and Automatic Speech Recognition (ASR) (Chan et al, 2016; Bahdanau et al, 2016a).
    • The output representation is usually a fixed sequence of words (Sutskever et al, 2014; Cho et al, 2014), phonemes (Chorowski et al, 2015), characters (Chan et al, 2016; Bahdanau et al, 2016a) or even a mixture of characters and words (Luong & Manning, 2016).
    • This may be acceptable for problems such as translations, but inappropriate for tasks such as speech recognition, where segmentation should be informed by the characteristics of the inputs, such as audio
    Tables
    • Table1: Wall Street Journal test eval92 Word Error Rate (WER) varying the n sized word piece vocabulary without any dictionary or language model. We compare Latent Sequence Decompositions (LSD) versus the Maximum Extension (MaxExt) decomposition. The LSD models all learn better decompositions compared to the baseline character model, while the MaxExt decomposition appears to be sub-optimal
    • Table2: Wall Street Journal test eval92 Word Error Rate (WER) results across Connectionist Temporal Classification (CTC) and Sequence-to-sequence (seq2seq) models. The Latent Sequence Decomposition (LSD) models use an n = 4 word piece vocabulary (LSD4). The Convolutional Neural Network (CNN) model uses deep residual connections, batch normalization and convolutions. The best end-to-end model is seq2seq + LSD + CNN at 9.6% WER
    • Table3: Top hypothesis comparison between the seq2seq character model, the LSD word piece model and the MaxExt word piece model
    Related work
    • Singh et al (2002); McGraw et al (2013); Lu et al (2013) built probabilistic pronunciation models for Hidden Markov Model (HMM) based systems. However, such models are still constrained to the conditional independence and Markovian assumptions of HMM-based systems.

      Connectionist Temporal Classification (CTC) (Graves et al, 2006; Graves & Jaitly, 2014) based models assume conditional independence, and can rely on dynamic programming for exact inference. Similarly, Ling et al (2016) use latent codes to generate text, and also assume conditional independence and leverage dynamic programming for exact maximum likelihood gradients. Such models cannot learn the output language if the language distribution is multimodal. Our seq2seq models make no such Markovian assumptions and can learn multimodal output distributions. Collobert et al (2016) and Zweig et al (2016) developed extensions of CTC where they used some word pieces. However, the word pieces are only used in repeated characters and the decompositions are fixed.
    Reference
    • Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/. Software available from tensorflow.org.
    • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. In International Conference on Learning Representations, 2015.
    • Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, and Yoshua Bengio. End-to-end Attention-based Large Vocabulary Speech Recognition. In IEEE International Conference on Acoustics, Speech, and Signal Processing, 2016a.
    • Dzmitry Bahdanau, Dmitriy Serdyuk, Philemon Brakel, Nan Rosemary Ke, Jan Chorowski, Aaron Courville, and Yoshua Bengio. Task Loss Estimation for Sequence Prediction. In International Conference on Learning Representations Workshop, 2016b.
    • William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. Listen, Attend and Spell: A Neural Network for Large Vocabulary Conversational Speech Recognition. In IEEE International Conference on Acoustics, Speech, and Signal Processing, 2016.
    • Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In Conference on Empirical Methods in Natural Language Processing, 2014.
    • Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. Attention-Based Models for Speech Recognition. In Neural Information Processing Systems, 2015.
    • Ronan Collobert, Christian Puhrsch, and Gabriel Synnaeve. Wav2Letter: an End-to-End ConvNet-based Speech Recognition System. arXiv:1609.03193, 2016.
    • Alex Graves. Practical Variational Inference for Neural Networks. In Neural Information Processing Systems, 2011.
    • Alex Graves and Navdeep Jaitly. Towards End-to-End Speech Recognition with Recurrent Neural Networks. In International Conference on Machine Learning, 2014.
    • Alex Graves, Santiago Fernandez, Faustino Gomez, and Jurgen Schmidhuber. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. In International Conference on Machine Learning, 2006.
    • Alex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed. Hybrid Speech Recognition with Bidirectional LSTM. In Automatic Speech Recognition and Understanding Workshop, 2013.
    • Awni Hannun, Andrew Maas, Daniel Jurafsky, and Andrew Ng. First-Pass Large Vocabulary Continuous Speech Recognition using Bi-Directional Recurrent DNNs. arXiv:1408.2873, 2014.
    • Salah Hihi and Yoshua Bengio. Hierarchical Recurrent Neural Networks for Long-Term Dependencies. In Neural Information Processing Systems, 1996.
    • Sepp Hochreiter and Jurgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8): 1735–1780, November 1997.
    • Sebastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. On Using Very Large Target Vocabulary for Neural Machine Translation. In Association for Computational Linguistics, 2015.
    • Diederik Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations, 2015.
    • Jan Koutnik, Klaus Greff, Faustino Gomez, and Jurgen Schmidhuber. A Clockwork RNN. In International Conference on Machine Learning, 2014.
    • Wang Ling, Edward Grefenstette, Karl Moritz Hermann, Tomas Kocisky, Andrew Senior, Fumin Wang, and Phil Blunsom. Latent Predictor Networks for Code Generation. In Association for Computational Linguistics, 2016.
    • Liang Lu, Arnab Ghoshal, and Steve Renals. Acoustic data-driven pronunciation lexicon for large vocabulary speech recognition. In Automatic Speech Recognition and Understanding Workshop, 2013.
    • Minh-Thang Luong and Christopher Manning. Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character Models. In Association for Computational Linguistics, 2016.
    • Minh-Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. Addressing the Rare Word Problem in Neural Machine Translation. In Association for Computational Linguistics, 2015.
    • Ian McGraw, Ibrahim Badr, and James Glass. Learning Lexicons From Speech Using a Pronunciation Mixture Model. IEEE Transactions on Audio, Speech, and Language Processing, 21(2), 2013.
    • Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel Vesely. The Kaldi Speech Recognition Toolkit. In Automatic Speech Recognition and Understanding Workshop, 2011.
    • Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence Level Training with Recurrent Neural Networks. In International Conference on Learning Representations, 2016.
    • Mike Schuster and Kaisuke Nakajima. Japanese and Korean Voice Search. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2012.
    • Mike Schuster and Kuldip Paliwal. Bidirectional Recurrent Neural Networks. IEEE Transactions on Signal Processing, 45(11), 1997.
    • Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural Machine Translation of Rare Words with Subword Units. In Association for Computational Linguistics, 2016.
    • Rita Singh, Bhiksha Raj, and Richard Stern. Automatic generation of subword units for speech recognition systems. IEEE Transactions on Speech and Audio Processing, 10(2), 2002.
    • Ilya Sutskever, Oriol Vinyals, and Quoc Le. Sequence to Sequence Learning with Neural Networks. In Neural Information Processing Systems, 2014.
    • Richard Sutton and Andrew Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
    • Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey E. Hinton. Grammar as a foreign language. In Neural Information Processing Systems, 2015a.
    • Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and Tell: A Neural Image Caption Generator. In IEEE Conference on Computer Vision and Pattern Recognition, 2015b.