Listen, attend and spell: A neural network for large vocabulary conversational speech recognition

    ICASSP, pp. 4960-4964, 2016.

    Keywords:
    end model, probability distribution, pyramidal recurrent, character sequence, recurrent network
    Authors:
    William Chan, Navdeep Jaitly, Quoc Le, Oriol Vinyals

    Abstract:

    We present Listen, Attend and Spell (LAS), a neural speech recognizer that transcribes speech utterances directly to characters without pronunciation models, HMMs or other components of traditional speech recognizers. In LAS, the neural network architecture subsumes the acoustic, pronunciation and language models making it not only an end...

    Introduction
    • The authors present Listen, Attend and Spell (LAS), a neural speech recognizer that transcribes speech utterances directly to characters without pronunciation models, HMMs or other components of traditional speech recognizers.
    • In LAS, the neural network architecture subsumes the acoustic, pronunciation and language models, making it not only an end-to-end trained system but also an end-to-end model.
    Highlights
    • The speller is an attention-based recurrent network decoder that emits each character conditioned on all previous characters and the entire acoustic sequence.
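    The attention step in the speller can be illustrated with a small sketch. This is not the authors' implementation: it assumes plain dot-product scoring between a decoder state and the listener features, and the names softmax, attend, features and state are hypothetical. It shows how one decoding step turns the whole acoustic sequence into a single weighted context vector.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, listener_features):
    """Return (weights, context) for one decoding step.

    Assumption: dot-product scoring. The scoring function used in the
    paper is not given in this summary.
    """
    # One score per listener time step.
    scores = [sum(d * h for d, h in zip(decoder_state, feat))
              for feat in listener_features]
    weights = softmax(scores)
    dim = len(listener_features[0])
    # Context vector: attention-weighted average of the listener features.
    context = [sum(w * feat[i] for w, feat in zip(weights, listener_features))
               for i in range(dim)]
    return weights, context


# Toy example: three listener time steps with two-dimensional features.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
state = [2.0, 0.0]
weights, context = attend(state, features)
print(weights, context)
```

    In a full decoder, the context vector would be combined with the decoder state to predict the next character, and the step would repeat until an end-of-sequence symbol is emitted.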
    Results
    • In contrast to DNN-HMM, CTC and most other models, LAS makes no independence assumptions about the probability distribution of the output character sequences given the acoustic sequence.
    • The authors' system has two components: a listener and a speller.
    • The listener is a pyramidal recurrent network encoder that accepts filter bank spectra as inputs.
    • The speller is an attention-based recurrent network decoder that emits each character conditioned on all previous characters and the entire acoustic sequence.
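    The "pyramidal" shape of the listener can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: it assumes each pyramid layer concatenates adjacent pairs of frames, halving the number of time steps, and it leaves out the recurrent layers entirely. The function names and the default of three layers are hypothetical choices for the example.

```python
def pyramid_reduce(frames):
    """Concatenate each pair of adjacent frames, halving the time axis.

    A trailing unpaired frame is dropped (an assumption; padding is
    another reasonable choice).
    """
    return [frames[i] + frames[i + 1] for i in range(0, len(frames) - 1, 2)]


def listen(frames, num_layers=3):
    """Apply num_layers reductions in sequence.

    The recurrent layers that would sit between reductions are omitted
    from this sketch; num_layers=3 is a hypothetical default.
    """
    for _ in range(num_layers):
        frames = pyramid_reduce(frames)
    return frames


# Toy input: 16 time steps of two-dimensional filter bank features.
inputs = [[float(t), float(t) + 0.5] for t in range(16)]
encoded = listen(inputs, num_layers=3)
# Time steps shrink 16 -> 8 -> 4 -> 2; feature width grows 2 -> 4 -> 8 -> 16.
print(len(encoded), len(encoded[0]))
```

    Shortening the sequence this way gives the attention mechanism in the speller far fewer time steps to search over.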
    Conclusion
    • On a Google voice search task, LAS achieves a word error rate (WER) of 14.1% without a dictionary or an external language model, and 10.3% with language model rescoring over the top 32 beams.
    • The state-of-the-art CLDNN-HMM model achieves a WER of 8.0% on the same set.
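    The two ideas behind these numbers, word error rate and rescoring a beam with a language model, can be sketched in a few lines. This is an illustration under assumptions, not the paper's scoring rule: rescore uses a simple weighted sum of log-probabilities, lm_weight is a hypothetical parameter, and the hypotheses and scores in nbest are invented for the example.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length.

    Assumes a non-empty reference.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the ref prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def rescore(nbest, lm_weight):
    """Pick the best text from (text, model_logp, lm_logp) triples.

    Assumption: a plain log-linear combination of the two scores.
    """
    return max(nbest, key=lambda item: item[1] + lm_weight * item[2])[0]


# Invented beam of three hypotheses with made-up log-probabilities.
nbest = [
    ("call mum at home", -4.0, -9.0),
    ("call mom at home", -4.2, -6.0),
    ("cold mom at home", -5.0, -8.5),
]
best_without_lm = rescore(nbest, lm_weight=0.0)
best_with_lm = rescore(nbest, lm_weight=0.5)
wer_without_lm = word_error_rate("call mom at home", best_without_lm)
wer_with_lm = word_error_rate("call mom at home", best_with_lm)
print(best_without_lm, wer_without_lm)
print(best_with_lm, wer_with_lm)
```

    In this toy beam the language model score promotes the second hypothesis, which removes the single substitution error. That is the kind of effect rescoring over the top beams is meant to have.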