Automatic and manual clustering for large vocabulary speech recognition: a comparative study

EUROSPEECH(1999)

Abstract
This article describes a comparative study of language models whose evaluation protocol was set by AUPELF-UREF. We pay particular attention to the comparison between two methods of clustering words, which are needed to design the corresponding language models. The first classification follows a linguistic and theoretical method; the second is based on an optimization method. Both methods are evaluated through the Shannon game. The vocabulary contains 20,000 words, the training corpus consists of two years of the Le Monde newspaper (42M words), and the test corpus (400,000 words) is extracted from 6 years of Le Monde Diplomatique. First evaluations show an improvement of 13% in words recognized within the first five ranks and a decrease of 25% in perplexity.
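The two figures reported in the abstract (words found within the first five ranks, and perplexity) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' setup: the abstract does not specify the model, so the sketch uses a class bigram with add-one smoothing, a tiny hand-written `word2class` map standing in for a manual clustering, and hypothetical function names `train_class_bigram`, `perplexity`, and `top_k_hit_rate`.

```python
import math
from collections import Counter

def train_class_bigram(tokens, word2class):
    """Assumed model: P(w | prev) = P(c(w) | c(prev)) * P(w | c(w)), add-one smoothed."""
    classes = sorted(set(word2class.values()))
    vocab = sorted(word2class)
    class_seq = [word2class[w] for w in tokens]
    trans = Counter(zip(class_seq, class_seq[1:]))   # class-to-class transition counts
    prev_count = Counter(class_seq[:-1])             # how often each class is a predecessor
    word_count = Counter(tokens)
    class_count = Counter(class_seq)
    class_size = Counter(word2class.values())        # number of vocabulary words per class

    def prob(word, prev):
        c, cp = word2class[word], word2class[prev]
        p_class = (trans[(cp, c)] + 1) / (prev_count[cp] + len(classes))
        p_word = (word_count[word] + 1) / (class_count[c] + class_size[c])
        return p_class * p_word

    return prob, vocab

def perplexity(prob, tokens):
    """2 ** (average negative log2 probability) over the predicted positions."""
    log_sum = sum(math.log2(prob(w, prev)) for prev, w in zip(tokens, tokens[1:]))
    return 2 ** (-log_sum / (len(tokens) - 1))

def top_k_hit_rate(prob, vocab, tokens, k=5):
    """Shannon-game style score: fraction of words ranked among the k best guesses."""
    hits = 0
    for prev, w in zip(tokens, tokens[1:]):
        ranked = sorted(vocab, key=lambda v: prob(v, prev), reverse=True)
        hits += w in ranked[:k]
    return hits / (len(tokens) - 1)

# Toy data (hypothetical); a hand-written class map plays the role of manual clustering.
word2class = {"le": "DET", "la": "DET", "chat": "NOUN",
              "maison": "NOUN", "dort": "VERB", "mange": "VERB"}
train = "le chat dort la maison dort le chat mange".split()
test = "la maison mange le chat dort".split()

prob, vocab = train_class_bigram(train, word2class)
print("perplexity:", perplexity(prob, test))
print("top-2 hit rate:", top_k_hit_rate(prob, vocab, test, k=2))
```

Swapping in a different `word2class` map (for example, one produced by an automatic clustering procedure) and rerunning both scores on the same test text is the kind of comparison the abstract describes, here at toy scale.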
Keywords
comparative study, speech recognition, classification