Similarity-Based Estimation of Word Cooccurrence Probabilities.

Meeting of the Association for Computational Linguistics (1994)

Cited 212 | Views 43
Abstract
In many applications of natural language processing it is necessary to determine the likelihood of a given word combination. For example, a speech recognizer may need to determine which of the two word combinations "eat a peach" and "eat a beach" is more likely. Statistical NLP methods determine the likelihood of a word combination according to its frequency in a training corpus. However, the nature of language is such that many word combinations are infrequent and do not occur in a given corpus. In this work we propose a method for estimating the probability of such previously unseen word combinations using available information on "most similar" words. We describe a probabilistic word association model based on distributional word similarity, and apply it to improving probability estimates for unseen word bigrams in a variant of Katz's back-off model. The similarity-based method yields a 20% perplexity improvement in the prediction of unseen bigrams and statistically significant reductions in speech-recognition error.
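The central idea described in the abstract can be sketched roughly as follows: for an unseen bigram (w1, w2), the conditional probability P(w2 | w1) is estimated as a similarity-weighted average of P(w2 | w1') over words w1' that are distributionally most similar to w1. The sketch below is a minimal illustration, not the paper's exact formulation; the names cond_probs, neighbors, and beta, and the KL-divergence-based weighting, are assumptions made for the example.

import math


def similarity_weight(w1, w1_prime, cond_probs, beta=4.0):
    """Illustrative similarity weight: exp(-beta * D), where D is the KL
    divergence between the next-word distributions of w1 and w1_prime.
    (The exact weighting scheme and the value of beta are assumptions.)"""
    p = cond_probs[w1]
    q = cond_probs[w1_prime]
    d = 0.0
    for w2, p_w2 in p.items():
        q_w2 = q.get(w2, 1e-10)  # small floor to avoid log(0); an assumption
        d += p_w2 * math.log(p_w2 / q_w2)
    return math.exp(-beta * d)


def similarity_based_prob(w1, w2, cond_probs, neighbors):
    """Estimate P(w2 | w1) for an unseen bigram as a similarity-weighted
    average of P(w2 | w1') over w1's most similar words w1'."""
    weights = {w1p: similarity_weight(w1, w1p, cond_probs)
               for w1p in neighbors[w1]}
    total = sum(weights.values())
    return sum((wt / total) * cond_probs[w1p].get(w2, 0.0)
               for w1p, wt in weights.items())

Here cond_probs is assumed to map each word to a dictionary of conditional next-word probabilities estimated from the training corpus, and neighbors maps each word to its precomputed list of most similar words. In the paper, an estimate of this kind is used for unseen bigrams within a variant of Katz's back-off model rather than backing off directly to unigram frequencies.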
Keywords
probabilistic word association model, statistical NLP method, natural language, unseen word combination, word cooccurrence probability, back-off model, unseen bigrams, unseen word bigrams, similarity-based estimation, word combination, distributional word similarity, probability estimate, speech recognition, statistical significance, natural language processing