N-gram Weighting: Reducing Training Data Mismatch in Cross-Domain Language Model Estimation
EMNLP '08: Proceedings of the Conference on Empirical Methods in Natural Language Processing (2008)
Abstract
In domains with insufficient matched training data, language models are often constructed by interpolating component models trained from partially matched corpora. Since the n-grams from such corpora may not be of equal relevance to the target domain, we propose an n-gram weighting technique to adjust the component n-gram probabilities based on features derived from readily available segmentation and metadata information for each corpus. Using a log-linear combination of such features, the resulting model achieves up to a 1.2% absolute word error rate reduction over a linearly interpolated baseline language model on a lecture transcription task.
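The abstract describes rescaling each component's contribution per n-gram via a log-linear function of features before interpolation. Below is a minimal sketch of that idea, assuming a per-n-gram weight of the form exp(theta_i · f_i(h, w)) that multiplies each component's interpolation weight before renormalization; the function names, feature definitions, and parameter values here are hypothetical illustrations, not the paper's exact formulation.

```python
import math

def ngram_weighted_prob(h, w, components):
    """Sketch: P(w|h) = sum_i lam_i * beta_i * P_i(w|h) / sum_i lam_i * beta_i,
    where beta_i = exp(theta_i . f_i(h, w)) rescales component i per n-gram."""
    num = den = 0.0
    for comp in components:
        # Log-linear per-n-gram weight from this component's features.
        beta = math.exp(sum(t * f for t, f in
                            zip(comp["theta"], comp["features"](h, w))))
        weight = comp["lambda"] * beta
        num += weight * comp["prob"](h, w)
        den += weight
    return num / den if den > 0.0 else 0.0

# Toy usage: two components with constant bigram probabilities and a single
# binary feature (e.g. "n-gram attested in matched metadata"); all values
# are made up for illustration.
components = [
    {"lambda": 0.6, "theta": [0.5],
     "features": lambda h, w: [1.0],
     "prob": lambda h, w: 0.02},
    {"lambda": 0.4, "theta": [-0.3],
     "features": lambda h, w: [0.0],
     "prob": lambda h, w: 0.05},
]
print(ngram_weighted_prob(("the",), "lecture", components))
```

With theta_i = 0 for every component, beta_i = 1 and the formula reduces to ordinary linear interpolation, which is the baseline the abstract compares against.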
Keywords
component n-gram, interpolating component model, language model, linearly interpolated baseline language, n-gram weighting technique, resulting model, absolute word error rate, available segmentation, equal relevance, lecture transcription task, n-gram weighting, cross-domain language model estimation, training data mismatch