A General Framework to Weight Heterogeneous Parallel Data for Model Adaptation in Statistical MT.

MT Summit, October (2020)

Abstract
The standard procedure for training the translation model of a phrase-based SMT system is to concatenate all available parallel data, perform word alignment, extract phrase pairs, and calculate translation probabilities by simple relative frequency. However, in many practical applications the parallel data is quite inhomogeneous with respect to factors such as data source, alignment quality, and appropriateness to the task. We propose a general framework that takes these factors into account when the phrase table is calculated, e.g., by better distributing the probability mass of the individual phrase pairs. No additional feature functions are needed. We report results on two well-known tasks, the IWSLT’11 and WMT’11 evaluations, in both cases translating from English to French. We give detailed results for different functions to weight the bitexts. Our best systems improve a strong baseline by up to one BLEU point without any impact on the computational complexity during training or decoding.
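To make the idea of weighting bitexts concrete, the following is a minimal sketch of a weighted relative-frequency estimate. It is not the paper's actual formulation, which the abstract does not spell out. It assumes one scalar weight per corpus and computes p(t|s) as the weighted count of the pair divided by the weighted count of the source phrase. The function name `weighted_phrase_table`, the corpus names, and the toy phrase pairs are all hypothetical.

```python
from collections import defaultdict


def weighted_phrase_table(corpora, weights):
    """Estimate p(tgt | src) from several bitexts with per-corpus weights.

    corpora: dict mapping corpus name -> list of extracted (src, tgt) phrase pairs
    weights: dict mapping corpus name -> positive float (assumed, one per corpus)

    With all weights equal, this reduces to plain relative frequency
    over the concatenated data.
    """
    pair_mass = defaultdict(float)  # weighted count of each (src, tgt) pair
    src_mass = defaultdict(float)   # weighted count of each source phrase

    for name, pairs in corpora.items():
        w = weights[name]
        if w <= 0:
            raise ValueError(f"weight for {name!r} must be positive")
        for src, tgt in pairs:
            pair_mass[(src, tgt)] += w
            src_mass[src] += w

    # Normalize per source phrase so that the probabilities sum to one.
    return {(s, t): m / src_mass[s] for (s, t), m in pair_mass.items()}


# Hypothetical toy data: two corpora disagree on how to translate "house".
corpora = {
    "in_domain": [("house", "maison"), ("house", "maison")],
    "out_of_domain": [("house", "domicile"), ("house", "domicile")],
}

uniform = weighted_phrase_table(corpora, {"in_domain": 1.0, "out_of_domain": 1.0})
adapted = weighted_phrase_table(corpora, {"in_domain": 3.0, "out_of_domain": 1.0})
```

With uniform weights both translations receive probability 0.5. Giving the in-domain corpus three times the weight shifts the mass to 0.75 for "maison" and 0.25 for "domicile". The table has the same entries either way, which matches the abstract's point that no additional feature functions or decoding cost are required.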