Sub-word Embedding Auxiliary Encoding in Mongolian-Chinese Neural Machine Translation.

ICSCA (2020)

Abstract
For low-resource Mongolian-Chinese neural machine translation (NMT), common pre-processing methods such as byte pair encoding (BPE) and tokenization are unable to recognize Mongolian special characters, which leads to the loss of complete sentence information. In addition, the translation quality of low-frequency words is undesirable due to data sparsity. In this paper, we first propose a processing method for Mongolian special characters, which transforms them into an explicit form to reduce pre-processing errors. Secondly, drawing on morphological knowledge of Mongolian, we generate sub-word embeddings from a large-scale monolingual corpus to enrich the contextual information in the representations of low-frequency words. The experiments show that 1) Mongolian special character processing can minimize the semantic loss, 2) systems with sub-word embeddings from a large-scale monolingual corpus can effectively capture the semantic information of low-frequency words, and 3) the proposed approaches improve over the baselines by 1-2 BLEU points.
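The abstract does not include the paper's pipeline, so as background, the standard BPE algorithm it builds on can be sketched as follows: repeatedly count adjacent symbol pairs over a word-frequency vocabulary and merge the most frequent pair into a new sub-word unit. This is a minimal illustrative sketch (the corpus, the `</w>` end-of-word marker, and the function names are illustrative, not taken from the paper):

```python
from collections import Counter

def merge_word(symbols, pair):
    """Merge every adjacent occurrence of `pair` in a symbol list."""
    out, i = [], 0
    while i < len(symbols):
        if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])  # fuse the pair into one unit
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def learn_bpe(corpus, num_merges):
    """Learn BPE merge operations from {symbol-sequence tuple: frequency}."""
    vocab = dict(corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = {tuple(merge_word(list(word), best)): freq
                 for word, freq in vocab.items()}
    return merges

# Toy character-level corpus; `</w>` marks the end of a word.
corpus = {
    ('l', 'o', 'w', '</w>'): 5,
    ('l', 'o', 'w', 'e', 'r', '</w>'): 2,
    ('n', 'e', 'w', 'e', 's', 't', '</w>'): 6,
    ('w', 'i', 'd', 'e', 's', 't', '</w>'): 3,
}
merges = learn_bpe(corpus, 4)
# The first merges pick up the frequent suffix "est</w>",
# i.e. exactly the kind of sub-word unit reused for rare words.
```

The paper's contribution sits on top of such segmentation: Mongolian special characters are first rewritten into an explicit form so the segmenter does not corrupt them, and embeddings for the resulting sub-word units are pre-trained on monolingual data.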