Mining parallel sentences from internet with multi-view knowledge distillation for low-resource language pairs

Shaolin Zhu, Shiwei Gu, Shangjie Li, Lin Xu, Deyi Xiong

Research Square (2023)

Abstract
Neural machine translation (NMT), like other deep learning approaches, relies on large amounts of training data, in this case bilingual parallel sentences, to reach state-of-the-art performance. The number of available parallel sentences is therefore critical for building NMT systems, yet such bilingual resources are scarce for many low-resource language pairs. Although several works attempt to mine bilingual parallel data from the Internet, the quality and quantity of the mined corpora remain limited for low-resource language pairs. To address this problem, we propose a multi-view knowledge distillation model (MvKD) that transfers knowledge from high-resource language pairs to low-resource languages by exploiting language-invariant representations shared across languages. In particular, we cast the mining of bilingual parallel sentence pairs as a classification task and use a multi-view classifier to detect parallel sentence pairs. The classifier recognizes the semantic relationship between two sentences from two views: (i) word-level representations and (ii) sentence-level representations. Sentence-level representations capture whether two sentences are semantically similar, while word-level representations capture word translations within a candidate pair, avoiding the problem of semantically similar but non-parallel sentences being accepted as parallel. Experimental results demonstrate that our method mines significantly more bilingual data and improves the quality of the extracted parallel sentences. In particular, experiments on several real-world low-resource settings achieve excellent results.
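To make the two-view idea concrete, below is a minimal PyTorch sketch of such a classifier, not the paper's actual implementation: the class name `MultiViewClassifier`, the mean-pooling sentence view, the greedy max-alignment word view, and the `kd_loss` helper are all illustrative assumptions, and the shared multilingual encoder producing the token embeddings is assumed to exist elsewhere.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiViewClassifier(nn.Module):
    """Toy two-view classifier for parallel sentence detection.

    Sentence view: cosine similarity of mean-pooled embeddings, capturing
    overall semantic similarity. Word view: greedy token alignment, which
    penalizes pairs that are topically similar but not actual translations.
    """

    def __init__(self):
        super().__init__()
        # Fuse the two view scores into a single parallel/non-parallel logit.
        self.scorer = nn.Linear(2, 1)

    def sentence_view(self, src, tgt):
        # src, tgt: (batch, seq_len, hidden) token embeddings from a shared
        # multilingual encoder (assumed given; not defined here).
        s = F.normalize(src.mean(dim=1), dim=-1)
        t = F.normalize(tgt.mean(dim=1), dim=-1)
        return (s * t).sum(dim=-1)  # (batch,) cosine similarity

    def word_view(self, src, tgt):
        # Pairwise token similarities, then greedy max-alignment in both
        # directions, averaged: a soft check for word-level translation.
        sim = torch.bmm(F.normalize(src, dim=-1),
                        F.normalize(tgt, dim=-1).transpose(1, 2))
        fwd = sim.max(dim=2).values.mean(dim=1)  # each src token's best match
        bwd = sim.max(dim=1).values.mean(dim=1)  # each tgt token's best match
        return 0.5 * (fwd + bwd)  # (batch,)

    def forward(self, src, tgt):
        views = torch.stack(
            [self.sentence_view(src, tgt), self.word_view(src, tgt)], dim=-1)
        return self.scorer(views).squeeze(-1)  # logit per sentence pair


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    # Soft-label distillation for the binary classifier: push the student
    # (low-resource pair) toward a teacher trained on a high-resource pair.
    soft_targets = torch.sigmoid(teacher_logits / temperature)
    return F.binary_cross_entropy_with_logits(
        student_logits / temperature, soft_targets)


# Usage with random tensors standing in for encoder outputs.
model = MultiViewClassifier()
src = torch.randn(4, 12, 768)   # 4 candidate source sentences
tgt = torch.randn(4, 15, 768)   # 4 candidate target sentences
p_parallel = torch.sigmoid(model(src, tgt))  # P(parallel) per pair
```

Combining the two views before scoring, rather than thresholding sentence similarity alone, is what lets a classifier of this shape reject sentence pairs that share a topic but are not mutual translations.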
Keywords
Neural machine translation, Bilingual corpus, Low-resource language, Knowledge distillation, Deep learning