A Morphologically-Aware Dictionary-based Data Augmentation Technique for Machine Translation of Under-Represented Languages
CoRR (2024)
Abstract
The availability of parallel texts is crucial to the performance of machine
translation models. However, most of the world's languages face the predominant
challenge of data scarcity. In this paper, we propose strategies to synthesize
parallel data relying on morpho-syntactic information and using bilingual
lexicons along with a small amount of seed parallel data. Our methodology
adheres to a realistic scenario backed by the small parallel seed data. It is
linguistically informed, as it aims to create augmented data that is more
likely to be grammatically correct. We analyze how our synthetic data can be
combined with raw parallel data and demonstrate a consistent improvement in
performance in our experiments on 14 languages (28 English <-> X pairs) ranging
from well- to very low-resource ones. Our method leads to improvements even
when using only five seed sentences and a bilingual lexicon.
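
To make the general idea of dictionary-based augmentation concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy lexicon whose entries carry a coarse morphological tag, and it creates a new sentence pair by swapping a lexicon pair found in a seed sentence for another entry with the same tag. The names `LEXICON` and `augment`, the tag labels, and the English-Spanish example are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# Hypothetical toy lexicon: (source word, target word, coarse morphological tag).
LEXICON: List[Tuple[str, str, str]] = [
    ("dog", "perro", "NOUN.SG"),
    ("cat", "gato", "NOUN.SG"),
    ("cats", "gatos", "NOUN.PL"),
]

Pair = Tuple[List[str], List[str]]


def augment(src: List[str], tgt: List[str],
            lexicon: List[Tuple[str, str, str]]) -> List[Pair]:
    """Swap a lexicon pair present in both sentences for another entry with the same tag."""
    out: List[Pair] = []
    for s_word, t_word, tag in lexicon:
        # Only substitute when both sides of the entry occur in the seed pair.
        if s_word in src and t_word in tgt:
            for s_new, t_new, new_tag in lexicon:
                # Requiring an identical tag is the (assumed) morphological constraint.
                if new_tag == tag and s_new != s_word:
                    new_src = [s_new if w == s_word else w for w in src]
                    new_tgt = [t_new if w == t_word else w for w in tgt]
                    out.append((new_src, new_tgt))
    return out


seed_src = "the dog sleeps".split()
seed_tgt = "el perro duerme".split()
pairs = augment(seed_src, seed_tgt, LEXICON)
```

In this sketch the plural entry is skipped because its tag differs from that of the word being replaced, which is one simple way to keep the synthetic sentence more likely to remain grammatical.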