Practical Comparable Data Collection for Low-Resource Languages via Images

arXiv (2020)

Abstract
We propose a method of curating high-quality comparable training data for low-resource languages without requiring the annotators to be bilingual. Our method uses a carefully selected set of images as a pivot between the source and target languages by collecting captions for these images in both languages independently. Human evaluations on the English-Hindi comparable corpora created with our method show that 81.1% of the pairs are acceptable translations, and only 2.47% of the pairs are not translations at all. We further establish the potential of the dataset collected through our approach by experimenting on two downstream tasks: machine translation and dictionary extraction. All code and data are available at https://github.com/madaan/PML4DC-Comparable-Data-Collection
Keywords
data collection,images,low-resource