EthioMT: Parallel Corpus for Low-resource Ethiopian Languages
arXiv (2024)
Abstract
Recent research in natural language processing (NLP) has achieved impressive
performance in tasks such as machine translation (MT), news classification, and
question-answering in high-resource languages. However, the performance of MT
leaves much to be desired for low-resource languages. This is due to the
smaller size of available parallel corpora in these languages, if such corpora
are available at all. NLP in Ethiopian languages suffers from the same issues
due to the unavailability of publicly accessible datasets for NLP tasks,
including MT. To help the research community and foster research for Ethiopian
languages, we introduce EthioMT – a new parallel corpus for 15 languages. We
also create a new benchmark by collecting a dataset for better-researched
languages in Ethiopia. We evaluate the newly collected corpus and the benchmark
dataset for 23 Ethiopian languages using transformer and fine-tuning
approaches.