Training a Bilingual Language Model by Mapping Tokens onto a Shared Character Space
CoRR (2024)
Abstract
We train a bilingual Arabic-Hebrew language model using a transliterated
version of Arabic texts in Hebrew, to ensure both languages are represented in
the same script. Given the morphological and structural similarities between
Arabic and Hebrew, and the large number of cognates they share, we assess the
performance of a language model that employs a unified script for both
languages on machine translation, which requires cross-lingual knowledge. The
results are promising: our model outperforms a contrasting model which keeps
the Arabic texts in the Arabic script, demonstrating the efficacy of the
transliteration step. Despite being trained on a dataset approximately 60%
smaller than that of other existing language models, our model appears to
deliver comparable performance in machine translation across both translation
directions.
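
To illustrate the transliteration step described above, here is a minimal sketch of a character-level mapping from Arabic script to Hebrew script. The table `ARABIC_TO_HEBREW` and the function `transliterate` are hypothetical names, and the mapping itself is an assumption based on the shared abjad order of the two alphabets, with a geresh marking Arabic letters that have no direct Hebrew counterpart. It is not the authors' mapping, and it ignores diacritics, letter variants such as hamza forms, and Hebrew final letter forms.

```python
# Hypothetical letter map for illustration only; not the mapping used in the paper.
ARABIC_TO_HEBREW = {
    "ا": "א", "ب": "ב", "ج": "ג", "د": "ד", "ه": "ה", "و": "ו",
    "ز": "ז", "ح": "ח", "ط": "ט", "ي": "י", "ك": "כ", "ل": "ל",
    "م": "מ", "ن": "נ", "س": "ס", "ع": "ע", "ف": "פ", "ص": "צ",
    "ق": "ק", "ر": "ר", "ش": "ש", "ت": "ת",
    # Assumed convention: Arabic letters without a direct Hebrew
    # counterpart become a related base letter followed by a geresh.
    "ث": "ת׳", "خ": "ח׳", "ذ": "ד׳", "ض": "צ׳", "ظ": "ט׳", "غ": "ע׳",
}


def transliterate(text: str) -> str:
    """Map each Arabic letter to Hebrew script; leave all other characters unchanged."""
    return "".join(ARABIC_TO_HEBREW.get(ch, ch) for ch in text)
```

Under this mapping, the Arabic root كتب ("to write") becomes the Hebrew-script string כתב, which coincides with its Hebrew cognate. Surfacing such overlaps to a shared tokenizer is the kind of effect a unified script is meant to achieve.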