KIT-Multi: A Translation-Oriented Multilingual Embedding Corpus.
LREC(2018)
Abstract
Cross-lingual word embeddings represent words from different languages in a shared continuous vector space. They have been shown to help in developing cross-lingual natural language processing tools. When more than two languages are involved, we call them multilingual word embeddings. In this work, we introduce a multilingual word embedding corpus acquired by using neural machine translation. Unlike other cross-lingual embedding corpora, the embeddings can be learned from significantly smaller portions of data and for multiple languages at once. An intrinsic evaluation on monolingual tasks shows that our method is fairly competitive with the prevalent methods, while on the cross-lingual document classification task it obtains the best figures. We are in the process of producing embeddings for more languages, especially languages that belong to the same family or are semantically close to each other, such as Japanese-Korean, Chinese-Vietnamese, German-Dutch, or Latin-based languages. Furthermore, the corpus is being analyzed regarding its usage and usefulness in other cross-lingual tasks.
Keywords
multilingual embeddings, cross-lingual embeddings, neural machine translation, multi-source translation