Cross-lingual Text Classification via Model Translation with Limited Dictionaries

ACM International Conference on Information and Knowledge Management(2016)

引用 19|浏览122
暂无评分
摘要
Cross-lingual text classification (CLTC) refers to the task of classifying documents in different languages into the same taxonomy of categories. An open challenge in CLTC is to classify documents for the languages where labeled training data are not available. Existing approaches rely on the availability of either high-quality machine translation of documents (to the languages where massively training data are available), or rich bilingual dictionaries for effective translation of trained classification models (to the languages where labeled training data are lacking). This paper studies the CLTC challenge under the assumption that neither condition is met. That is, we focus on the problem of translating classification models with highly incomplete bilingual dictionaries. Specifically, we propose two new approaches that combines unsupervised word embedding in different languages, supervised mapping of embedded words across languages, and probabilistic translation of classification models. The approaches show significant performance improvement in CLTC on a benchmark corpus of Reuters news stories (RCV1/RCV2) in English, Spanish, German, French and Chinese and an internal dataset in Uzbek, compared to representative baseline methods using conventional bilingual dictionaries or highly incomplete ones.
更多
查看译文
关键词
Multilingual Text Data,Cross-lingual Text Classification,Transfer Learning
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要