Cross-language information retrieval with latent topic models trained on a comparable corpus

INFORMATION RETRIEVAL TECHNOLOGY (2011)

Abstract
In this paper, we study cross-language information retrieval using a bilingual topic model trained on comparable corpora such as Wikipedia articles. The bilingual Latent Dirichlet Allocation model (BiLDA) creates an interlingual representation that can serve as a translation resource in many multilingual settings, since comparable corpora are available for many language pairs. This probabilistic interlingual representation is incorporated into a statistical language model for information retrieval. Experiments on the English and Dutch test datasets of the CLEF 2001-2003 CLIR campaigns show that our approach is competitive with cross-language retrieval methods that rely on pre-existing translation dictionaries, whether hand-built or constructed from parallel corpora.
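
To make the idea of plugging an interlingual topic representation into a retrieval language model more concrete, here is a minimal sketch in Python. It assumes one common formulation, in which the probability of a source-language query word given a target-language document is obtained by summing over shared topics: P(w | d) = sum over k of P(w | z_k) * P(z_k | d). The abstract does not give the exact model, so this formula, the function name `query_log_likelihood`, the `floor` parameter, and all toy probabilities below are illustrative assumptions rather than the authors' method or data.

```python
import math


def query_log_likelihood(query_words, doc_topic_probs, topic_word_probs, floor=1e-12):
    """Score one target-language document against a source-language query.

    doc_topic_probs: list of P(z_k | d) for the document (hypothetical values).
    topic_word_probs: dict mapping a query-language word to its list of
        P(w | z_k), one entry per shared topic (hypothetical values).
    floor: small constant that avoids log(0) for unseen words (an assumption,
        standing in for whatever smoothing a real system would use).
    """
    num_topics = len(doc_topic_probs)
    score = 0.0
    for word in query_words:
        per_topic = topic_word_probs.get(word, [0.0] * num_topics)
        # Marginalize over the shared topics: sum_k P(w | z_k) * P(z_k | d).
        p_word = sum(pw * pz for pw, pz in zip(per_topic, doc_topic_probs))
        score += math.log(max(p_word, floor))
    return score


def rank_documents(query_words, docs, topic_word_probs):
    """Return document ids sorted from highest to lowest query log-likelihood."""
    scored = {
        doc_id: query_log_likelihood(query_words, theta, topic_word_probs)
        for doc_id, theta in docs.items()
    }
    return sorted(scored, key=scored.get, reverse=True)


# Toy example with two shared topics (topic 0: finance, topic 1: nature).
# English query words with made-up per-topic probabilities.
TOPIC_WORD_PROBS = {
    "bank": [0.30, 0.05],
    "money": [0.40, 0.01],
    "river": [0.02, 0.50],
}

# Dutch documents represented only by made-up per-document topic distributions.
DOCS = {
    "nl_finance": [0.9, 0.1],
    "nl_nature": [0.2, 0.8],
}

finance_ranking = rank_documents(["bank", "money"], DOCS, TOPIC_WORD_PROBS)
nature_ranking = rank_documents(["river"], DOCS, TOPIC_WORD_PROBS)
```

In this sketch the query and the documents never share vocabulary; they meet only in the topic space, which is why no translation dictionary is needed. A practical system would typically also smooth these estimates, for example with a collection-level background model.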
Keywords
comparable corpus, cross-language information retrieval, information retrieval, language pair, interlingual representation, retrieval method, latent topic model, pre-existing translation dictionary, bilingual topic model, bilingual latent Dirichlet allocation, statistical language model