Wikipedia-Based Relatedness Measurements for Multilingual Short Text Clustering.

ACM Trans. Asian & Low-Resource Lang. Inf. Process.(2019)

引用 9|浏览163
暂无评分
摘要
Throughout the world, people can post information about their local area in their own languages using social networking services. Multilingual short text clustering is an important task to organize such information, and it can be applied to various applications, such as event detection and summarization. However, measuring the relatedness between short texts written in various languages is a challenging problem. In addition to handling multiple languages, the semantic gaps among all languages must be considered. In this article, we propose two Wikipedia-based semantic relatedness measurement methods for multilingual short text clustering. The proposed methods solve the semantic gap problem by incorporating the inter-language links of Wikipedia into Extended Naive Bayes (ENB), a probabilistic method that can be applied to measure semantic relatedness among monolingual short texts. The proposed methods represent a multilingual short text as a vector of the English version of Wikipedia articles (entities). By transferring texts to a unified vector space, the relatedness between texts in different languages with similar meanings can be increased. We also propose an approach that can improve clustering performance and reduce the processing time by eliminating language-specific entities in the unified vector space. Experimental results on multilingual Twitter message clustering revealed that the proposed methods outperformed cross-lingual explicit semantic analysis, a previously proposed method to measure relatedness between texts in different languages. Moreover, the proposed methods were comparable to ENB applied to texts translated into English using a proprietary translation service. The proposed methods enabled relatedness measurements for multilingual short text clustering without requiring machine translation processes.
更多
查看译文
关键词
Semantic relatedness, multilingual text analysis, short text analysis, text clustering
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要