On The Existence Of Obstinate Results In Vector Space Models

IR(2010)

引用 111|浏览58
暂无评分
摘要
The vector space model (VSM) is a popular and widely applied model in information retrieval (IR). VSM creates vector spaces whose dimensionality is usually high (e.g., tens of thousands of terms). This may cause various problems, such as susceptibility to noise and difficulty in capturing the underlying semantic structure, which are commonly recognized as different aspects of the "curse of dimensionality." In this paper, we investigate a novel aspect of the dimensionality curse, which is referred to as hubness and manifested by the tendency of some documents (called hubs) to be included in unexpectedly many search result lists. Hubness may impact VSM considerably since hubs can become obstinate results, irrelevant to a large number of queries, thus harming the performance of an IR system and the experience of its users. We analyze the origins of hubness, showing it is primarily a consequence of high (intrinsic) dimensionality of data, and not a result of other factors such as sparsity and skewness of the distribution of term frequencies. We describe the mechanisms through which hubness emerges by exploring the behavior of similarity measures in high-dimensional vector spaces. Our consideration begins with the classical VSM (tf-idf term weighting and cosine similarity), but the conclusions generalize to more advanced variations, such as Okapi BM25. Moreover, we explain why hubness may not be easily mitigated by dimensionality reduction, and propose a similarity adjustment scheme that takes into account the existence of hubs. Experimental results over real data indicate that significant improvement can be obtained through consideration of hubness.
更多
查看译文
关键词
Text retrieval,vector space model,nearest neighbors,curse of dimensionality,hubs,similarity concentration,cosine similarity
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要