A Generic Query-Based Model for Scalable Clustering

msra(2006)

Abstract
This paper presents a generic model for clustering that requires no direct knowledge of the nature or representation of the data. In lieu of such knowledge, the relevant-set clustering (RSC) model relies solely on the existence of an oracle that accepts a query in the form of a data item, and returns a ranked set of items relevant to the query. In principle, the role of the oracle could be played by any similarity search structure, or even a commercial search engine whose ranking function and relevancy scores are kept secret. The quality of cluster candidates, the degree of association between pairs of cluster candidates, and the degree of association between clusters and data items are all assessed according to the statistical significance of a form of correlation among pairs of relevant sets and/or candidate cluster sets. A scalable clustering heuristic based on the RSC model is also presented, and demonstrated for very large, high-dimensional datasets using a fast approximate similarity search structure as the oracle.
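To make the oracle idea concrete, the following is a minimal sketch rather than the paper's implementation. It assumes a brute-force nearest-neighbor search as the oracle and uses the Pearson correlation between set-membership indicator vectors as one possible "form of correlation" between relevant sets; the abstract does not specify the exact measure. The names `make_knn_oracle` and `set_correlation` and the toy data are hypothetical.

```python
from math import sqrt

def make_knn_oracle(points, k):
    """Hypothetical oracle: maps a query item index to a ranked list of k item indices.

    Brute-force Euclidean search stands in for whatever similarity search
    structure or search engine plays the oracle's role.
    """
    def oracle(q):
        def dist2(i):
            return sum((a - b) ** 2 for a, b in zip(points[q], points[i]))
        # sorted() is stable, so ties keep their original index order
        return sorted(range(len(points)), key=dist2)[:k]
    return oracle

def set_correlation(a, b, n):
    """Pearson correlation of the 0/1 membership vectors of sets a and b
    over a universe of n items (an assumed choice of correlation measure)."""
    a, b = set(a), set(b)
    denom = sqrt(len(a) * (n - len(a)) * len(b) * (n - len(b)))
    if denom == 0:
        return 0.0  # undefined when a set is empty or covers the whole universe
    return (n * len(a & b) - len(a) * len(b)) / denom

# Toy data: two well-separated groups of three points each
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
oracle = make_knn_oracle(points, k=3)

same_group = set_correlation(oracle(0), oracle(1), len(points))
other_group = set_correlation(oracle(0), oracle(3), len(points))
```

In this toy setting, items from the same group return identical relevant sets and so correlate strongly, while items from different groups return disjoint sets and correlate negatively. The clustering logic only ever sees the ranked lists, never the coordinates, which is the sense in which the model needs no knowledge of the data representation.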
Keywords
similarity search, statistical significance, search engine