Improving document clustering using Okapi BM25 feature weighting

Information Retrieval (2011)

Abstract
We investigate the effect of feature weighting on document clustering, including a novel investigation of Okapi BM25 feature weighting. Using eight document datasets and 17 well-established clustering algorithms, we show that the benefit of tf-idf weighting over tf weighting is heavily dependent on both the dataset being clustered and the algorithm used. In addition, binary weighting is shown to be consistently inferior to both tf-idf weighting and tf weighting. We investigate clustering using both BM25 term saturation in isolation and BM25 term saturation with idf, confirming that both are superior to their non-BM25 counterparts under several common clustering quality measures. Finally, we investigate estimation of the k1 BM25 parameter when clustering. Our results indicate that typical values of k1 from other IR tasks are not appropriate for clustering; k1 needs to be higher.
Keywords
Document clustering, Feature weighting, Okapi BM25
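
The two BM25 weightings named in the abstract (term saturation in isolation, and term saturation with idf) can be illustrated with a minimal Python sketch. It assumes the commonly cited Okapi BM25 term-frequency component and a plain log(N/df) idf; the abstract does not state the exact variant the paper uses, so the function names, the length-normalisation parameter b, and the default values below are illustrative assumptions rather than the authors' implementation.

```python
import math


def bm25_saturation(tf, k1=1.2, b=0.75, doc_len=1.0, avg_doc_len=1.0):
    """Standard Okapi BM25 term-frequency component (assumed form).

    The value grows with tf but is bounded above by k1 + 1.
    """
    if tf <= 0:
        return 0.0
    # Length normalisation: longer-than-average documents are damped.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return tf * (k1 + 1.0) / (tf + norm)


def idf(n_docs, doc_freq):
    """Plain inverse document frequency, log(N / df) (assumed variant)."""
    return math.log(n_docs / doc_freq)


def weight_document(term_counts, doc_freqs, n_docs, avg_doc_len,
                    k1=1.2, b=0.75, use_idf=True):
    """Return a term -> weight mapping for one document.

    use_idf=False gives saturation in isolation;
    use_idf=True multiplies each saturated tf by idf.
    """
    doc_len = sum(term_counts.values())
    weights = {}
    for term, tf in term_counts.items():
        w = bm25_saturation(tf, k1, b, doc_len, avg_doc_len)
        if use_idf:
            w *= idf(n_docs, doc_freqs[term])
        weights[term] = w
    return weights
```

In this form, k1 controls how quickly repeated occurrences of a term stop adding weight: with a small k1 the weight flattens after a few occurrences, while a large k1 makes the weight grow almost linearly with tf. The resulting term-to-weight mappings can then be turned into the document vectors passed to a clustering algorithm.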