Adjusting the adjusted Rand Index

arXiv (2022)

Abstract
The Adjusted Rand Index (ARI) is arguably one of the most popular measures for cluster comparison. The adjustment of the ARI is based on a hypergeometric distribution assumption, which is not satisfactory from a modeling point of view because (i) it is not appropriate when the two clusterings are dependent, (ii) it forces the size of the clusters, and (iii) it ignores the randomness of the sampling. In this work, we present a new "modified" version of the Rand Index. First, as in Russell et al. (J Malar Inst India 3(1), 1940), we consider only the pairs consistent by similarity and ignore the pairs consistent by difference to define the MRI. Second, we base the adjusted version, called MARI, on a multinomial distribution instead of a hypergeometric distribution. The multinomial model is advantageous because it does not force the size of the clusters, correctly models randomness, and is easily extended to the dependent case. We show that the ARI is biased under the multinomial model and that the difference between ARI and MARI can be significant for small n but essentially vanishes for large n, where n is the number of individuals. Finally, we provide an efficient algorithm to compute all these quantities ((A)RI and M(A)RI) based on a sparse representation of the contingency table in our aricode package. The space and time complexity is linear with respect to the number of samples and, more importantly, does not depend on the number of clusters, as we do not explicitly compute the contingency table.
Keywords
Clustering, Rand Index, Multinomial distribution, Statistical inference
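
The sparse computation mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the aricode implementation (aricode is an R package), and the function names below are hypothetical. It computes the classical Rand Index and the standard hypergeometric-adjusted ARI by counting only the label pairs that actually occur, so no dense contingency table is built. With hash-based counting the cost is expected to be linear in n. The multinomial adjustment behind MARI is not specified in the abstract, so it is deliberately left out.

```python
from collections import Counter
from math import comb


def pair_counts(labels_a, labels_b):
    """Count co-clustered pairs using only the non-empty contingency cells.

    Hypothetical helper: returns (pairs together in both clusterings,
    pairs together in A, pairs together in B, total number of pairs).
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same individuals")
    n = len(labels_a)
    if n < 2:
        raise ValueError("at least two individuals are needed to form a pair")
    # Sparse contingency table: one entry per (cluster in A, cluster in B)
    # combination that is actually observed, so at most n entries.
    joint = Counter(zip(labels_a, labels_b))
    same_both = sum(comb(c, 2) for c in joint.values())
    same_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    same_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    return same_both, same_a, same_b, comb(n, 2)


def rand_index(labels_a, labels_b):
    """Classical Rand Index: fraction of pairs on which the clusterings agree."""
    same_both, same_a, same_b, total = pair_counts(labels_a, labels_b)
    # Pairs separated in both clusterings, by inclusion-exclusion.
    apart_both = total - same_a - same_b + same_both
    return (same_both + apart_both) / total


def adjusted_rand_index(labels_a, labels_b):
    """Standard ARI (hypergeometric adjustment), not the paper's MARI."""
    same_both, same_a, same_b, total = pair_counts(labels_a, labels_b)
    expected = same_a * same_b / total
    max_index = (same_a + same_b) / 2
    if max_index == expected:
        # Degenerate case, e.g. one single cluster or all singletons in both.
        return 1.0
    return (same_both - expected) / (max_index - expected)


if __name__ == "__main__":
    a = [0, 0, 1, 1]
    b = [0, 0, 1, 2]
    print(rand_index(a, b), adjusted_rand_index(a, b))
```

The first value returned by `pair_counts` is the number of pairs "consistent by similarity", the quantity the abstract says the MRI is built on. How that count is normalized and adjusted under the multinomial model is defined in the paper itself.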