Adjusting the adjusted Rand Index

Sundqvist, Martina; Chiquet, Julien; Rigaill, Guillem

Martina Sundqvist, Julien Chiquet and Guillem Rigaill
Additional contact information
Martina Sundqvist: Université Paris-Saclay, AgroParisTech, INRAE, UMR MIA-Paris
Julien Chiquet: Université Paris-Saclay, AgroParisTech, INRAE, UMR MIA-Paris
Guillem Rigaill: Université Paris-Saclay, CNRS, INRAE, Univ Evry, Institute of Plant Sciences Paris-Saclay (IPS2)

Computational Statistics, 2023, vol. 38, issue 1, No 16, 327-347

Abstract: The Adjusted Rand Index (ARI) is arguably one of the most popular measures for cluster comparison. The adjustment of the ARI is based on a hypergeometric distribution assumption which is not satisfactory from a modeling point of view because (i) it is not appropriate when the two clusterings are dependent, (ii) it forces the size of the clusters, and (iii) it ignores the randomness of the sampling. In this work, we present a new "modified" version of the Rand Index. First, as in Russell et al. (J Malar Inst India 3(1), 1940), we consider only the pairs consistent by similarity and ignore the pairs consistent by difference to define the MRI. Second, we base the adjusted version, called MARI, on a multinomial distribution instead of a hypergeometric distribution. The multinomial model is advantageous because it does not force the size of the clusters, correctly models randomness and is easily extended to the dependent case. We show that ARI is biased under the multinomial model and that the difference between ARI and MARI can be significant for small n but essentially vanishes for large n, where n is the number of individuals. Finally, we provide an efficient algorithm to compute all these quantities ((A)RI and M(A)RI) based on a sparse representation of the contingency table in our aricode package. The space and time complexity is linear with respect to the number of samples and, more importantly, does not depend on the number of clusters as we do not explicitly compute the contingency table.
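The sparse-contingency idea mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the aricode implementation, and every function name below is hypothetical. It stores only the non-empty cells of the contingency table in a hash map, so time and memory grow with the number of samples and not with the number of clusters. The ARI shown is the classical hypergeometric-adjusted version. The last function returns the fraction of pairs grouped together in both clusterings, which is one reading of the "pairs consistent by similarity" that define the MRI; its normalization is an assumption. The multinomial adjustment (MARI) is not reproduced here because its formula is not given in the abstract.

```python
from collections import Counter
from math import comb


def pair_counts(labels_a, labels_b):
    """Count co-clustered pairs using only the non-empty contingency cells.

    Returns (total pairs, pairs together in both clusterings,
    pairs together in labels_a, pairs together in labels_b).
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    n = len(labels_a)
    if n < 2:
        raise ValueError("at least two samples are needed to form a pair")
    # Sparse contingency table: one entry per observed (label_a, label_b) pair.
    cells = Counter(zip(labels_a, labels_b))
    rows = Counter(labels_a)
    cols = Counter(labels_b)
    both = sum(comb(c, 2) for c in cells.values())
    in_a = sum(comb(c, 2) for c in rows.values())
    in_b = sum(comb(c, 2) for c in cols.values())
    return comb(n, 2), both, in_a, in_b


def rand_index(labels_a, labels_b):
    """Fraction of pairs on which the two clusterings agree."""
    total, both, in_a, in_b = pair_counts(labels_a, labels_b)
    # Agreeing pairs = together in both + apart in both.
    return (total + 2 * both - in_a - in_b) / total


def adjusted_rand_index(labels_a, labels_b):
    """Classical ARI (hypergeometric adjustment)."""
    total, both, in_a, in_b = pair_counts(labels_a, labels_b)
    expected = in_a * in_b / total
    max_index = (in_a + in_b) / 2
    if max_index == expected:
        # Degenerate case, e.g. a single cluster in both clusterings.
        return 1.0
    return (both - expected) / (max_index - expected)


def similar_pair_fraction(labels_a, labels_b):
    """Fraction of pairs grouped together in both clusterings.

    Assumption: this normalization is our reading of the abstract's MRI.
    """
    total, both, _, _ = pair_counts(labels_a, labels_b)
    return both / total
```

For example, rand_index([0, 0, 1, 1], [0, 1, 0, 1]) counts two agreeing pairs out of six, while adjusted_rand_index on the same input is negative because the agreement is below what the hypergeometric model expects by chance.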

Keywords: Clustering; Rand Index; Multinomial distribution; Statistical inference
Date: 2023

Downloads:
http://link.springer.com/10.1007/s00180-022-01230-7 (abstract, text/html)
Access to the full text of the articles in this series is restricted.

Persistent link: https://EconPapers.repec.org/RePEc:spr:compst:v:38:y:2023:i:1:d:10.1007_s00180-022-01230-7

Ordering information: This journal article can be ordered from
http://www.springer.com/statistics/journal/180/PS2

DOI: 10.1007/s00180-022-01230-7

Computational Statistics is currently edited by Wataru Sakamoto, Ricardo Cao and Jürgen Symanzik
