Post-clustering difference testing: Valid inference and practical considerations with applications to ecological and biological data
Benjamin Hivert, Denis Agniel, Rodolphe Thiébaut and Boris P. Hejblum
Computational Statistics & Data Analysis, 2024, vol. 193, issue C
Abstract:
Clustering is a family of unsupervised analysis methods that group samples into homogeneous and separate subgroups of observations, called clusters. To interpret the clusters, statistical hypothesis testing is often used to identify the variables that significantly separate the estimated clusters from each other. However, the hypotheses tested are then data-driven, because they are derived from the clustering results. This double use of the data causes traditional hypothesis tests to fail to control the Type I error rate, particularly because of uncertainty in the clustering process and the artificial differences it can create. Three novel statistical hypothesis tests are introduced, each designed to account for the clustering process. These tests efficiently control the Type I error rate by identifying only variables that contain a true signal separating groups of observations. The proposed tests are applied in two distinct contexts, animal ecology and immunology, demonstrating their relevance on real datasets.
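The double use of the data described in the abstract can be illustrated with a small simulation. The sketch below is not one of the three tests introduced in the article; it only shows the problem they address. It assumes a single Gaussian variable with no true group structure, a simple one-dimensional 2-means clustering, and a large-sample normal approximation to Welch's test. All function names and parameter values are hypothetical choices made for this illustration.

```python
import random
from statistics import NormalDist, mean, variance


def two_means_1d(values, n_iter=50):
    """Lloyd's algorithm with k=2 in one dimension; returns 0/1 labels."""
    c0, c1 = min(values), max(values)  # initial centers (illustrative choice)
    labels = [0] * len(values)
    for _ in range(n_iter):
        # Assign each point to its nearest center.
        labels = [0 if abs(v - c0) <= abs(v - c1) else 1 for v in values]
        g0 = [v for v, lab in zip(values, labels) if lab == 0]
        g1 = [v for v, lab in zip(values, labels) if lab == 1]
        if not g0 or not g1:
            break
        # Move each center to the mean of its group.
        c0, c1 = mean(g0), mean(g1)
    return labels


def welch_p_value(a, b):
    """Two-sided p-value for equal means (normal approximation, large samples)."""
    if len(a) < 2 or len(b) < 2:
        return 1.0  # too few points to estimate a variance
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def rejection_rates(n_sims=200, n=100, alpha=0.05, seed=0):
    """Rejection rates under the null for clustered vs. pre-specified groups."""
    rng = random.Random(seed)
    naive = 0
    control = 0
    for _ in range(n_sims):
        # Pure noise: there is no true signal separating any groups.
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        labels = two_means_1d(x)
        a = [v for v, lab in zip(x, labels) if lab == 0]
        b = [v for v, lab in zip(x, labels) if lab == 1]
        # Naive test: groups come from clustering the same data being tested.
        naive += welch_p_value(a, b) < alpha
        # Control: groups fixed before looking at the data.
        control += welch_p_value(x[: n // 2], x[n // 2:]) < alpha
    return naive / n_sims, control / n_sims


if __name__ == "__main__":
    naive_rate, control_rate = rejection_rates()
    print(f"naive post-clustering rejection rate: {naive_rate:.2f}")
    print(f"pre-specified groups rejection rate:  {control_rate:.2f}")
```

Under these assumptions, the naive post-clustering test should reject in nearly every simulated dataset even though no signal exists, while the test on groups fixed in advance should reject at a rate close to the nominal level. This gap is the Type I error inflation that a test accounting for the clustering process is meant to remove.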
Keywords: Clustering; Double-dipping; Multimodality test; Selective inference
Date: 2024
Downloads:
http://www.sciencedirect.com/science/article/pii/S016794732300227X
Full text for ScienceDirect subscribers only.
Persistent link: https://EconPapers.repec.org/RePEc:eee:csdana:v:193:y:2024:i:c:s016794732300227x
DOI: 10.1016/j.csda.2023.107916
Computational Statistics & Data Analysis is currently edited by S.P. Azen
Bibliographic data for series maintained by Catherine Liu.