Clustering transformed compositional data using K-means, with applications in gene expression and bicycle sharing system data
Antoine Godichon-Baggioni,
Cathy Maugis-Rabusseau and
Andrea Rau
Journal of Applied Statistics, 2019, vol. 46, issue 1, 47-65
Abstract:
Although there is no shortage of clustering algorithms proposed in the literature, the question of the most relevant strategy for clustering compositional data (i.e. data whose rows belong to the simplex) remains largely unexplored in cases where the observed value is equal or close to zero for one or more samples. This work is motivated by the analysis of two applications, both focused on the categorization of compositional profiles: (1) identifying groups of co-expressed genes from high-throughput RNA sequencing data, in which a given gene may be completely silent in one or more experimental conditions; and (2) finding patterns in the usage of stations over the course of one week in the Velib' bicycle sharing system in Paris, France. For both of these applications, we make use of appropriately chosen data transformations, including the Centered Log Ratio and a novel extension called the Log Centered Log Ratio, in conjunction with the K-means algorithm. We use a non-asymptotic penalized criterion, whose penalty is calibrated with the slope heuristics, to select the number of clusters. Finally, we illustrate the performance of this clustering strategy, which is implemented in the Bioconductor package coseq, on both the gene expression and bicycle sharing system data.
Date: 2019
References: Add references at CitEc
Citations:
Downloads: (external link)
http://hdl.handle.net/10.1080/02664763.2018.1454894 (text/html)
Access to full text is restricted to subscribers.
Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.
Export reference: BibTeX
RIS (EndNote, ProCite, RefMan)
HTML/Text
Persistent link: https://EconPapers.repec.org/RePEc:taf:japsta:v:46:y:2019:i:1:p:47-65
Ordering information: This journal article can be ordered from
http://www.tandfonline.com/pricing/journal/CJAS20
DOI: 10.1080/02664763.2018.1454894
Access Statistics for this article
Journal of Applied Statistics is currently edited by Robert Aykroyd
More articles in Journal of Applied Statistics from Taylor & Francis Journals
Bibliographic data for series maintained by Chris Longhurst ().