Identifying clusters in genomics data by recursive partitioning

Gro Nilsen, Ørnulf Borgan, Knut Liestøl and Ole Christian Lingjærde

Additional contact information
Nilsen Gro: Biomedical Informatics, Department of Informatics, University of Oslo, Norway; and Centre for Cancer Biomedicine, University of Oslo, Norway
Borgan Ørnulf: Department of Mathematics, University of Oslo, Norway
Liestøl Knut: Biomedical Informatics, Department of Informatics, University of Oslo, Norway; and Centre for Cancer Biomedicine, University of Oslo, Norway
Lingjærde Ole Christian: Biomedical Informatics, Department of Informatics, University of Oslo, Norway; Centre for Cancer Biomedicine, University of Oslo, Norway; and K.G. Jebsen Centre for Breast Cancer Research, Oslo University Hospital, Oslo, Norway

Statistical Applications in Genetics and Molecular Biology, 2013, vol. 12, issue 5, 637-652

Abstract: Genomics studies frequently involve clustering of molecular data to identify groups, but common clustering methods such as K-means clustering and hierarchical clustering do not determine the number of clusters. Methods for estimating the number of clusters typically focus on identifying the global structure in the data; however, the discovery of substructures within clusters may also be of great biological interest. We propose a novel method, Partitioning Algorithm based on Recursive Thresholding (PART), that recursively uncovers distinct subgroups in the groups already identified. Outliers are common in high-dimensional genomics data and may mask the presence of substructure within a cluster. A crucial feature of the algorithm is the introduction of tentative splits of clusters to isolate outliers that might otherwise halt the recursion prematurely. The method is demonstrated on simulated data as well as on a wide range of real gene expression microarray data sets for which the correct clusters were known in advance. On simulated data, when subclusters are present and the variance is large or varies between the clusters, the proposed method performs better than two established global methods. On the real data sets, the overall performance of PART is superior to that of the global methods when used in combination with hierarchical clustering. The method is implemented in the R package clusterGenomics, which is freely available from CRAN (The Comprehensive R Archive Network).
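
The recursion with tentative splits described in the abstract can be illustrated with a small sketch. This is not the authors' PART algorithm and not the clusterGenomics API: the function names, the one-dimensional 2-means splitter, and the `min_size` and `min_gain` thresholds are all assumptions made for illustration, with a simple reduction in within-cluster sum of squares standing in for the global method that decides whether a split is real.

```python
from statistics import mean


def two_means(points, iters=20):
    """Split 1-D points into two groups with a tiny 2-means (illustrative stand-in)."""
    centers = [min(points), max(points)]
    groups = ([], [])
    for _ in range(iters):
        groups = ([], [])
        for p in points:
            # Assign each point to the nearer of the two centers.
            idx = 1 if abs(p - centers[0]) > abs(p - centers[1]) else 0
            groups[idx].append(p)
        if not groups[0] or not groups[1]:
            break
        centers = [mean(groups[0]), mean(groups[1])]
    return groups


def sse(xs):
    """Within-group sum of squared deviations from the mean."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs)


def recursive_partition(points, min_size=3, min_gain=0.9):
    """Recursively split a cluster; thresholds are hypothetical, not from the paper."""
    if len(points) < 2 * min_size:
        return [points]
    a, b = two_means(points)
    if not a or not b:
        return [points]
    small, large = sorted((a, b), key=len)
    if len(small) < min_size:
        # Tentative split: set the small group aside as possible outliers
        # and look for substructure in the remainder instead of stopping.
        sub = recursive_partition(large, min_size, min_gain)
        if len(sub) > 1:
            return sub + [small]  # substructure found; keep outliers separate
        return [points]  # nothing found; revert the tentative split
    gain = 1 - (sse(a) + sse(b)) / sse(points)
    if gain < min_gain:
        return [points]
    return (recursive_partition(a, min_size, min_gain)
            + recursive_partition(b, min_size, min_gain))


# Two tight groups plus one extreme outlier.
data = [0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 5.3, 100.0]
clusters = recursive_partition(data)
```

In this toy input the first split only separates the outlier 100.0 from everything else. Stopping there would leave the two tight groups merged, whereas treating the split as tentative lets the recursion continue and recover both groups.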

Keywords: cluster analysis; gene expression; genomics; recursion; subclusters
Date: 2013
Downloads:
https://doi.org/10.1515/sagmb-2013-0016 (text/html)
For access to full text, subscription to the journal or payment for the individual article is required.

Persistent link: https://EconPapers.repec.org/RePEc:bpj:sagmbi:v:12:y:2013:i:5:p:637-652:n:6

Ordering information: This journal article can be ordered from
https://www.degruyte ... urnal/key/sagmb/html

DOI: 10.1515/sagmb-2013-0016

Statistical Applications in Genetics and Molecular Biology is currently edited by Michael P. H. Stumpf

Bibliographic data for series maintained by Peter Golla.