Variable Selection for Mixed Data Clustering: Application in Human Population Genomics
Matthieu Marbac,
Mohammed Sedki and
Etienne Patin
Additional contact information
Matthieu Marbac: Ensai
Mohammed Sedki: University of Paris-Sud
Etienne Patin: Institut Pasteur
Journal of Classification, 2020, vol. 37, issue 1, No 8, 124-142
Abstract:
Model-based clustering of human population genomic data, comprising 1,318 individuals from western Central Africa and 160,470 markers, is considered. This challenging analysis leads us to develop a new methodology for variable selection in clustering. To explain the differences between subpopulations and to increase the accuracy of the estimates, variable selection is performed simultaneously with clustering. We propose two approaches for selecting variables when clustering is performed with the latent class model (i.e., a mixture model assuming independence within components). The first method simultaneously performs model selection and parameter inference. It optimizes the Bayesian Information Criterion with a modified version of the standard expectation–maximization algorithm. The second method performs model selection without requiring parameter inference by maximizing the Maximum Integrated Complete-data Likelihood criterion. Although the application considers categorical data, the proposed methods are introduced in the general context of mixed data (data composed of different types of features). As a first step, the value of both proposed methods is shown on simulated data and on several real benchmark data sets. We then apply the clustering method to the human population genomic data, which makes it possible to detect the most discriminative genetic markers. The proposed method is implemented in the R package VarSelLCM, available on CRAN.
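To make the role of the Bayesian Information Criterion more concrete, here is a minimal sketch of how the penalty could change when a categorical variable is declared irrelevant for clustering. It is not taken from the paper or from VarSelLCM. It assumes the common form BIC = log-likelihood − (ν/2) log n, and it assumes that a relevant variable has one distribution per component while an irrelevant variable has a single distribution shared by all components. The function names and the numbers in the usage lines are hypothetical.

```python
import math


def n_free_parameters(n_components, n_levels, relevant):
    """Count free parameters of a latent class model on categorical data.

    Assumption (illustrative, not from the source): a relevant variable has
    one multinomial distribution per component, an irrelevant variable has a
    single distribution shared by all components.
    """
    total = n_components - 1  # mixing proportions sum to one
    for levels, is_relevant in zip(n_levels, relevant):
        per_distribution = levels - 1  # probabilities of one variable sum to one
        total += per_distribution * (n_components if is_relevant else 1)
    return total


def bic(log_likelihood, n_params, n_obs):
    """BIC in the 'larger is better' form: log-likelihood minus penalty."""
    return log_likelihood - 0.5 * n_params * math.log(n_obs)


# Hypothetical example: 3 components, three categorical variables with
# 3, 3 and 2 levels; the second variable is treated as irrelevant.
nu_selected = n_free_parameters(3, [3, 3, 2], [True, False, True])
nu_full = n_free_parameters(3, [3, 3, 2], [True, True, True])
print(nu_selected, nu_full)
```

Under these assumptions, dropping a variable from the discriminative set reduces the parameter count, so the selected model is preferred whenever its log-likelihood loss is smaller than the saved penalty.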
Keywords: Human evolutionary genetics; Information criterion; Mixed data; Model-based clustering; Variable selection
Date: 2020
Downloads: (external link)
http://link.springer.com/10.1007/s00357-018-9301-y Abstract (text/html)
Access to the full text of the articles in this series is restricted.
Persistent link: https://EconPapers.repec.org/RePEc:spr:jclass:v:37:y:2020:i:1:d:10.1007_s00357-018-9301-y
DOI: 10.1007/s00357-018-9301-y
Journal of Classification is currently edited by Douglas Steinley
More articles in Journal of Classification from Springer, The Classification Society
Bibliographic data for series maintained by Sonal Shukla and Springer Nature Abstracting and Indexing.