Sparse latent factor regression models for genome-wide and epigenome-wide association studies
Jumentier Basile,
Caye Kevin,
Heude Barbara,
Lepeule Johanna and
François Olivier
Additional contact information
Jumentier Basile: Centre National de la Recherche Scientifique, Grenoble INP, TIMC-IMAG CNRS UMR 5525, Université Grenoble-Alpes, Grenoble, 38000, France
Caye Kevin: Centre National de la Recherche Scientifique, Grenoble INP, TIMC-IMAG CNRS UMR 5525, Université Grenoble-Alpes, Grenoble, 38000, France
Heude Barbara: Institut National de la Santé et de la Recherche Médicale, Centre of Research in Epidemiology and Statistics, INSERM UMR 1153, Université de Paris, F75004 Paris, France
Lepeule Johanna: Centre National de la Recherche Scientifique, Institut National de la Santé et de la Recherche Médicale, Institute for Advanced Biosciences, INSERM U 1209, CNRS UMR 5309, Université Grenoble-Alpes, Grenoble, 38000, France
François Olivier: Centre National de la Recherche Scientifique, Grenoble INP, TIMC-IMAG CNRS UMR 5525, Université Grenoble-Alpes, Grenoble, 38000, France
Statistical Applications in Genetics and Molecular Biology, 2022, vol. 21, issue 1, 19
Abstract:
Association of phenotypes or exposures with genomic and epigenomic data faces important statistical challenges. One of these challenges is to account for variation due to unobserved confounding factors, such as individual ancestry or cell-type composition in tissues. This issue can be addressed with penalized latent factor regression models, where penalties are introduced to cope with the high dimension of the data. If a relatively small proportion of genomic or epigenomic markers correlate with the variable of interest, sparsity penalties may help to capture the relevant associations, but the improvement over non-sparse approaches has not yet been fully evaluated. Here, we present least-squares algorithms that jointly estimate effect sizes and confounding factors in sparse latent factor regression models. In simulated data, sparse latent factor regression models generally achieved higher statistical performance than other sparse methods, including the least absolute shrinkage and selection operator and a Bayesian sparse linear mixed model. In generative model simulations, statistical performance was slightly lower than, though comparable to, that of non-sparse methods, but in simulations based on empirical data, sparse latent factor regression models were more robust to departure from the model than the non-sparse approaches. We applied sparse latent factor regression models to a genome-wide association study of a flowering trait for the plant Arabidopsis thaliana and to an epigenome-wide association study of smoking status in pregnant women. For both applications, sparse latent factor regression models facilitated the estimation of non-null effect sizes while overcoming multiple testing issues. The results were not only consistent with previous discoveries, but they also pinpointed new genes with functional annotations relevant to each application.
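As an illustration of how effect sizes and confounders could be estimated jointly, the sketch below alternates two exact least-squares updates. It is not taken from the article: it assumes an objective of the form 0.5 * ||Y - W - X B^T||_F^2 + mu * ||B||_1 + gamma * ||W||_*, where W is a low-rank confounder matrix, and it assumes that X has orthonormal columns. The names `sparse_lfm_sketch`, `objective`, `mu` and `gamma` are hypothetical.

```python
import numpy as np

def soft_threshold(a, t):
    """Elementwise soft-thresholding, the proximal operator of t * |a|."""
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def objective(Y, X, B, W, mu, gamma):
    """Assumed penalized least-squares objective (hypothetical form)."""
    resid = Y - W - X @ B.T
    return (0.5 * np.sum(resid ** 2)
            + mu * np.sum(np.abs(B))
            + gamma * np.linalg.norm(W, "nuc"))

def sparse_lfm_sketch(Y, X, mu, gamma, n_iter=50):
    """Alternating minimization sketch.

    Y: (n, p) marker matrix; X: (n, d) variables of interest.
    Assumption: X has orthonormal columns (X.T @ X = I), which makes
    the B update an exact minimizer.
    Returns B (p, d), sparse effect sizes, and W (n, p), the estimated
    low-rank confounder matrix.
    """
    n, p = Y.shape
    d = X.shape[1]
    B = np.zeros((p, d))
    W = np.zeros((n, p))
    for _ in range(n_iter):
        # Effect sizes: L1 proximal step on the residual correlations.
        B = soft_threshold((Y - W).T @ X, mu)
        # Confounders: singular value soft-thresholding (nuclear norm prox).
        U, s, Vt = np.linalg.svd(Y - X @ B.T, full_matrices=False)
        W = (U * soft_threshold(s, gamma)) @ Vt
    return B, W
```

Because each update minimizes the assumed objective exactly in one block of variables, the objective cannot increase from one iteration to the next. Larger values of `mu` set more entries of `B` to exactly zero, and larger values of `gamma` lower the rank of `W`.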
Keywords: confounding factors; epigenome-wide association; genome-wide association; sparse model; statistical methods
Date: 2022
Downloads: https://doi.org/10.1515/sagmb-2021-0035 (text/html)
For access to full text, subscription to the journal or payment for the individual article is required.
Persistent link: https://EconPapers.repec.org/RePEc:bpj:sagmbi:v:21:y:2022:i:1:p:19:n:2
Ordering information: This journal article can be ordered from
https://www.degruyter.com/journal/key/sagmb/html
DOI: 10.1515/sagmb-2021-0035
Statistical Applications in Genetics and Molecular Biology is currently edited by Michael P. H. Stumpf
More articles in Statistical Applications in Genetics and Molecular Biology from De Gruyter
Bibliographic data for series maintained by Peter Golla.