Bayesian multiple logistic regression for case-control GWAS

Banerjee, Saikat; Zeng, Lingyao; Schunkert, Heribert; Söding, Johannes

PLOS Genetics, 2018, vol. 14, issue 12, 1-27

Abstract: Genetic variants in genome-wide association studies (GWAS) are tested for disease association mostly using simple regression, one variant at a time. Standard approaches to improve power in detecting disease-associated SNPs use multiple regression with Bayesian variable selection in which a sparsity-enforcing prior on effect sizes is used to avoid overtraining and all effect sizes are integrated out for posterior inference. For binary traits, the logistic model has not yielded clear improvements over the linear model. For multi-SNP analysis, the logistic model required costly and technically challenging MCMC sampling to perform the integration. Here, we introduce the quasi-Laplace approximation to solve the integral and avoid MCMC sampling. We expect the logistic model to perform much better than multiple linear regression except when predicted disease risks are spread closely around 0.5, because only close to its inflection point can the logistic function be well approximated by a linear function. Indeed, in extensive benchmarks with simulated phenotypes and real genotypes, our Bayesian multiple LOgistic REgression method (B-LORE) showed considerable improvements (1) when regressing on many variants in multiple loci at heritabilities ≥ 0.4 and (2) for unbalanced case-control ratios. B-LORE also enables meta-analysis by approximating the likelihood functions of individual studies by multivariate normal distributions, using their means and covariance matrices as summary statistics. Our work should make sparse multiple logistic regression attractive also for other applications with binary target variables. B-LORE is freely available from: https://github.com/soedinglab/b-lore.

Author summary: In recent years, genome-wide association studies (GWAS) have become the primary approach for identifying genetic variants associated with the origination of complex diseases. In case-control GWAS, the genotypes of roughly equal numbers of diseased (“cases”) and healthy (“controls”) people are compared to determine which genetic variants are significantly more frequent among cases. From the disease-associated variants we hope to get insights into how the disease develops. To find the disease-associated variants, a linear relationship between the disease risk and the number of minor alleles at the variant sites has usually been assumed, because the more appropriate sigmoid relationship requires slow and cumbersome sampling techniques. We found an efficient analytical approximation that renders sampling unnecessary and makes our multiple logistic regression model easy to train. We show that it outperforms the usually employed multiple linear regression model whenever nonlinearities become strong, which is the case, for example, when the numbers of case and control patients differ significantly. Therefore, novel genetic disease-associated variants could be found by adding controls to existing case-control GWAS and reanalyzing them with B-LORE.
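
Illustration (not from the article): the abstract argues that a linear function approximates the logistic function well only near its inflection point, that is, when disease risks lie close to 0.5. The minimal Python sketch below compares the logistic curve with its tangent line around a baseline risk of 0.5 and around a baseline risk of 0.1. The function names, the choice of a tangent line as the linear approximation, and the window of one log-odds unit are assumptions made for this example; they are not part of B-LORE.

```python
import math


def logistic(z):
    """Standard logistic function mapping a log-odds value to a risk in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def tangent_line(z, z0):
    """First-order Taylor expansion of the logistic function around z0."""
    p0 = logistic(z0)
    # The derivative of the logistic function at z0 is p0 * (1 - p0).
    return p0 + p0 * (1.0 - p0) * (z - z0)


def max_linear_error(p_center, half_width=1.0, steps=200):
    """Largest gap between the logistic curve and its tangent line.

    The gap is evaluated on an evenly spaced grid of log-odds values in
    [z0 - half_width, z0 + half_width], where z0 is the log-odds of the
    baseline risk p_center (an illustrative choice, not from the article).
    """
    z0 = math.log(p_center / (1.0 - p_center))
    worst = 0.0
    for i in range(steps + 1):
        z = z0 - half_width + 2.0 * half_width * i / steps
        worst = max(worst, abs(logistic(z) - tangent_line(z, z0)))
    return worst


# Baseline risk at the inflection point versus a low baseline risk.
balanced = max_linear_error(0.5)
unbalanced = max_linear_error(0.1)
print(f"max gap around risk 0.5: {balanced:.4f}")
print(f"max gap around risk 0.1: {unbalanced:.4f}")
```

Under these assumptions the gap is smaller around a risk of 0.5 than around a risk of 0.1, which is consistent with the abstract's expectation that the linear model loses accuracy when risks move away from 0.5.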

Date: 2018

Downloads: (external link)
https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1007856 (text/html)
https://journals.plos.org/plosgenetics/article/fil ... 07856&type=printable (application/pdf)

Persistent link: https://EconPapers.repec.org/RePEc:plo:pgen00:1007856

DOI: 10.1371/journal.pgen.1007856

More articles in PLOS Genetics from Public Library of Science