Cross-species regulatory sequence activity prediction
David R Kelley
PLOS Computational Biology, 2020, vol. 16, issue 7, 1-27
Abstract:
Machine learning algorithms trained to predict the regulatory activity of nucleic acid sequences have revealed principles of gene regulation and guided genetic variation analysis. While the human genome has been extensively annotated and studied, model organisms have been less explored. Model organism genomes offer both additional training sequences and unique annotations describing tissue and cell states unavailable in humans. Here, we develop a strategy to train deep convolutional neural networks simultaneously on multiple genomes and apply it to learn sequence predictors for large compendia of human and mouse data. Training on both genomes improves gene expression prediction accuracy on held out and variant sequences. We further demonstrate a novel and powerful approach to apply mouse regulatory models to analyze human genetic variants associated with molecular phenotypes and disease. Together these techniques unleash thousands of non-human epigenetic and transcriptional profiles toward more effective investigation of how gene regulation affects human disease.
Author summary:
Human population genetic studies have highlighted thousands of genomic sites that correlate with traits and diseases that do not modify gene sequences directly, but instead modify where and when those genes are expressed. To better understand how these sites influence traits and diseases, and consider their relevance for drug development, we need better models for how DNA sequences determine gene expression. Recently, machine learning algorithms based on deep artificial neural networks have proven to be promising tools toward this end. In this work, we improve upon the state of the art model accuracy by combining training data from both humans and mice. Using these models, we can predict the effect of a genetic variant on gene expression in any tissue or cell type with available data. We further demonstrate that predictions for human variants derived from mouse training datasets are highly informative and offer unique insight into the genetic basis of gene expression and disease.
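The variant effect prediction mentioned in the author summary can be illustrated with a minimal sketch. It assumes, as is common for sequence-based models, that a variant is scored by the difference between the model's predictions for the alternate and reference alleles. The names `one_hot`, `toy_model`, and `variant_effect` are hypothetical, and the random linear filter is only a stand-in for a trained network; none of this is taken from the article's own code.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (length, 4) one-hot array; unknown bases stay all-zero."""
    arr = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        j = BASES.find(base)
        if j >= 0:
            arr[i, j] = 1.0
    return arr

def toy_model(x, weights):
    """Hypothetical stand-in for a trained predictor: one linear filter per output track.

    x has shape (length, 4); weights has shape (tracks, length, 4).
    Returns an array of shape (tracks,), one predicted activity per track.
    """
    return np.tensordot(weights, x, axes=([1, 2], [0, 1]))

def variant_effect(ref_seq, pos, alt_base, predict):
    """Score a single-nucleotide variant as predict(alt) - predict(ref), per track."""
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    return predict(one_hot(alt_seq)) - predict(one_hot(ref_seq))

# Toy usage: random weights stand in for a model trained on, e.g., mouse profiles.
rng = np.random.default_rng(0)
ref = "ACGTACGTAC"
weights = rng.normal(size=(3, len(ref), 4)).astype(np.float32)
scores = variant_effect(ref, 4, "G", lambda x: toy_model(x, weights))
print(scores.shape)  # one score per output track
```

Because the scoring function only needs a `predict` callable, the same comparison could in principle be run with a predictor trained on either species, where each output track corresponds to one tissue or cell type profile.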
Date: 2020
Downloads: (external link)
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008050 (text/html)
https://journals.plos.org/ploscompbiol/article/fil ... 08050&type=printable (application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:plo:pcbi00:1008050
DOI: 10.1371/journal.pcbi.1008050
More articles in PLOS Computational Biology from Public Library of Science