SeqGL Identifies Context-Dependent Binding Signals in Genome-Wide Regulatory Element Maps
Manu Setty and Christina S Leslie
PLOS Computational Biology, 2015, vol. 11, issue 5, 1-21
Abstract:
Genome-wide maps of transcription factor (TF) occupancy and regions of open chromatin implicitly contain DNA sequence signals for multiple factors. We present SeqGL, a novel de novo motif discovery algorithm to identify multiple TF sequence signals from ChIP-, DNase-, and ATAC-seq profiles. SeqGL trains a discriminative model using a k-mer feature representation together with group lasso regularization to extract a collection of sequence signals that distinguish peak sequences from flanking regions. Benchmarked on over 100 ChIP-seq experiments, SeqGL outperformed traditional motif discovery tools in discriminative accuracy. Furthermore, SeqGL can be naturally used with multitask learning to identify genomic and cell-type context determinants of TF binding. SeqGL successfully scales to the large multiplicity of sequence signals in DNase- or ATAC-seq maps. In particular, SeqGL was able to identify a number of ChIP-seq validated sequence signals that were not found by traditional motif discovery algorithms. Thus, compared to widely used motif discovery algorithms, SeqGL demonstrates both greater discriminative accuracy and higher sensitivity for detecting the DNA sequence signals underlying regulatory element maps. SeqGL is available at http://cbio.mskcc.org/public/Leslie/SeqGL/.
Author Summary:
Transcriptional regulation is the cell’s primary mode of controlling gene expression. Transcription factors (TFs) are proteins that recognize and bind specific DNA sequence signals to regulate the expression of target genes. Recent years have seen the rapid development of genome-wide assays to profile the binding locations of a single TF or, more generally, regions of open chromatin that are occupied by a complex repertoire of DNA binding factors. New methods are therefore needed to detect and represent DNA sequence signals in these genome-wide regulatory element maps. Here we present a novel tool called SeqGL to extract multiple TF binding signals from genome-wide maps. SeqGL employs a machine learning framework to identify features that best discriminate the peaks, where we expect DNA sequence signals to occur, from the flank regions that should not contain these signals. Our tool performed significantly better than widely used motif discovery methods in discriminative accuracy and achieved higher sensitivity in detecting the numerous sequence signals underlying regulatory element maps.
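The general idea described above (k-mer features, a group lasso penalty, and a classifier that separates peak sequences from flanks) can be illustrated with a small sketch. This is not the SeqGL implementation: the function names, the toy sequences, the plain k-mer counts, the logistic loss, the proximal gradient solver, and especially the grouping of k-mers by their first base are all assumptions made for illustration only.

```python
from itertools import product

import numpy as np


def kmer_counts(seq, k=3):
    """Count every DNA k-mer in seq using overlapping windows."""
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = np.zeros(len(index))
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # skip windows containing N or other symbols
            counts[j] += 1
    return counts


def group_soft_threshold(w, groups, t):
    """Proximal step for the group lasso: shrink each group's L2 norm by t,
    setting the whole group to zero when its norm is at most t."""
    out = np.zeros_like(w)
    for g in groups:
        norm = np.linalg.norm(w[g])
        if norm > t:
            out[g] = (1.0 - t / norm) * w[g]
    return out


def fit_group_lasso_logistic(X, y, groups, lam=0.1, lr=0.1, n_iter=500):
    """Proximal gradient descent on mean logistic loss + lam * sum_g ||w_g||_2."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))   # predicted peak probability
        grad = X.T @ (p - y) / len(y)        # gradient of the logistic loss
        w = group_soft_threshold(w - lr * grad, groups, lr * lam)
    return w


# Toy data (hypothetical sequences, not taken from any experiment).
peaks = ["TTGATAAGCA", "CAGATAAGGT", "GGGATAAGTC"]
flanks = ["TTCCGGCCTA", "CACGTGCCGT", "GGCTCTCTTC"]

X = np.array([kmer_counts(s, k=3) for s in peaks + flanks])
y = np.array([1.0] * len(peaks) + [0.0] * len(flanks))

# Assumed grouping: the 64 3-mers split into 4 groups by first base.
# This is an arbitrary choice made only to give the penalty some groups.
groups = [list(range(16 * i, 16 * (i + 1))) for i in range(4)]

w = fit_group_lasso_logistic(X, y, groups, lam=0.05)
group_norms = [float(np.linalg.norm(w[g])) for g in groups]
```

In this sketch, groups whose norm in `group_norms` is zero have been dropped by the penalty as a whole, which is the property that makes group lasso suitable for selecting sets of related k-mers rather than individual ones.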
Date: 2015
Downloads:
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004271 (text/html)
https://journals.plos.org/ploscompbiol/article/fil ... 04271&type=printable (application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:plo:pcbi00:1004271
DOI: 10.1371/journal.pcbi.1004271
More articles in PLOS Computational Biology from Public Library of Science