Supervised Detection of Conserved Motifs in DNA Sequences with Cosmo
Keles Sunduz and
J. van der Laan Mark
Additional contact information
Bembom Oliver: Division of Biostatistics, University of California, Berkeley
Keles Sunduz: Department of Statistics and Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison
J. van der Laan Mark: Division of Biostatistics, University of California, Berkeley
Statistical Applications in Genetics and Molecular Biology, 2007, vol. 6, issue 1, 1-51
A number of computational methods have been proposed for identifying transcription factor binding sites from a set of unaligned sequences that are thought to share the motif in question. We here introduce an algorithm, called cosmo, that allows this search to be supervised by specifying a set of constraints that the position weight matrix of the unknown motif must satisfy. Such constraints may be formulated, for example, on the basis of prior knowledge about the structure of the transcription factor in question. The algorithm is based on the same two-component multinomial mixture model used by MEME, with stronger reliance, however, on the likelihood principle instead of more ad-hoc criteria like the E-value. The intensity parameter in the ZOOPS and TCM models, for instance, is estimated based on a profile-likelihood approach, and the width of the unknown motif is selected based on BIC. These changes allow cosmo to outperform MEME even in the absence of any constraints, as evidenced by 2- to 3-fold greater sensitivity in some simulation studies. Additional improvements in performance can be achieved by selecting the model type (OOPS, ZOOPS, or TCM) data-adaptively or by supplying correctly specified constraints, especially if the motif appears only as a weak signal in the data. The algorithm can data-adaptively choose between working in a given constrained model or in the completely unconstrained model, guarding against the risk of supplying mis-specified constraints. Simulation studies suggest that this approach can offer 3 to 3.5 times greater sensitivity than MEME. The algorithm has been implemented in the form of a stand-alone C program as well as a web application that can be accessed at http://cosmoweb.berkeley.edu. An R package is available through Bioconductor (http://bioconductor.org).
References: Add references at CitEc
Citations: Track citations by RSS feed
Downloads: (external link)
For access to full text, subscription to the journal or payment for the individual article is required.
This item may be available elsewhere in EconPapers: Search for items with the same title.
Export reference: BibTeX
RIS (EndNote, ProCite, RefMan)
Persistent link: https://EconPapers.repec.org/RePEc:bpj:sagmbi:v:6:y:2007:i:1:n:8
Ordering information: This journal article can be ordered from
Access Statistics for this article
Statistical Applications in Genetics and Molecular Biology is currently edited by Michael P. H. Stumpf
More articles in Statistical Applications in Genetics and Molecular Biology from De Gruyter
Bibliographic data for series maintained by Peter Golla ().