Supervised Detection of Regulatory Motifs in DNA Sequences

Sunduz Keles; Mark J. van der Laan; Sandrine Dudoit; Biao Xing; Michael B. Eisen

Additional contact information
Sunduz Keles: Division of Biostatistics, School of Public Health, University of California, Berkeley
Mark J. van der Laan: Division of Biostatistics, School of Public Health, University of California, Berkeley
Sandrine Dudoit: Division of Biostatistics, School of Public Health, University of California, Berkeley
Biao Xing: Division of Biostatistics, School of Public Health, University of California, Berkeley
Michael B. Eisen: Department of Molecular and Cell Biology, University of California, Berkeley; Life Sciences Division, Ernest Orlando Lawrence Berkeley National Laboratory, Berkeley

Statistical Applications in Genetics and Molecular Biology, 2003, vol. 2, issue 1, 40

Abstract: Identification of transcription factor binding sites (regulatory motifs) is a major interest in contemporary biology. We propose a new likelihood-based method, COMODE, for identifying structural motifs in DNA sequences. Commonly used methods (e.g., MEME, Gibbs motif sampler) model binding sites as families of sequences described by a position weight matrix (PWM) and identify PWMs that maximize the likelihood of observed sequence data under a simple multinomial mixture model. This model assumes that the positions of the PWM correspond to independent multinomial distributions with four cell probabilities. We address the problem of supervising the search for DNA binding sites with information derived from structural characteristics of protein-DNA interactions. We extend the simple multinomial mixture model to a constrained multinomial mixture model by incorporating constraints on the information content profiles or on specific parameters of the motif PWMs. The parameters of this extended model are estimated by maximum likelihood using a nonlinear constraint optimization method. Likelihood-based cross-validation is used to select model parameters such as motif width and constraint type. The performance of COMODE is compared with existing motif detection methods on simulated data that incorporate real motif examples from Saccharomyces cerevisiae. The proposed method is especially effective when the motif of interest appears as a weak signal in the data. Some of the transcription factor binding data of Lee et al. (2002) were also analyzed using COMODE, and biologically verified sites were identified.
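
Illustrative sketch (not part of the original record, and not the COMODE implementation): the abstract refers to a PWM whose positions are independent four-cell multinomials, to the information content profile of that PWM, and to a multinomial mixture likelihood. The Python below shows one common way to compute these quantities. It assumes a uniform background for the information content, a two-component mixture (motif versus background) over pre-cut windows of fixed width, and strictly positive PWM entries. All names and numbers in it (information_content, mixture_log_likelihood, example_pwm, the mixing weight 0.5) are hypothetical and chosen only for illustration.

```python
import math

BASES = "ACGT"


def information_content(pwm):
    """Per-position information content in bits, assuming a uniform background.

    Each row of pwm is one motif position holding probabilities for A, C, G, T.
    """
    profile = []
    for column in pwm:
        entropy = -sum(p * math.log2(p) for p in column if p > 0)
        profile.append(2.0 - entropy)  # 2 bits is the maximum for four letters
    return profile


def mixture_log_likelihood(windows, pwm, background, motif_weight):
    """Log-likelihood of fixed-width windows under a two-component mixture.

    Each window is drawn from the motif PWM with probability motif_weight,
    otherwise from the position-independent background distribution.
    """
    total = 0.0
    for window in windows:
        motif_prob = 1.0
        background_prob = 1.0
        for position, base in enumerate(window):
            index = BASES.index(base)
            motif_prob *= pwm[position][index]
            background_prob *= background[index]
        total += math.log(
            motif_weight * motif_prob + (1.0 - motif_weight) * background_prob
        )
    return total


# Hypothetical 3-position PWM: two conserved positions around a free one.
example_pwm = [
    [0.97, 0.01, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],
    [0.01, 0.01, 0.97, 0.01],
]
uniform_background = [0.25, 0.25, 0.25, 0.25]
example_windows = ["ACG", "ATG", "CCT"]

ic_profile = information_content(example_pwm)
log_lik = mixture_log_likelihood(
    example_windows, example_pwm, uniform_background, 0.5
)
```

In this toy PWM the information content profile is high at the two outer positions and zero in the middle. A constraint on the information content profile, as described in the abstract, would restrict the shape of such a profile during maximum likelihood estimation; the constrained optimization step itself is not sketched here.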

Keywords: DNA sequence; co-regulated genes; transcription factor; regulatory motif; mixture model; position weight matrix; structured motif; information content; entropy; nonlinear constraint maximization
Date: 2003

Downloads:
https://doi.org/10.2202/1544-6115.1015
For access to full text, subscription to the journal or payment for the individual article is required.

Persistent link: https://EconPapers.repec.org/RePEc:bpj:sagmbi:v:2:y:2003:i:1:n:5

Ordering information: This journal article can be ordered from
https://www.degruyter.com/journal/key/sagmb/html

DOI: 10.2202/1544-6115.1015

Statistical Applications in Genetics and Molecular Biology is currently edited by Michael P. H. Stumpf

Bibliographic data for series maintained by Peter Golla.