Alignment-free Sequence Comparison for Biologically Realistic Sequences of Moderate Length

Conrad J. Burden, Junmei Jing and Susan R. Wilson

Statistical Applications in Genetics and Molecular Biology, 2011, vol. 11, issue 1, 1-28

Abstract: The D2 statistic, defined as the number of matches of words of some pre-specified length k, is a computationally fast alignment-free measure of biological sequence similarity. However, there is some debate about its suitability for this purpose, as the variability in D2 may be dominated by terms that reflect only the noise in each of the single sequences. We examine the extent of the problem and the effectiveness of overcoming it by using two mean-centred variants of this statistic, D2* and D2c. We conclude that all three statistics are potentially useful measures of sequence similarity, for which reasonably accurate p-values can be estimated under a null hypothesis of sequences composed of independent and identically distributed letters. We show that D2 and D2c, and to a somewhat lesser extent D2*, perform well in tests to classify moderate-length query sequences as putative cis-regulatory modules.
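
As an illustration of the definition in the abstract, the plain D2 statistic can be sketched as a sum, over every word of length k, of the product of that word's counts in the two sequences. This is a minimal sketch and not the authors' implementation: the function names `kmer_counts` and `d2` are hypothetical, overlapping words are assumed to be counted, and the mean-centred variants D2* and D2c are not shown because the abstract does not give their formulas.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping word of length k in seq (hypothetical helper)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a, seq_b, k):
    """Number of k-word matches between seq_a and seq_b.

    Each occurrence of a word in seq_a is matched against each occurrence
    of the same word in seq_b, so the total is the sum over words of the
    product of the two counts.
    """
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # Counter returns 0 for words that are absent from seq_b.
    return sum(n * counts_b[word] for word, n in counts_a.items())
```

For example, `d2("AAAA", "AAA", 2)` counts three occurrences of `AA` in the first sequence and two in the second, giving six matches. Counting words first keeps the cost roughly linear in the sequence lengths, which is the sense in which the measure is computationally fast.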

Date: 2011

Downloads: (external link)
https://doi.org/10.2202/1544-6115.1724 (text/html)
For access to full text, subscription to the journal or payment for the individual article is required.

Persistent link: https://EconPapers.repec.org/RePEc:bpj:sagmbi:v:11:y:2011:i:1:n:3

Ordering information: This journal article can be ordered from
https://www.degruyter.com/journal/key/sagmb/html

DOI: 10.2202/1544-6115.1724

Statistical Applications in Genetics and Molecular Biology is currently edited by Michael P. H. Stumpf

Handle: RePEc:bpj:sagmbi:v:11:y:2011:i:1:n:3