Deciphering the Structures of Genomic DNA Sequences Using Recurrence Time Statistics
Jian-Bo Gao,
Yinhe Cao and
Wen-wen Tung
Additional contact information
Jian-Bo Gao: University of Florida
Wen-wen Tung: National Center for Atmospheric Research
A chapter in Data Mining in Biomedicine, 2007, pp 321-337 from Springer
Abstract:
The completion of the human genome and the genomes of many other organisms calls for faster computational tools that can readily identify structures in DNA sequences and extract features from them. Such tools are even more important for sequencing the uncompleted genomes of many other organisms, such as floro- and neuro- genomes. Among the more important structures in a DNA sequence are repeat-related ones. These often have to be masked before protein-coding regions along a DNA sequence are identified or redundant expressed sequence tags are sequenced. Here we report a novel recurrence-time-based method for sequence analysis. The method can conveniently study all kinds of periodicity and exhaustively find all repeat-related features in a genomic DNA sequence. An efficient codon index can also be derived from the recurrence time statistics; it has two salient features: it is largely species-independent and it works well on very short sequences. Efficient codon indices are key elements of successful gene-finding algorithms and are particularly useful for determining whether a suspected expressed sequence tag belongs to a coding or non-coding region. We illustrate the power of the method by studying the genomes of E. coli, the yeast S. cerevisiae, the nematode worm C. elegans, and the human, Homo sapiens. For a sequence of length N, our method requires only approximately 6N bytes of memory and a computational time on the order of N log N to extract all the repeat-related and periodic or quasi-periodic features, without any prior knowledge of the consensus sequence of those features; it therefore enables analysis on the whole-genome scale.
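As a rough illustration of what a recurrence time is, the sketch below records the gap between successive occurrences of each length-k word in a sequence and tallies those gaps. It is not the chapter's algorithm: the function names, the dictionary-based bookkeeping, and the choice to measure only the gap to the previous occurrence are assumptions made for this example, and it makes no attempt to match the memory or time bounds stated in the abstract.

```python
from collections import Counter


def recurrence_times(seq, k):
    """Return the gaps between successive occurrences of each length-k word.

    Hypothetical helper for illustration only; the chapter's own
    definition and implementation may differ.
    """
    last_seen = {}  # word -> start position of its most recent occurrence
    gaps = []
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in last_seen:
            gaps.append(i - last_seen[word])
        last_seen[word] = i
    return gaps


def recurrence_histogram(seq, k):
    """Count how often each recurrence time (gap) occurs."""
    return Counter(recurrence_times(seq, k))


if __name__ == "__main__":
    # A toy sequence with an exact period of 3.
    print(recurrence_histogram("ACGACGACG", 3))
```

In this toy case every gap equals 3, so the histogram has a single entry at 3. On real sequences, a pronounced excess at particular gaps would be one simple way to look for periodic or repeat-related structure.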
Keywords: Genomic DNA sequence; repeat-related structures; coding region identification; recurrence time statistics
Date: 2007
Persistent link: https://EconPapers.repec.org/RePEc:spr:spochp:978-0-387-69319-4_18
Ordering information: This item can be ordered from
http://www.springer.com/9780387693194
DOI: 10.1007/978-0-387-69319-4_18
More chapters in Springer Optimization and Its Applications from Springer