Bayesian clustering of DNA sequences using Markov chains and a stochastic partition model

Väinö Jääskinen, Ville Parkkinen, Lu Cheng and Jukka Corander

Additional contact information
Jääskinen Väinö: Department of Mathematics and Statistics, University of Helsinki, FI-00014, Finland
Parkkinen Ville: Department of Mathematics and Statistics, University of Helsinki, FI-00014, Finland
Cheng Lu: Department of Mathematics and Statistics, University of Helsinki, FI-00014, Finland
Corander Jukka: Department of Mathematics and Statistics, University of Helsinki, Åbo Akademi University, FI-20500, Åbo, Finland

Statistical Applications in Genetics and Molecular Biology, 2014, vol. 13, issue 1, 105-121

Abstract: In many biological applications it is necessary to cluster DNA sequences into groups that represent underlying organismal units, such as named species or genera. In metagenomics this grouping typically needs to be achieved on the basis of relatively short sequences that contain different types of errors, making a statistical modeling approach desirable. Here we introduce a novel method for this purpose by developing a stochastic partition model that clusters Markov chains of a given order. The model is based on a Dirichlet process prior, and we use conjugate priors for the Markov chain parameters, which yields an analytical expression for comparing the marginal likelihoods of any two partitions. To find a good candidate for the posterior mode in the partition space, we use a hybrid computational approach that combines the EM algorithm with a greedy search. This approach is shown to be faster and to yield highly accurate results compared with previously suggested clustering methods for the metagenomics application. Our model is fairly generic and could also be used for clustering other types of sequence data for which Markov chains provide a reasonable way to compress information, as illustrated by experiments on shotgun-sequence-type data from an Escherichia coli strain.
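
The analytical comparison of partitions mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a symmetric Dirichlet prior (hypothetical hyperparameter `alpha`) on each row of the transition matrix, conditions on the first `order` bases of every sequence, and leaves out the Dirichlet process prior term over partitions. The function names `transition_counts`, `log_marginal` and `log_merge_gain` are illustrative only.

```python
from collections import Counter
from math import lgamma

ALPHABET = "ACGT"

def transition_counts(seqs, order):
    """Count (context, next base) pairs over all sequences in one cluster."""
    counts = Counter()
    for s in seqs:
        for i in range(order, len(s)):
            ctx, nxt = s[i - order:i], s[i]
            # Assumption: windows containing ambiguous bases are skipped.
            if set(ctx + nxt) <= set(ALPHABET):
                counts[(ctx, nxt)] += 1
    return counts

def log_marginal(seqs, order=2, alpha=1.0):
    """Log marginal likelihood of a cluster under a Markov chain of the given
    order, with transition probabilities integrated out against a symmetric
    Dirichlet(alpha) prior (Dirichlet-multinomial form)."""
    counts = transition_counts(seqs, order)
    k = len(ALPHABET)
    total = 0.0
    # Contexts that were never observed contribute exactly zero.
    for ctx in {c for c, _ in counts}:
        n = [counts[(ctx, b)] for b in ALPHABET]
        total += lgamma(k * alpha) - lgamma(k * alpha + sum(n))
        total += sum(lgamma(alpha + c) - lgamma(alpha) for c in n)
    return total

def log_merge_gain(cluster_a, cluster_b, order=2, alpha=1.0):
    """Log ratio of marginal likelihoods: merged cluster versus two clusters.
    Positive values favour merging (partition prior term omitted)."""
    merged = log_marginal(cluster_a + cluster_b, order, alpha)
    return merged - log_marginal(cluster_a, order, alpha) \
                  - log_marginal(cluster_b, order, alpha)

if __name__ == "__main__":
    a = ["ACACACACAC"] * 3
    b = ["ACACACACAC"] * 3
    c = ["AAAAAAAAAA"] * 3
    print("merge a+b:", log_merge_gain(a, b, order=1))
    print("merge a+c:", log_merge_gain(a, c, order=1))
```

Because the score factorizes over clusters, a greedy search of the kind the abstract describes only needs to re-evaluate the clusters that a proposed move changes.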

Keywords: Clustering; DNA sequences; Markov chains; metagenomics
Date: 2014

Persistent link: https://EconPapers.repec.org/RePEc:bpj:sagmbi:v:13:y:2014:i:1:p:105-121:n:7

DOI: 10.1515/sagmb-2013-0031

Statistical Applications in Genetics and Molecular Biology is currently edited by Michael P. H. Stumpf

More articles in Statistical Applications in Genetics and Molecular Biology from De Gruyter