Bayesian clustering of DNA sequences using Markov chains and a stochastic partition model
Jääskinen Väinö,
Parkkinen Ville,
Cheng Lu and
Corander Jukka
Additional contact information
Jääskinen Väinö: Department of Mathematics and Statistics, University of Helsinki, FI-00014, Finland
Parkkinen Ville: Department of Mathematics and Statistics, University of Helsinki, FI-00014, Finland
Cheng Lu: Department of Mathematics and Statistics, University of Helsinki, FI-00014, Finland
Corander Jukka: Department of Mathematics and Statistics, University of Helsinki, Åbo Akademi University, FI-20500, Åbo, Finland
Statistical Applications in Genetics and Molecular Biology, 2014, vol. 13, issue 1, 105-121
Abstract:
In many biological applications it is necessary to cluster DNA sequences into groups that represent underlying organismal units, such as named species or genera. In metagenomics this grouping typically has to be achieved on the basis of relatively short sequences that contain different types of errors, which makes a statistical modeling approach desirable. Here we introduce a novel method for this purpose by developing a stochastic partition model that clusters Markov chains of a given order. The model is based on a Dirichlet process prior, and we use conjugate priors for the Markov chain parameters, which yields an analytical expression for comparing the marginal likelihoods of any two partitions. To find a good candidate for the posterior mode in the partition space, we use a hybrid computational approach that combines the EM algorithm with a greedy search. This approach is shown to be faster than earlier clustering methods proposed for the metagenomics application and to yield highly accurate results in comparison. Our model is fairly generic and could also be used to cluster other types of sequence data for which Markov chains provide a reasonable way to compress information, as illustrated by experiments on shotgun sequence type data from an Escherichia coli strain.
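The role of the conjugate priors can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes a symmetric Dirichlet prior with a hypothetical hyperparameter `alpha=0.5` on each row of the transition matrix, conditions on the first `order` bases of every sequence, and leaves out both the Dirichlet process prior over partitions and the EM step mentioned in the abstract. The function names (`transition_counts`, `log_marginal`, `greedy_merge`) are illustrative only. Under these assumptions the marginal likelihood of a cluster has a closed form in terms of gamma functions, so a greedy search can score each candidate merge directly from pooled transition counts.

```python
from collections import Counter
from itertools import product
from math import lgamma

ALPHABET = "ACGT"


def transition_counts(seq, order):
    """Count (context, next base) pairs of a Markov chain of the given order."""
    counts = Counter()
    for i in range(order, len(seq)):
        window = seq[i - order:i + 1]
        if all(base in ALPHABET for base in window):  # skip ambiguous bases such as N
            counts[(window[:-1], window[-1])] += 1
    return counts


def log_marginal(counts, order, alpha=0.5):
    """Log marginal likelihood of pooled counts for one cluster.

    Assumption: every row of the transition matrix has an independent
    symmetric Dirichlet(alpha) prior, so the parameters integrate out
    analytically (Dirichlet-multinomial form).
    """
    k = len(ALPHABET)
    total = 0.0
    for context in map("".join, product(ALPHABET, repeat=order)):
        row = [counts.get((context, base), 0) for base in ALPHABET]
        total += lgamma(k * alpha) - lgamma(k * alpha + sum(row))
        total += sum(lgamma(alpha + c) - lgamma(alpha) for c in row)
    return total


def greedy_merge(seqs, order=1, alpha=0.5, min_gain=1e-9):
    """Start from singleton clusters and repeatedly apply the best pairwise merge.

    A merge is accepted only if it raises the summed log marginal likelihood
    by more than min_gain (a small tolerance against floating-point noise).
    """
    clusters = [([s], transition_counts(s, order)) for s in seqs]
    while len(clusters) > 1:
        scores = [log_marginal(c, order, alpha) for _, c in clusters]
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pooled = clusters[i][1] + clusters[j][1]
                gain = log_marginal(pooled, order, alpha) - scores[i] - scores[j]
                if gain > min_gain and (best is None or gain > best[0]):
                    best = (gain, i, j, pooled)
        if best is None:  # no merge improves the score
            break
        _, i, j, pooled = best
        merged = (clusters[i][0] + clusters[j][0], pooled)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [members for members, _ in clusters]


if __name__ == "__main__":
    # Toy reads: two alternate A and C, two consist of long homopolymer runs.
    reads = ["AC" * 20, "A" * 20 + "C" * 20, "CA" * 20, "C" * 20 + "A" * 20]
    for members in greedy_merge(reads, order=1):
        print(members)
```

Because the score of a partition is a sum over clusters, the gain of a merge depends only on the two clusters involved, which is what keeps each greedy step cheap in this sketch. A fuller treatment would add the prior probability of the partition to each score.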
Keywords: Clustering; DNA sequences; Markov chains; metagenomics
Date: 2014
Full text: https://doi.org/10.1515/sagmb-2013-0031
For access to full text, subscription to the journal or payment for the individual article is required.
Persistent link: https://EconPapers.repec.org/RePEc:bpj:sagmbi:v:13:y:2014:i:1:p:105-121:n:7
Ordering information: This journal article can be ordered from
https://www.degruyter.com/journal/key/sagmb/html
DOI: 10.1515/sagmb-2013-0031
Statistical Applications in Genetics and Molecular Biology is currently edited by Michael P. H. Stumpf