Predicting the Metagenomics Content with Multiple CART Trees
Dante Travisany,
Diego Galarce,
Alejandro Maass and
Rodrigo Assar
Additional contact information
Dante Travisany: Universidad de Chile, Departamento de Ingeniería Matemática, Center for Mathematical Modeling
Diego Galarce: Universidad de Chile, Departamento de Ingeniería Matemática, Center for Mathematical Modeling
Alejandro Maass: Universidad de Chile, Departamento de Ingeniería Matemática, Center for Mathematical Modeling
Rodrigo Assar: Universidad de Chile, Instituto de Ciencias Biomédicas, Escuela de Medicina
A chapter in Mathematical Models in Biology, 2015, pp 145-160 from Springer
Abstract:
Metagenomics is a technique for characterizing and identifying microbial genomes by isolating genomic DNA directly from the environment, without cultivation. One of the key steps in this process is the taxonomic classification and clustering of the DNA fragments, a process also known as binning. To date, the most common practice is to classify reads through alignments to public databases. When a representative species is present in the database, the process is simple and successful; if not, taxonomic abundances are underestimated. In this work we propose an alignment-free method capable of assigning a taxon to each read in the sample by analyzing the statistical properties of the reads. Given an environment, we collect genomes from publicly available databases and generate genomic fragment libraries. Then, statistics of k-mer frequencies, GC ratio and GC skew are computed for each read and stored in an environment-associated dataset, which is used to build a robust machine learning procedure based on multiple CART trees. Finally, for each read, every CART tree predicts a taxon and the most voted taxa are selected. The method was tested using simulated and public human gut microbiome data sets. The database was constructed using 98 genera present in the gastrointestinal tract and available from the Human Microbiome Project. A multiple-CART predictor with 558 trees was generated, capable of estimating the genus and abundance in the sample with 47% accuracy in read assignments. Performance rates are comparable with those of semi-supervised methods, and computation times were reduced thanks to the alignment-free methodology. When restricted to 17 genera considered earlier, the method increases its accuracy to 77%.
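The per-read statistics and the voting step named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the default k = 2, the normalization of k-mer counts to relative frequencies, and the formulas used here for GC ratio, (G + C) / (A + C + G + T), and GC skew, (G - C) / (G + C), are assumptions based on the common definitions, since the abstract does not spell them out. Training the CART trees is out of scope; the vote below only combines labels that trained trees are assumed to have produced.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(read, k=2):
    """Relative frequency of every k-mer over A, C, G, T in the read.

    Windows containing other symbols (for example N) are ignored.
    """
    read = read.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(read[i:i + k] for i in range(len(read) - k + 1))
    total = sum(counts[m] for m in kmers)
    return {m: (counts[m] / total if total else 0.0) for m in kmers}


def gc_ratio(read):
    """Assumed definition: (G + C) / (A + C + G + T)."""
    read = read.upper()
    acgt = sum(read.count(base) for base in "ACGT")
    return (read.count("G") + read.count("C")) / acgt if acgt else 0.0


def gc_skew(read):
    """Assumed definition: (G - C) / (G + C)."""
    read = read.upper()
    g, c = read.count("G"), read.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def read_features(read, k=2):
    """Feature dictionary for one read: k-mer frequencies plus GC statistics."""
    features = kmer_frequencies(read, k)
    features["gc_ratio"] = gc_ratio(read)
    features["gc_skew"] = gc_skew(read)
    return features


def majority_vote(predictions):
    """Label predicted by the most trees (ties go to the label seen first)."""
    return Counter(predictions).most_common(1)[0][0]
```

In this sketch each read becomes one feature vector (16 dinucleotide frequencies plus two GC statistics when k = 2), and `majority_vote` would be applied to the list of genus labels returned for that read by the individual trees.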
Keywords: Metagenomics content prediction; Human gut microbiome; CART trees; K-mer frequencies
Date: 2015
Persistent link: https://EconPapers.repec.org/RePEc:spr:sprchp:978-3-319-23497-7_11
Ordering information: This item can be ordered from
http://www.springer.com/9783319234977
DOI: 10.1007/978-3-319-23497-7_11