Metagenome and Metatranscriptome Analyses Using Protein Family Profiles
Cuncong Zhong,
Anna Edlund,
Youngik Yang,
Jeffrey S McLean and
Shibu Yooseph
PLOS Computational Biology, 2016, vol. 12, issue 7, 1-22
Abstract:
Analyses of metagenome data (MG) and metatranscriptome data (MT) are often challenged by a paucity of complete reference genome sequences and the uneven/low sequencing depth of the constituent organisms in the microbial community, which respectively limit the power of reference-based alignment and de novo sequence assembly. These limitations make accurate protein family classification and abundance estimation challenging, which in turn hampers downstream analyses such as abundance profiling of metabolic pathways, identification of differentially encoded/expressed genes, and de novo reconstruction of complete gene and protein sequences from the protein family of interest. The profile hidden Markov model (HMM) framework enables the construction of probabilistic models for protein families that accurately model position-specific matches, insertions, and deletions. We present a novel homology detection algorithm that integrates a banded Viterbi algorithm for profile HMM parsing with an iterative framework for simultaneous alignment and assembly. The algorithm searches a given profile HMM of a protein family against a database of fragmentary MG/MT sequencing data and simultaneously assembles complete or near-complete gene and protein sequences of the protein family. The resulting program, HMM-GRASPx, demonstrates superior performance in aligning and assembling homologs when benchmarked on both simulated marine MG and real human saliva MG datasets. On real supragingival plaque and stool MG datasets that were generated from healthy individuals, HMM-GRASPx accurately estimates the abundances of the antimicrobial resistance (AMR) gene families and enables accurate characterization of the resistome profiles of these microbial communities. For real human oral microbiome MT datasets, using the HMM-GRASPx estimated transcript abundances significantly improves detection of differentially expressed (DE) genes.
Finally, HMM-GRASPx was used to reconstruct comprehensive sets of complete or near-complete protein and nucleotide sequences for the query protein families. HMM-GRASPx is freely available online from http://sourceforge.net/projects/hmm-graspx.
Author Summary:
Accurate analysis of microbial metabolism and function from metagenome and metatranscriptome data sets relies heavily on the comprehensive identification of protein family homologs present in these data. This task is routinely performed by aligning the individual reads against the profile hidden Markov models (HMMs) of protein families in the reference database. This strategy, however, is hindered by the fact that the reads usually represent only partial protein sequences, which contain insufficient information for their accurate classification. To tackle this problem, we present a targeted assembly algorithm that, based on the sequence overlap information, simultaneously reconstructs complete or near-complete protein sequences and estimates their homology given the HMMs of the protein families of interest. The reconstructed protein sequences contain more complete information regarding the function of the corresponding protein, thus facilitating accurate annotation of both the reconstructed sequences and their constituent sequencing reads. The resulting program, HMM-GRASPx, has been shown to have significantly improved performance (>40% higher recall rate with a similar level of precision rate) over other state-of-the-art counterparts such as RPS-BLAST and HMMER3.
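To illustrate the banded Viterbi idea mentioned in the abstract, the following is a minimal sketch and not the HMM-GRASPx implementation. It scores a protein fragment against a toy profile HMM with match, insert, and delete states, filling only the cells whose sequence index and model column differ by at most `band`. The transition probabilities, the flat insert emission, the emission floor, and the function name `banded_viterbi_score` are all assumptions made for the example; a real profile HMM stores position-specific values, and the published method additionally couples the search with assembly, which is not shown here.

```python
import math

NEG_INF = float("-inf")

# Hypothetical log transition scores shared by every column (an assumption;
# real profile HMMs carry position-specific transition probabilities).
TRANS = {name: math.log(p) for name, p in {
    "MM": 0.90, "MI": 0.05, "MD": 0.05,
    "IM": 0.50, "II": 0.50,
    "DM": 0.50, "DD": 0.50,
}.items()}
INSERT_LOG_EMIT = math.log(0.05)  # assumed flat background for insert states
EMIT_FLOOR = 1e-4                 # assumed probability for unlisted residues


def banded_viterbi_score(seq, match_emit, band):
    """Best global log score of seq against a toy profile HMM.

    match_emit holds one dict per model column (residue -> probability).
    Only cells (i, k) with |i - k| <= band are evaluated; everything
    outside the band is treated as unreachable (-inf).
    """
    n, m = len(seq), len(match_emit)
    # Sparse tables keyed by (sequence index, model column).
    M, I, D = {(0, 0): 0.0}, {}, {}

    def get(table, i, k):
        return table.get((i, k), NEG_INF)

    for i in range(n + 1):
        for k in range(max(0, i - band), min(m, i + band) + 1):
            if i == 0 and k == 0:
                continue  # begin state is already initialised
            if i > 0 and k > 0:
                # Match: consume residue i with model column k.
                e = math.log(match_emit[k - 1].get(seq[i - 1], EMIT_FLOOR))
                M[(i, k)] = e + max(
                    get(M, i - 1, k - 1) + TRANS["MM"],
                    get(I, i - 1, k - 1) + TRANS["IM"],
                    get(D, i - 1, k - 1) + TRANS["DM"],
                )
            if i > 0:
                # Insert: consume residue i without advancing the model.
                I[(i, k)] = INSERT_LOG_EMIT + max(
                    get(M, i - 1, k) + TRANS["MI"],
                    get(I, i - 1, k) + TRANS["II"],
                )
            if k > 0:
                # Delete: skip model column k without consuming a residue.
                D[(i, k)] = max(
                    get(M, i, k - 1) + TRANS["MD"],
                    get(D, i, k - 1) + TRANS["DD"],
                )
    return max(get(M, n, m), get(I, n, m), get(D, n, m))


if __name__ == "__main__":
    toy_model = [{"A": 0.9}, {"C": 0.9}, {"D": 0.9}]  # hypothetical 3-column profile
    for w in (0, 1, 3):
        print(w, banded_viterbi_score("ACD", toy_model, w),
              banded_viterbi_score("AD", toy_model, w))
```

A narrow band reduces the number of evaluated cells from roughly n × m to about n × (2 × band + 1), at the cost of missing alignments that stray further from the diagonal. In this sketch a fragment whose length differs from the model length by more than `band` scores as unreachable.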
Date: 2016
Downloads: (external link)
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004991 (text/html)
https://journals.plos.org/ploscompbiol/article/fil ... 04991&type=printable (application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:plo:pcbi00:1004991
DOI: 10.1371/journal.pcbi.1004991