rasbhari: Optimizing Spaced Seeds for Database Searching, Read Mapping and Alignment-Free Sequence Comparison

Hahn, Lars; Leimeister, Chris-André; Ounit, Rachid; Lonardi, Stefano; Morgenstern, Burkhard

rasbhari: Optimizing Spaced Seeds for Database Searching, Read Mapping and Alignment-Free Sequence Comparison

Lars Hahn, Chris-André Leimeister, Rachid Ounit, Stefano Lonardi and Burkhard Morgenstern

PLOS Computational Biology, 2016, vol. 12, issue 10, 1-18

Abstract: Many algorithms for sequence analysis rely on word matching or word statistics. Often, these approaches can be improved if binary patterns representing match and don’t-care positions are used as a filter, such that only those positions of words are considered that correspond to the match positions of the patterns. The performance of these approaches, however, depends on the underlying patterns. Herein, we show that the overlap complexity of a pattern set that was introduced by Ilie and Ilie is closely related to the variance of the number of matches between two evolutionarily related sequences with respect to this pattern set. We propose a modified hill-climbing algorithm to optimize pattern sets for database searching, read mapping and alignment-free sequence comparison of nucleic-acid sequences; our implementation of this algorithm is called rasbhari. Depending on the application at hand, rasbhari can either minimize the overlap complexity of pattern sets, maximize their sensitivity in database searching or minimize the variance of the number of pattern-based matches in alignment-free sequence comparison. We show that, for database searching, rasbhari generates pattern sets with slightly higher sensitivity than existing approaches. In our Spaced Words approach to alignment-free sequence comparison, pattern sets calculated with rasbhari led to more accurate estimates of phylogenetic distances than the randomly generated pattern sets that we previously used. Finally, we used rasbhari to generate patterns for short read classification with CLARK-S. Here too, the sensitivity of the results could be improved, compared to the default patterns of the program. We integrated rasbhari into Spaced Words; the source code of rasbhari is freely available at http://rasbhari.gobics.de/Author Summary: We propose a fast algorithm to generate spaced seeds for database searching, read mapping and alignment-free sequence comparison. Spaced seeds—i.e. patterns of match and don’t-care positions—are used by many algorithms for sequence analysis; designing optimal seeds is therefore an active field of research. In sequence-database searching, one wants to optimize sensitivity, i.e. the probability of finding a region of homology; this can be done by minimizing the so-called overlap complexity of pattern sets. In alignment-free DNA sequence comparison, the number N of pattern-based matches is used to estimate phylogenetic distances. Here, one wants to minimize the variance of N in order to obtain stable phylogenies. We show that for spaced seeds, the overlap complexity—and therefore the sensitivity in database searching—is closely related to the variance of N. Our algorithm can optimize the sensitivity, overlap complexity or the variance of N, depending on the application at hand.

Date: 2016
References: View references in EconPapers View complete reference list from CitEc
Citations:

Downloads: (external link)
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005107 (text/html)
https://journals.plos.org/ploscompbiol/article/fil ... 05107&type=printable (application/pdf)

Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.

Export reference: BibTeX RIS (EndNote, ProCite, RefMan) HTML/Text

Persistent link: https://EconPapers.repec.org/RePEc:plo:pcbi00:1005107

DOI: 10.1371/journal.pcbi.1005107

Access Statistics for this article

More articles in PLOS Computational Biology from Public Library of Science
Bibliographic data for series maintained by ploscompbiol ().