Constructing benchmark test sets for biological sequence analysis using independent set algorithms

Petti, Samantha; Eddy, Sean R

PLOS Computational Biology, 2022, vol. 18, issue 3, 1-14

Abstract: Biological sequence families contain many sequences that are very similar to each other because they are related by evolution, so the strategy for splitting data into separate training and test sets is a nontrivial choice in benchmarking sequence analysis methods. A random split is insufficient because it will yield test sequences that are closely related or even identical to training sequences. Adapting ideas from independent set graph algorithms, we describe two new methods for splitting sequence data into dissimilar training and test sets. These algorithms take a sequence family as input and produce a split in which each test sequence is less than p% identical to any individual training sequence. These algorithms successfully split more families than a previous approach, enabling construction of more diverse benchmark datasets.

Author summary: Typically, machine learning and statistical inference models are trained on a “training” dataset and evaluated on a separate “test” set. This ensures that the reported performance accurately reflects how well the method would do on previously unseen data. Biological sequences (such as protein or RNA) within a particular family are related by evolution and therefore may be very similar to each other. In this case, the standard approach of randomly splitting the data into training and test sets could yield test sequences that are nearly identical to some sequence in the training set, and the resulting benchmark may overstate the model’s performance. This motivates the design of strategies for dividing sequence families into dissimilar training and test sets. To this end, we used ideas from computer science involving graph algorithms to design two new methods for splitting sequence data into dissimilar training and test sets. These algorithms can successfully produce dissimilar training and test sets for more protein families than a previous approach, allowing us to include more families in benchmark datasets for biological sequence analysis tasks.
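
The splitting constraint described above (every test sequence less than p% identical to every training sequence) can be illustrated with a small sketch. This is not either of the authors' two algorithms, which this record does not specify; it is a simple randomized greedy heuristic written only to make the constraint concrete. It assumes the sequences are already aligned to equal length and measures identity as the fraction of matching columns. The names `percent_identity`, `similarity_graph`, `greedy_split`, `test_fraction`, and `seed` are all hypothetical.

```python
import random
from itertools import combinations


def percent_identity(a, b):
    """Percent of aligned columns in which a and b carry the same character.

    Assumption: a and b are pre-aligned, non-empty strings of equal length.
    """
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def similarity_graph(seqs, p):
    """Link two sequences by an edge when they are at least p% identical."""
    neighbors = {i: set() for i in range(len(seqs))}
    for i, j in combinations(range(len(seqs)), 2):
        if percent_identity(seqs[i], seqs[j]) >= p:
            neighbors[i].add(j)
            neighbors[j].add(i)
    return neighbors


def greedy_split(seqs, p, test_fraction=0.3, seed=0):
    """Return (train, test) index sets with no edge running between them.

    Sequences are visited in random order. A sequence may join a set only
    if it has no neighbor in the other set; a sequence with neighbors in
    both sets is discarded.
    """
    rng = random.Random(seed)
    neighbors = similarity_graph(seqs, p)
    order = list(range(len(seqs)))
    rng.shuffle(order)
    train, test = set(), set()
    for v in order:
        can_train = not (neighbors[v] & test)
        can_test = not (neighbors[v] & train)
        prefer_test = rng.random() < test_fraction
        if prefer_test and can_test:
            test.add(v)
        elif can_train:
            train.add(v)
        elif can_test:
            test.add(v)
        # Otherwise v is similar to members of both sets and is dropped.
    return train, test


if __name__ == "__main__":
    family = ["AAAAAAAAAA", "AAAAAAAAAC", "CCCCCCCCCC", "CCCCCCCCCG", "GGGGGGGGGG"]
    train, test = greedy_split(family, p=50)
    print("train:", sorted(train), "test:", sorted(test))
```

Because a sequence enters a set only when it has no neighbor in the other set, the returned sets satisfy the dissimilarity constraint by construction. The cost is that some sequences may be discarded, and the sizes of the two sets depend on the random order.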

Date: 2022

Downloads: (external link)
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009492 (text/html)
https://journals.plos.org/ploscompbiol/article/fil ... 09492&type=printable (application/pdf)

Persistent link: https://EconPapers.repec.org/RePEc:plo:pcbi00:1009492

DOI: 10.1371/journal.pcbi.1009492
