Development of an experiment-split method for benchmarking the generalization of a PTM site predictor: Lysine methylome as an example
Guoyang Zou,
Yang Zou,
Chenglong Ma,
Jiaojiao Zhao and
Lei Li
PLOS Computational Biology, 2021, vol. 17, issue 12, 1-14
Abstract:
Many computational classifiers have been developed to predict different types of post-translational modification (PTM) sites. Their performance is measured using cross-validation or independent tests, in which experimental data from different sources are mixed and randomly split into training and test sets. However, the self-reported performance of most classifiers under this measure is generally higher than their performance when applied to new experimental data. This suggests that the cross-validation method overestimates the generalization ability of a classifier. Here, we proposed a generalization estimate method, dubbed the experiment-split test, in which the experimental sources for the training set differ from those for the test set, so that the test set simulates data derived from a new experiment. We took the prediction of the lysine methylome (Kme) as an example and developed a deep learning-based Kme site predictor (called DeepKme) with outstanding performance. We assessed the experiment-split test by comparing it with the cross-validation method. We found that the performance measured using the experiment-split test is lower than that measured by cross-validation. As the test data of the experiment-split method were derived from an independent experimental source, this method can reflect the generalization of the predictor. Therefore, we believe that the experiment-split method can be applied to benchmark the practical performance of a given PTM model. DeepKme is freely accessible via https://github.com/guoyangzou/DeepKme.
Author summary:
The performance of a model for predicting post-translational modification sites is commonly evaluated using the cross-validation method, where data derived from different experimental sources are mixed and randomly separated into a training dataset and a validation dataset.
However, the performance measured through cross-validation is generally higher than the performance when the model is applied to new experimental data, indicating that the cross-validation method overestimates the generalization of a model. In this study, we proposed a generalization estimate method, dubbed the experiment-split test, in which the experimental sources for the training set differ from those for the test set, so that the test set simulates data derived from a new experiment. We took the prediction of the lysine methylome as an example and developed a deep learning-based Kme site predictor, DeepKme, with outstanding performance. We found that the performance measured by the experiment-split method is lower than that measured by cross-validation. As the test data of the experiment-split method were derived from an independent experimental source, this method can reflect the generalization of the prediction model. Therefore, the experiment-split method can be applied to benchmark practical prediction performance.
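The splitting strategy described above can be sketched in a few lines: instead of pooling all samples and splitting them at random, whole experimental sources are held out to form the test set, so no source contributes to both sides. This is a minimal illustration with hypothetical data and field names (`source`, `site`, `label`); it is not the authors' actual DeepKme pipeline.

```python
import random

def experiment_split(samples, holdout_frac=0.2, seed=0):
    """Experiment-split test: hold out entire experimental sources,
    so the test set simulates data from a new, unseen experiment.
    A conventional random split would instead mix sources across sets."""
    sources = sorted({s["source"] for s in samples})
    rng = random.Random(seed)
    rng.shuffle(sources)
    n_test = max(1, int(len(sources) * holdout_frac))
    test_sources = set(sources[:n_test])
    train = [s for s in samples if s["source"] not in test_sources]
    test = [s for s in samples if s["source"] in test_sources]
    return train, test

# Toy data: hypothetical Kme sites, each tagged with the experiment it came from.
data = [{"site": f"K{i}", "label": i % 2, "source": f"exp{i % 5}"}
        for i in range(50)]
train, test = experiment_split(data)
# By construction, no experimental source appears in both sets.
assert {s["source"] for s in train}.isdisjoint({s["source"] for s in test})
```

Evaluating a classifier on such a held-out source gives a more pessimistic, and arguably more realistic, estimate of generalization than a random split over the pooled data.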
Date: 2021
Downloads:
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009682 (text/html)
https://journals.plos.org/ploscompbiol/article/fil ... 09682&type=printable (application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:plo:pcbi00:1009682
DOI: 10.1371/journal.pcbi.1009682