A systematic analysis of regression models for protein engineering

Richard Michael, Jacob Kæstel-Hansen, Peter Mørch Groth, Simon Bartels, Jesper Salomon, Pengfei Tian, Nikos S Hatzakis and Wouter Boomsma

PLOS Computational Biology, 2024, vol. 20, issue 5, 1-22

Abstract: Optimizing proteins for particular traits holds great promise for industrial and pharmaceutical purposes. Machine learning is increasingly applied in this field to predict properties of proteins, thereby guiding the experimental optimization process. A natural question is: How much progress are we making with such predictions, and how important is the choice of regressor and representation? In this paper, we demonstrate that different assessment criteria for regressor performance can lead to dramatically different conclusions, depending on the choice of metric and on how one defines generalization. We highlight the fundamental issues of sample bias in typical regression scenarios and how this can lead to misleading conclusions about regressor performance. Finally, we make the case for the importance of calibrated uncertainty in this domain.

Author summary: Supervised machine learning is increasingly used to predict the function and properties of proteins. The performance obtained with these methods relies on a multitude of factors, including how data are represented, how observations are distributed, how training is conducted, and how performance is measured. In this paper, we systematically assess the importance of these different components in a protein regression pipeline. We discuss the benefits of using representations extracted from protein language models, the impact of the choice of regression algorithm, and the role of uncertainty. Finally, to avoid misleading performance claims, we stress the need for carefully aligning the train/test setup to reflect the setting in which the prediction algorithm will ultimately be applied.

Date: 2024

Downloads: (external link)
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1012061 (text/html)
https://journals.plos.org/ploscompbiol/article/fil ... 12061&type=printable (application/pdf)

Persistent link: https://EconPapers.repec.org/RePEc:plo:pcbi00:1012061

DOI: 10.1371/journal.pcbi.1012061

Publisher: Public Library of Science

Page updated 2025-05-03
Handle: RePEc:plo:pcbi00:1012061