Overcoming extrapolation challenges of deep learning by incorporating physics in protein sequence-function modeling

Shrishti Barethiya, Jian Huang, Clarice Stumpf, Xiao Liu, Hui Guan and Jianhan Chen

PLOS Computational Biology, 2026, vol. 22, issue 3, 1-23

Abstract: Understanding the protein sequence-to-function relationship is crucial for studies of genetic diseases, protein evolution, and protein engineering. This relationship is inherently complex due to multi-site, high-dimensional correlations and structural dynamics. Deep learning algorithms such as (graph) convolutional neural networks and, more recently, transformers have become very popular for learning the protein sequence-to-function mapping from deep mutational scanning data and available structures. However, it remains very challenging for these models to extrapolate accurately when predicting the functional effects of variants with positions or mutation types not seen in the training data. We propose that incorporating the physics of protein interactions and dynamics can be an effective approach to overcome these extrapolation limitations. Specifically, we demonstrate that biophysics-based modeling can be used to quantify the energetic effects of mutations and that incorporating these physical energetics directly within convolutional and graph convolutional neural networks can significantly improve positional and mutational extrapolation compared to models without biophysics-inspired features. Our results support the effectiveness of leveraging physical knowledge in overcoming the limitation of data scarcity.

Author summary: Deep learning has fundamentally transformed science and research in recent years. Yet many problems in biophysics and biochemistry remain inaccessible to traditional deep learning due to a lack of large training datasets. Incorporating physical principles in machine learning is arguably required to overcome data scarcity. In this work, we examine the effectiveness of incorporating biophysics-based features in deriving more reliable predictors of the effects of sequence variants on protein function. Our results show that including the energetics of mutational effects on protein stability can significantly improve machine learning models' ability to predict novel mutations not seen in the training data set, especially mutations at novel sequence positions. Additionally incorporating sequence evolutionary information from pre-trained protein large language models could further improve predictive power. Our work thus provides an efficient framework for training better variant effect predictors from deep mutational scanning datasets. The resulting predictors can aid protein engineering and the prioritization of genetic variations for study in diseases.
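The abstract distinguishes positional extrapolation (test variants at positions absent from training) from mutational extrapolation (test variants with substitutions absent from training). The following is a minimal sketch of how such train/test splits might be constructed, not the authors' procedure. The function names, the `A24G`-style mutation strings, the choice to discard variants that mix seen and held-out mutations, and the reading of "mutation types" as specific substitutions are all assumptions made for illustration.

```python
from typing import Iterable, List, Set, Tuple

# A variant is a tuple of mutation strings such as ("A24G", "L30P").
# This string convention is an assumption, not taken from the article.
Variant = Tuple[str, ...]


def position_of(mutation: str) -> int:
    """Return the sequence position encoded in a string like 'A24G'."""
    return int(mutation[1:-1])


def _split(keys_per_variant: List[Set], variants: List[Variant],
           held_out: Set) -> Tuple[List[Variant], List[Variant]]:
    """Shared helper: train on variants with no held-out key,
    test on variants made up entirely of held-out keys."""
    train, test = [], []
    for variant, keys in zip(variants, keys_per_variant):
        if keys.isdisjoint(held_out):
            train.append(variant)
        elif keys <= held_out:
            test.append(variant)
        # Variants mixing seen and held-out keys are discarded
        # (an illustrative choice to keep the two sets cleanly separated).
    return train, test


def positional_split(variants: List[Variant],
                     held_out_positions: Iterable[int]):
    """Hypothetical positional extrapolation split: held-out positions
    never appear in any training variant."""
    keys = [{position_of(m) for m in v} for v in variants]
    return _split(keys, variants, set(held_out_positions))


def mutational_split(variants: List[Variant],
                     held_out_mutations: Iterable[str]):
    """Hypothetical mutational extrapolation split: held-out substitutions
    never appear in training, though their positions may."""
    keys = [set(v) for v in variants]
    return _split(keys, variants, set(held_out_mutations))


variants = [("A1G",), ("A1C",), ("L2P",), ("A1G", "L2P"), ("K3R",)]
pos_train, pos_test = positional_split(variants, [2])
mut_train, mut_test = mutational_split(variants, ["A1G"])
```

In the mutational split above, position 1 still appears in training through `A1C`, whereas the positional split removes every variant touching position 2 from training. That difference is consistent with the author summary's remark that novel sequence positions are the case where added physical information matters most.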

Date: 2026

Downloads: (external link)
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1013728 (text/html)
https://journals.plos.org/ploscompbiol/article/fil ... 13728&type=printable (application/pdf)

Persistent link: https://EconPapers.repec.org/RePEc:plo:pcbi00:1013728

DOI: 10.1371/journal.pcbi.1013728

More articles in PLOS Computational Biology from Public Library of Science
Bibliographic data for series maintained by ploscompbiol.

 
Page updated 2026-03-29
Handle: RePEc:plo:pcbi00:1013728