Heuristic multi-site optimization for protein sequence design using Masked Protein Language Models
Lijuan Wang,
Yuze Wang,
Chen Qiu,
Liwei Xiao,
Xianliang Liu and
Junjie Chen
PLOS Computational Biology, 2026, vol. 22, issue 6, 1-22
Abstract:
Protein sequence design for tailored functional properties is a fundamental task in protein engineering, with critical applications in drug discovery and therapeutic development. Efficiently navigating the combinatorially vast protein sequence space to identify functional variants remains a formidable challenge. Conventional approaches, which rely predominantly on template-based local search or single-residue mutagenesis, are susceptible to local optima and risk destabilizing the native structure. In this study, we introduce ProtHMSO, a heuristic multi-site optimization framework leveraging masked protein language models (ProtLMs) for context-aware sequence exploration. ProtHMSO mimics natural evolutionary mechanisms by employing ProtLM-derived substitution probabilities to guide heuristic searches for synergistic mutations, thereby constraining combinatorial search spaces through evolutionary and biophysical priors. ProtHMSO is further applied to replace the exploration strategies in genetic algorithms (GAs) and Monte Carlo tree search (MCTS) to improve their convergence efficiency. Benchmark experiments demonstrate that protein sequences generated by ProtHMSO exhibit superior functional performance and closer alignment with the natural sequence distribution than those from state-of-the-art methods. These results highlight the strong potential of ProtHMSO, and its compatibility with existing search algorithms, to accelerate functional protein discovery, offering a robust framework for efficient and context-aware exploration of protein sequence space.
Author summary:
Discovering functional new proteins efficiently is difficult because the sequence space is vast, and traditional evolutionary algorithms rely on blind random mutagenesis, which is inefficient and prone to structural destabilization. To address these limitations, we propose ProtHMSO, a heuristic multi-site optimization framework.
Its core concept is to leverage the contextual prediction capabilities of masked protein language models (such as ESM-2) to guide sequence mutagenesis. By predicting amino acid substitutions at specific sites that are consistent with evolutionary laws and biophysical priors, ProtHMSO narrows the exploration scope from the vast combinatorial space to a small number of high-potential candidate sequences, enabling efficient optimization of protein sequences. Furthermore, ProtHMSO is not only a standalone algorithm but also a plug-and-play enhancement module. Integrated into a genetic algorithm (GA), it replaces the random mutation operator with model-guided mutation; integrated into a Monte Carlo tree search (MCTS), it guides the tree expansion process. This allows these classic optimization algorithms to explore less blindly and to achieve faster convergence and better results, demonstrating the broad applicability of the framework for improving computational protein design tools.
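The idea summarized above, language-model-guided multi-site mutation used as the mutation operator of a genetic algorithm, can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_masked_probs` is a hypothetical stand-in for a masked ProtLM such as ESM-2 (a real setup would mask the position and read the model's output distribution), `toy_fitness` is a hypothetical fitness oracle, and the parameters `n_sites` and `top_k` are illustrative assumptions.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_masked_probs(seq, pos):
    """Hypothetical stand-in for a masked ProtLM: P(residue | context) at pos.

    Assumes seq contains only the 20 standard amino acids.
    """
    left = seq[pos - 1] if pos > 0 else "A"
    right = seq[pos + 1] if pos < len(seq) - 1 else "A"
    base = AMINO_ACIDS.index(left) + 3 * AMINO_ACIDS.index(right)
    # Deterministic, context-dependent positive weights, normalized to sum to 1.
    weights = [1.0 + ((base + 7 * i) % 11) for i in range(len(AMINO_ACIDS))]
    total = sum(weights)
    return {aa: w / total for aa, w in zip(AMINO_ACIDS, weights)}


def propose_multi_site(seq, n_sites, probs_fn, rng, top_k=3):
    """Mutate n_sites positions, sampling each from the top_k preferred residues."""
    chars = list(seq)
    for pos in rng.sample(range(len(seq)), n_sites):
        # Condition on the sequence mutated so far, so later sites see earlier changes.
        probs = probs_fn("".join(chars), pos)
        ranked = sorted(
            ((p, aa) for aa, p in probs.items() if aa != chars[pos]),
            reverse=True,
        )[:top_k]
        residues = [aa for _, aa in ranked]
        weights = [p for p, _ in ranked]
        chars[pos] = rng.choices(residues, weights=weights, k=1)[0]
    return "".join(chars)


def toy_fitness(seq):
    """Hypothetical fitness oracle: fraction of hydrophobic residues."""
    return sum(aa in "AILMFVW" for aa in seq) / len(seq)


def guided_ga(seed, fitness_fn, probs_fn, generations=10, pop_size=8,
              n_sites=2, rng_seed=0):
    """Elitist GA whose mutation operator is the model-guided proposal above."""
    rng = random.Random(rng_seed)
    population = [seed]
    for _ in range(generations):
        children = [
            propose_multi_site(rng.choice(population), n_sites, probs_fn, rng)
            for _ in range(pop_size)
        ]
        # Keep the best pop_size unique sequences among parents and children.
        population = sorted(
            set(population + children),
            key=lambda s: (fitness_fn(s), s),
            reverse=True,
        )[:pop_size]
    return population[0]


if __name__ == "__main__":
    seed_seq = "MKTAYIAKQRQISFVKSHFSRQ"
    best = guided_ga(seed_seq, toy_fitness, toy_masked_probs)
    print(best, toy_fitness(best))
```

In this sketch the language model only restricts which substitutions are proposed; selection still depends on the fitness function. Swapping `toy_masked_probs` for a real masked-model call would leave the GA loop unchanged, which is the sense in which such an operator can be treated as plug-and-play.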
Date: 2026
Downloads: (external link)
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1014365 (text/html)
https://journals.plos.org/ploscompbiol/article/fil ... 14365&type=printable (application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:plo:pcbi00:1014365
DOI: 10.1371/journal.pcbi.1014365