Designing diverse and high-performance proteins with a large language model in the loop

Gomez-Uribe, Carlos A; Gado, Japheth; Islamov, Meiirbek

PLOS Computational Biology, 2025, vol. 21, issue 6, 1-21

Abstract: We present a protein engineering approach to directed evolution with machine learning that integrates a new semi-supervised neural network fitness prediction model, Seq2Fitness, and an innovative optimization algorithm, biphasic annealing for diverse and adaptive sequence sampling (BADASS) to design sequences. Seq2Fitness leverages protein language models to predict fitness landscapes, combining evolutionary data with experimental labels, while BADASS efficiently explores these landscapes by dynamically adjusting temperature and mutation energies to prevent premature convergence and to generate diverse high-fitness sequences. Compared to alternative models, Seq2Fitness improves Spearman correlation with experimental fitness measurements, increasing from 0.34 to 0.55 on sequences containing mutations at positions entirely not seen during training. BADASS requires less memory and computation compared to gradient-based Markov Chain Monte Carlo methods, while generating more high-fitness and diverse sequences across two protein families. For both families, 100% of the top 10,000 sequences identified by BADASS exceed the wildtype in predicted fitness, whereas competing methods range from 3% to 99%, often producing far fewer than 10,000 sequences. BADASS also finds higher-fitness sequences at every cutoff (top 1, 100, and 10,000). Additionally, we provide a theoretical framework explaining BADASS’s underlying mechanism and behavior. While we focus on amino acid sequences, BADASS may generalize to other sequence spaces, such as DNA and RNA.

Author summary: Designing proteins with enhanced properties is essential for many applications, from industrial enzymes to therapeutic molecules. However, traditional protein engineering methods often fail to explore the vast sequence space effectively, partly due to the rarity of high-fitness sequences. In this work, we introduce BADASS, an optimization algorithm that samples sequences from a probability distribution with mutation energies and a temperature parameter that are updated dynamically, alternating between cooling and heating phases, to discover high-fitness proteins while maintaining sequence diversity. This stands in contrast to traditional approaches like simulated annealing, which often converge on fewer and lower fitness solutions, and gradient-based Markov Chain Monte Carlo (MCMC), also converging on lower fitness solutions and at a significantly higher computational and memory cost. Our approach requires only forward model evaluations and no gradient computations, enabling the rapid design of high-performing proteins that can be validated in the lab, especially when combined with our Seq2Fitness models. BADASS represents a significant advancement in computational protein engineering, opening new possibilities for diverse applications. Our code is publicly available at https://github.com/SoluLearn/BADASS.
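
Illustrative sketch: the summary above describes sampling from a distribution defined by mutation energies and a temperature, with alternating cooling and heating and only forward evaluations of a fitness model. The toy Python below shows that general idea and nothing more. It is not the authors' implementation (see their repository for that). The scoring function toy_fitness, the energy update rule, the cooling and heating factors, and every numeric constant are assumptions made up for this example.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_fitness(seq, target):
    """Hypothetical stand-in for a fitness model: fraction of positions matching `target`."""
    return sum(a == b for a, b in zip(seq, target)) / len(target)


def biphasic_sketch(wildtype, target, n_iters=100, batch=40, seed=0):
    """Toy sampler: Boltzmann-weighted mutations with adaptive cooling and heating."""
    rng = random.Random(seed)
    # One energy per single-site mutation (position, amino acid); all start equal.
    energies = {
        (i, aa): 0.0
        for i in range(len(wildtype))
        for aa in AMINO_ACIDS
        if aa != wildtype[i]
    }
    keys = list(energies)
    temperature = 1.0
    wt_fit = toy_fitness(wildtype, target)
    prev_mean = wt_fit
    found = {}

    for _ in range(n_iters):
        # Sampling weights p ~ exp(-E / T); shifting by the minimum keeps exp() stable.
        e_min = min(energies.values())
        weights = [math.exp(-(energies[k] - e_min) / temperature) for k in keys]

        scores = []
        updates = []
        for _ in range(batch):
            # Draw two mutations; a repeated position keeps only the later draw.
            muts = dict(rng.choices(keys, weights=weights, k=2))
            seq = list(wildtype)
            for pos, aa in muts.items():
                seq[pos] = aa
            seq = "".join(seq)
            fit = toy_fitness(seq, target)  # forward evaluation only, no gradients
            found[seq] = fit
            scores.append(fit)
            updates.append((muts, fit))

        # Assumed update rule: lower the energy of mutations seen in fitter sequences.
        for muts, fit in updates:
            for mutation in muts.items():
                energies[mutation] += 0.1 * (wt_fit - fit)

        # Assumed schedule: cool while the batch mean improves, heat when it stalls.
        mean = sum(scores) / batch
        if mean > prev_mean:
            temperature = max(0.05, temperature * 0.9)
        else:
            temperature = min(2.0, temperature * 1.1)
        prev_mean = mean

    return found, wt_fit, temperature


if __name__ == "__main__":
    found, wt_fit, temperature = biphasic_sketch("AAAAAAAAAA", "AACAADAAEA")
    above_wt = sum(f > wt_fit for f in found.values())
    print(f"{len(found)} unique sequences, {above_wt} above wildtype, final T = {temperature:.2f}")
```

In this sketch the heating step is what keeps the sampler from collapsing onto a few mutations once the energies become sharply peaked. How BADASS actually decides when to switch phases is described in the paper, not here.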

Date: 2025

Downloads: (external link)
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1013119 (text/html)
https://journals.plos.org/ploscompbiol/article/fil ... 13119&type=printable (application/pdf)

Persistent link: https://EconPapers.repec.org/RePEc:plo:pcbi00:1013119

DOI: 10.1371/journal.pcbi.1013119
