OneProt: Towards multi-modal protein foundation models via latent space alignment of sequence, structure, binding sites and text encoders

Klemens Flöge, Srisruthi Udayakumar, Johanna Sommer, Marie Piraud, Stefan Kesselheim, Vincent Fortuin, Stephan Günnemann, Karel J van der Weg, Holger Gohlke, Erinc Merdivan and Alina Bazarova

PLOS Computational Biology, 2025, vol. 21, issue 11, 1-27

Abstract: Recent advances in Artificial Intelligence have enabled multi-modal systems to model and translate diverse information spaces. Extending beyond text and vision, we introduce OneProt, a multi-modal Deep Learning model for proteins that integrates structural, sequence, text, and binding site data. Using the ImageBind framework, OneProt aligns the latent spaces of protein modality encoders in a lightweight fine-tuning scheme that focuses on pairwise alignment with sequence data, rather than requiring full matches. This novel approach comprises a mix of Graph Neural Networks and transformer architectures. It demonstrates good performance in retrieval tasks and showcases the efficacy of multi-modal systems in Protein Machine Learning through a broad spectrum of downstream baselines, including enzyme function prediction and binding site analysis. Furthermore, OneProt enables the transfer of representational information from specialized encoders to the sequence encoder, enhancing capabilities for distinguishing evolutionarily related and unrelated sequences and exhibiting representational properties where evolutionarily related proteins align in similar directions within the latent space. In addition, we extensively investigate modality ablations to identify the encoders that contribute the most to predictive performance, highlighting the significance of the binding site encoder, which has not been used in similar models previously. This work expands the horizons of multi-modal protein models, paving the way for transformative applications in drug discovery, biocatalytic reaction planning, and protein engineering.

Author summary: In this study, we introduce OneProt, a novel, versatile Artificial Intelligence system designed for protein analysis. To integrate different types of data (structure, sequence, text, and binding sites), OneProt uses the ImageBind framework, efficiently aligning protein data without needing full matches. Combining Graph Neural Networks and transformer architectures, OneProt excels in tasks like enzyme function prediction and binding site analysis. It enhances the understanding of protein relationships by transferring information between different data types, making it easier to identify related proteins. The OneProt framework stands out for two key features: the ability to incorporate custom modalities during pre-training and a simple fine-tuning process that requires only a Multi-Layer Perceptron projection. Notably, we also show that incorporating multiple modalities can reduce the need for extensive datasets and training, leading to competitive downstream performance. In addition, we conduct an exhaustive ablation study, where we highlight the crucial role of the binding site encoder, which has not been used in similar models before. Overall, OneProt represents a significant step forward in multi-modal protein modeling, with promising applications in drug discovery and protein engineering.
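
Illustrative sketch (not part of the original record): the abstract describes pairwise alignment of each modality with the sequence modality in the style of ImageBind. This page does not state the training objective. ImageBind itself is trained with an InfoNCE contrastive loss, so the sketch below assumes a symmetric InfoNCE loss between sequence embeddings and the embeddings of one other modality. The function names, the temperature value, and the toy vectors are hypothetical and are not taken from the OneProt code.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_info_nce(seq_emb, other_emb, temperature=0.07):
    """Assumed objective: symmetric InfoNCE between paired embeddings.

    seq_emb[i] and other_emb[i] are embeddings of the same protein from the
    sequence encoder and from one other modality encoder (for example
    structure). All other pairs in the batch act as negatives.
    """
    seq = [l2_normalize(v) for v in seq_emb]
    other = [l2_normalize(v) for v in other_emb]
    # Cosine similarity matrix scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(s, o)) / temperature for o in other]
        for s in seq
    ]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the sequence-to-modality and modality-to-sequence directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Toy batch of two proteins with 2-dimensional embeddings.
sequence = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]      # pairs point in the same direction
misaligned = [[0.0, 1.0], [1.0, 0.0]]   # pairs are swapped

loss_aligned = symmetric_info_nce(sequence, aligned)
loss_misaligned = symmetric_info_nce(sequence, misaligned)
print(loss_aligned < loss_misaligned)
```

Because every modality is paired only with the sequence modality, a loss of this kind needs (sequence, modality) pairs rather than proteins annotated in all modalities at once, which is one reading of "rather than requiring full matches" in the abstract.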

Date: 2025

Downloads: (external link)
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1013679 (text/html)
https://journals.plos.org/ploscompbiol/article/fil ... 13679&type=printable (application/pdf)

Persistent link: https://EconPapers.repec.org/RePEc:plo:pcbi00:1013679

DOI: 10.1371/journal.pcbi.1013679

More articles in PLOS Computational Biology from Public Library of Science
 
Page updated 2025-11-29
Handle: RePEc:plo:pcbi00:1013679