Ab-initio amino acid sequence design from protein text description with ProtDAT
Xiao-Yu Guo,
Yi-Fan Li,
Yuan Liu,
Xiaoyong Pan and
Hong-Bin Shen
Additional contact information
Xiao-Yu Guo: Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China
Yi-Fan Li: Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China
Yuan Liu: Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China
Xiaoyong Pan: Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China
Hong-Bin Shen: Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China
Nature Communications, 2025, vol. 16, issue 1, 1-14
Abstract:
Protein design has become a critical method with significant potential for applications such as drug development and enzyme engineering. However, protein design methods that rely solely on pretraining and fine-tuning large language models struggle to capture relationships in multi-modal protein data. To address this, we propose ProtDAT, a de novo fine-grained multi-modal data interaction framework capable of designing proteins from descriptive protein text input. ProtDAT builds upon the inherent characteristics of protein data to unify sequences and text as a cohesive whole rather than as separate entities. It leverages a novel Multi-modal Cross-attention mechanism that integrates protein sequences and textual information seamlessly at a foundational level. Evaluation metrics such as pLDDT, TM-score and RMSD are used to assess the structural plausibility, functionality, structural similarity, and validity of protein sequences. Experiments on 20,000 text-sequence pairs from Swiss-Prot show that ProtDAT outperforms the best competing method in the experiments, with a 23.34% increase in pLDDT, a 76.45% increase in TM-score, and a 24.41% reduction in RMSD.
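As a rough illustration of two quantities named in the abstract, the sketch below computes an RMSD between two coordinate sets and a relative percentage change between two scores. It is not the authors' evaluation pipeline, which this record does not describe: the function names `rmsd` and `relative_change_percent` are hypothetical, and the RMSD here assumes the two structures are already superposed and have matching atom order.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally sized lists of (x, y, z) points.

    Assumption: the structures are already superposed; no alignment is done here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Accumulate the squared Euclidean distance for each matched pair.
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


def relative_change_percent(baseline, new):
    """Signed change of `new` relative to `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline * 100.0
```

Under this convention, a positive value is an increase over the baseline (as reported for pLDDT and TM-score) and a negative value is a reduction (as reported for RMSD).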
Date: 2025
Downloads:
https://www.nature.com/articles/s41467-025-65562-w Abstract (text/html)
Persistent link: https://EconPapers.repec.org/RePEc:nat:natcom:v:16:y:2025:i:1:d:10.1038_s41467-025-65562-w
Ordering information: This journal article can be ordered from
https://www.nature.com/ncomms/
DOI: 10.1038/s41467-025-65562-w
Nature Communications is currently edited by Nathalie Le Bot, Enda Bergin and Fiona Gillespie
Bibliographic data for series maintained by Sonal Shukla and Springer Nature Abstracting and Indexing.