Discovery of CRISPR-Cas12a clades using a large language model
Yuanyuan Feng,
Junchao Shi,
Zhanwei Li,
Yongqian Li,
Jiaxi Yang,
Shisheng Huang,
Jinfang Zheng,
Wei Han,
Yunbo Qiao,
Jun Zhang,
Qi Liu,
Yao Yang,
Chunyi Hu,
Lina Wu,
Xiaokang Zhang,
Jin Tang,
Xingxu Huang and
Peixiang Ma
Additional contact information
Yuanyuan Feng: Zhejiang Lab
Junchao Shi: Zhejiang Lab
Zhanwei Li: Zhejiang Lab
Yongqian Li: Zhejiang Lab
Jiaxi Yang: Zhejiang Lab
Shisheng Huang: Zhejiang Lab
Jinfang Zheng: Zhejiang Lab
Wei Han: Zhejiang Lab
Yunbo Qiao: Shanghai Jiao Tong University School of Medicine
Jun Zhang: Nanjing Medical University
Qi Liu: Tongji University
Yao Yang: Zhejiang Lab
Chunyi Hu: National University of Singapore
Lina Wu: Nanjing Normal University
Xiaokang Zhang: Chinese Academy of Sciences
Jin Tang: Zhejiang Lab
Xingxu Huang: Zhejiang Lab
Peixiang Ma: Shanghai Jiao Tong University School of Medicine
Nature Communications, 2025, vol. 16, issue 1, 1-17
Abstract:
CRISPR-Cas systems revolutionize life science. Metagenomes contain millions of unknown Cas proteins. Traditional mining relies on protein sequence alignments. In this work, we employ an evolutionary scale language model (ESM) to learn the information beyond sequences. Trained with CRISPR-Cas data, ESM accurately identifies Cas proteins without alignment. Limited experimental data restricts feature prediction, but integrating with machine learning enables trans-cleavage activity prediction of uncharacterized Cas12a. We discover 7 undocumented Cas12a subtypes with unique CRISPR loci. Structural analyses reveal 8 subtypes of Cas1, Cas2, and Cas4. Cas12a subtypes display distinct 3D-folds. CryoEM analyses unveil unique RNA interactions with the uncharacterized Cas12a. These proteins show distinct double-strand and single-strand DNA cleavage preferences and broad PAM recognition. Finally, we establish a specific detection strategy for the oncogene SNP without traditional Cas12a PAM. This study highlights the potential of language models in exploring undocumented Cas protein function via gene cluster classification.
Date: 2025
Downloads: (external link)
https://www.nature.com/articles/s41467-025-63160-4 Abstract (text/html)
Persistent link: https://EconPapers.repec.org/RePEc:nat:natcom:v:16:y:2025:i:1:d:10.1038_s41467-025-63160-4
Ordering information: This journal article can be ordered from
https://www.nature.com/ncomms/
DOI: 10.1038/s41467-025-63160-4
Nature Communications is currently edited by Nathalie Le Bot, Enda Bergin and Fiona Gillespie
Bibliographic data for series maintained by Sonal Shukla and Springer Nature Abstracting and Indexing.