Random access and semantic search in DNA data storage enabled by Cas9 and machine-guided design

Imburgia, Carina; Organick, Lee; Zhang, Karen; Cardozo, Nicolas; McBride, Jeff; Bee, Callista; Wilde, Delaney; Roote, Gwendolin; Jorgensen, Sophia; Ward, David; Anderson, Charlie; Strauss, Karin; Ceze, Luis; Nivala, Jeff

Carina Imburgia, Lee Organick, Karen Zhang, Nicolas Cardozo, Jeff McBride, Callista Bee, Delaney Wilde, Gwendolin Roote, Sophia Jorgensen, David Ward, Charlie Anderson, Karin Strauss, Luis Ceze and Jeff Nivala
Additional contact information
Carina Imburgia: Paul G. Allen School of Computer Science and Engineering
Lee Organick: Paul G. Allen School of Computer Science and Engineering
Karen Zhang: Paul G. Allen School of Computer Science and Engineering
Nicolas Cardozo: Paul G. Allen School of Computer Science and Engineering
Jeff McBride: Paul G. Allen School of Computer Science and Engineering
Callista Bee: Paul G. Allen School of Computer Science and Engineering
Delaney Wilde: Paul G. Allen School of Computer Science and Engineering
Gwendolin Roote: Paul G. Allen School of Computer Science and Engineering
Sophia Jorgensen: Paul G. Allen School of Computer Science and Engineering
David Ward: Paul G. Allen School of Computer Science and Engineering
Charlie Anderson: Paul G. Allen School of Computer Science and Engineering
Karin Strauss: Microsoft Research
Luis Ceze: Paul G. Allen School of Computer Science and Engineering
Jeff Nivala: Paul G. Allen School of Computer Science and Engineering

Nature Communications, 2025, vol. 16, issue 1, 1-11

Abstract: DNA is a promising medium for digital data storage due to its exceptional data density and longevity. Practical DNA-based storage systems require selective data retrieval to minimize decoding time and costs. In this work, we introduce CRISPR-Cas9 as a user-friendly tool for multiplexed, low-latency molecular data extraction. We first present a one-pot, multiplexed random access method in which specific data files are selectively cleaved using a CRISPR-Cas9 addressing system and then sequenced via nanopore technology. This approach was validated on a pool of 1.6 million DNA sequences, comprising 25 unique data files. We then developed a molecular similarity-search approach combining machine learning with Cas9-based retrieval. Using a deep neural network, we mapped a database of 1.74 million images into a reduced-dimensional embedding, encoding each embedding as a Cas9 target sequence. These target sequences act as molecular addresses, capturing clusters of semantically related images. By leveraging Cas9’s off-target cleavage activity, query sequences cleave both exact and closely related targets, enabling high-fidelity retrieval of molecular addresses corresponding to in silico image clusters similar to the query. These approaches move towards addressing key challenges in molecular data retrieval by offering simplified, rapid isothermal protocols and new DNA data access capabilities.
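
Illustrative sketch (not from the article): the similarity-search idea in the abstract can be pictured with a toy in silico model. The Python below is a simplified illustration under stated assumptions, not the authors' pipeline. The quantization scheme in `embedding_to_target`, the 8-base address length, the uniform mismatch threshold `max_mismatches`, and the file names are all hypothetical. Off-target cleavage is approximated here as plain Hamming-distance tolerance, which ignores the position-dependent mismatch sensitivity of real Cas9.

```python
from typing import Dict, List, Sequence

BASES = "ACGT"


def embedding_to_target(embedding: Sequence[float]) -> str:
    """Toy encoding (assumption): quantize each coordinate in [0, 1) to one base."""
    out = []
    for x in embedding:
        idx = max(0, min(int(x * 4), 3))  # clamp to a valid base index
        out.append(BASES[idx])
    return "".join(out)


def hamming(a: str, b: str) -> int:
    """Number of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def retrieve(query: str, pool: Dict[str, str], max_mismatches: int = 1) -> List[str]:
    """Return names whose address lies within max_mismatches of the query.

    This stands in for a query guide cleaving exact and closely related
    targets; the uniform threshold is a simplification.
    """
    return sorted(
        name for name, address in pool.items()
        if hamming(query, address) <= max_mismatches
    )


# Hypothetical reduced-dimensional embeddings for three images.
embeddings = {
    "cat_a": [0.1, 0.3, 0.6, 0.9, 0.1, 0.3, 0.6, 0.9],
    "cat_b": [0.1, 0.3, 0.6, 0.9, 0.1, 0.3, 0.6, 0.6],
    "car":   [0.9, 0.9, 0.1, 0.1, 0.9, 0.9, 0.1, 0.1],
}
pool = {name: embedding_to_target(vec) for name, vec in embeddings.items()}

query = embedding_to_target(embeddings["cat_a"])
hits = retrieve(query, pool, max_mismatches=1)
print(hits)  # ['cat_a', 'cat_b']
```

In this toy model, nearby embeddings map to addresses that differ at few positions, so a single query recovers both the exact match and its near neighbor while leaving the dissimilar entry untouched.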

Date: 2025

Downloads: (external link)
https://www.nature.com/articles/s41467-025-61264-5 Abstract (text/html)

Persistent link: https://EconPapers.repec.org/RePEc:nat:natcom:v:16:y:2025:i:1:d:10.1038_s41467-025-61264-5

Ordering information: This journal article can be ordered from
https://www.nature.com/ncomms/

DOI: 10.1038/s41467-025-61264-5

Nature Communications is currently edited by Nathalie Le Bot, Enda Bergin and Fiona Gillespie
