Incomplete human reference genomes can drive false sex biases and expose patient-identifying information in metagenomic data

Caitlin Guccione, Lucas Patel, Yoshihiko Tomofuji, Daniel McDonald, Antonio Gonzalez, Gregory D. Sepich-Poore, Kyuto Sonehara, Mohsen Zakeri, Yang Chen, Amanda Hazel Dilmore, Neil Damle, Sergio E. Baranzini, George Hightower, Teruaki Nakatsuji, Richard L. Gallo, Ben Langmead, Yukinori Okada, Kit Curtius and Rob Knight
Additional contact information
Caitlin Guccione: University of California San Diego
Lucas Patel: University of California San Diego
Yoshihiko Tomofuji: the University of Tokyo
Daniel McDonald: University of California San Diego
Antonio Gonzalez: University of California San Diego
Gregory D. Sepich-Poore: University of California San Diego
Kyuto Sonehara: the University of Tokyo
Mohsen Zakeri: Johns Hopkins University
Yang Chen: University of California San Diego
Amanda Hazel Dilmore: University of California San Diego
Neil Damle: University of California San Diego
Sergio E. Baranzini: University of California San Francisco (UCSF)
George Hightower: University of California San Diego
Teruaki Nakatsuji: University of California San Diego
Richard L. Gallo: University of California San Diego
Ben Langmead: Johns Hopkins University
Yukinori Okada: the University of Tokyo
Kit Curtius: University of California San Diego
Rob Knight: University of California San Diego

Nature Communications, 2025, vol. 16, issue 1, 1-14

Abstract: As next-generation sequencing technologies produce deeper genome coverages at lower costs, there is a critical need for reliable computational host DNA removal in metagenomic data. We find that insufficient host filtration using prior human genome references can introduce false sex biases and inadvertently permit flow-through of host-specific DNA during bioinformatic analyses, which could be exploited for individual identification. To address these issues, we introduce and benchmark three host filtration methods of varying throughput, with concomitant applications across low biomass samples such as skin and high microbial biomass datasets including fecal samples. We find that these methods are important for obtaining accurate results in low biomass samples (e.g., tissue, skin). Overall, we demonstrate that rigorous host filtration is a key component of privacy-minded analyses of patient microbiomes and provide computationally efficient pipelines for accomplishing this task on large-scale datasets.

Date: 2025

Downloads: (external link)
https://www.nature.com/articles/s41467-025-56077-5 Abstract (text/html)

Persistent link: https://EconPapers.repec.org/RePEc:nat:natcom:v:16:y:2025:i:1:d:10.1038_s41467-025-56077-5

Ordering information: This journal article can be ordered from
https://www.nature.com/ncomms/

DOI: 10.1038/s41467-025-56077-5

Nature Communications is currently edited by Nathalie Le Bot, Enda Bergin and Fiona Gillespie

Bibliographic data for series maintained by Sonal Shukla and Springer Nature Abstracting and Indexing.

Page updated 2025-03-19
Handle: RePEc:nat:natcom:v:16:y:2025:i:1:d:10.1038_s41467-025-56077-5