DUBStepR is a scalable correlation-based feature selection method for accurately clustering single-cell data

Bobby Ranjan, Wenjie Sun, Jinyu Park, Kunal Mishra, Florian Schmidt, Ronald Xie, Fatemeh Alipour, Vipul Singhal, Ignasius Joanito, Mohammad Amin Honardoost, Jacy Mei Yun Yong, Ee Tzun Koh, Khai Pang Leong, Nirmala Arul Rayan, Michelle Gek Liang Lim and Shyam Prabhakar
Additional contact information
Bobby Ranjan: Genome Institute of Singapore
Wenjie Sun: Genome Institute of Singapore
Jinyu Park: Genome Institute of Singapore
Kunal Mishra: Genome Institute of Singapore
Florian Schmidt: Genome Institute of Singapore
Ronald Xie: Genome Institute of Singapore
Fatemeh Alipour: Genome Institute of Singapore
Vipul Singhal: Genome Institute of Singapore
Ignasius Joanito: Genome Institute of Singapore
Mohammad Amin Honardoost: Genome Institute of Singapore
Jacy Mei Yun Yong: Tan Tock Seng Hospital
Ee Tzun Koh: Tan Tock Seng Hospital
Khai Pang Leong: Tan Tock Seng Hospital
Nirmala Arul Rayan: Genome Institute of Singapore
Michelle Gek Liang Lim: Genome Institute of Singapore
Shyam Prabhakar: Genome Institute of Singapore

Nature Communications, 2021, vol. 12, issue 1, 1-12

Abstract: Feature selection (marker gene selection) is widely believed to improve clustering accuracy, and is thus a key component of single cell clustering pipelines. Existing feature selection methods perform inconsistently across datasets, occasionally even resulting in poorer clustering accuracy than without feature selection. Moreover, existing methods ignore information contained in gene-gene correlations. Here, we introduce DUBStepR (Determining the Underlying Basis using Stepwise Regression), a feature selection algorithm that leverages gene-gene correlations with a novel measure of inhomogeneity in feature space, termed the Density Index (DI). Despite selecting a relatively small number of genes, DUBStepR substantially outperformed existing single-cell feature selection methods across diverse clustering benchmarks. Additionally, DUBStepR was the only method to robustly deconvolve T and NK heterogeneity by identifying disease-associated common and rare cell types and subtypes in PBMCs from rheumatoid arthritis patients. DUBStepR is scalable to over a million cells, and can be straightforwardly applied to other data types such as single-cell ATAC-seq. We propose DUBStepR as a general-purpose feature selection solution for accurately clustering single-cell data.
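
Illustrative note: the abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of using gene-gene correlations to rank candidate marker genes. It is not the DUBStepR method: the stepwise regression and Density Index steps are not reproduced, and the function names (`make_toy_data`, `rank_genes_by_correlation`), the scoring rule (largest absolute correlation with any other gene), and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def make_toy_data(n_cells=200, n_markers=5, n_noise=15, seed=0):
    """Hypothetical toy data: a cells x genes matrix with two cell groups.

    The first n_markers genes are shifted upward in the first half of the
    cells, so they co-vary with each other; the remaining genes are noise.
    """
    rng = np.random.default_rng(seed)
    in_first_group = np.arange(n_cells) < n_cells // 2
    expr = rng.normal(size=(n_cells, n_markers + n_noise))
    expr[in_first_group, :n_markers] += 3.0
    return expr


def rank_genes_by_correlation(expr, top_k):
    """Score each gene by its largest absolute correlation with another gene.

    This scoring rule is an assumption chosen for simplicity, not the
    published DUBStepR score. Zero-variance genes receive a score of 0.
    """
    n_genes = expr.shape[1]
    keep = expr.std(axis=0) > 0              # skip constant genes (undefined correlation)
    corr = np.zeros((n_genes, n_genes))
    corr[np.ix_(keep, keep)] = np.corrcoef(expr[:, keep], rowvar=False)
    np.fill_diagonal(corr, 0.0)              # ignore each gene's self-correlation
    score = np.abs(corr).max(axis=1)
    selected = np.argsort(score)[::-1][:top_k]   # indices of the top_k highest scores
    return selected, score


expr = make_toy_data()
selected, score = rank_genes_by_correlation(expr, top_k=5)
print("selected gene indices:", sorted(selected.tolist()))
```

In this toy setting the co-varying marker genes score well above the independent noise genes, which is the intuition behind using correlation structure for feature selection; real single-cell data would additionally need normalization and a far more careful scoring scheme.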

Date: 2021

Downloads: (external link)
https://www.nature.com/articles/s41467-021-26085-2 Abstract (text/html)

Persistent link: https://EconPapers.repec.org/RePEc:nat:natcom:v:12:y:2021:i:1:d:10.1038_s41467-021-26085-2

Ordering information: This journal article can be ordered from
https://www.nature.com/ncomms/

DOI: 10.1038/s41467-021-26085-2

Nature Communications is currently edited by Nathalie Le Bot, Enda Bergin and Fiona Gillespie

Bibliographic data for series maintained by Sonal Shukla and Springer Nature Abstracting and Indexing.

 
Page updated 2025-03-19
Handle: RePEc:nat:natcom:v:12:y:2021:i:1:d:10.1038_s41467-021-26085-2