Distribution-free tests for lossless feature selection in classification and regression

Györfi, László; Linder, Tamás; Walk, Harro

László Györfi, Tamás Linder and Harro Walk
Additional contact information
László Györfi: Budapest University of Technology and Economics
Tamás Linder: Queen’s University
Harro Walk: Institut für Stochastik und Anwendungen, Universität Stuttgart

TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, 2025, vol. 34, issue 1, No 10, 262-287

Abstract: We study the problem of lossless feature selection for a $d$-dimensional feature vector $X=(X^{(1)},\dots ,X^{(d)})$ and label $Y$ for binary classification as well as nonparametric regression. For an index set $S\subset \{1,\dots ,d\}$, consider the selected $|S|$-dimensional feature subvector $X_S=(X^{(i)}, i\in S)$. If $L^*$ and $L^*(S)$ stand for the minimum risk based on $X$ and $X_S$, respectively, then $X_S$ is called lossless if $L^*=L^*(S)$. For classification, the minimum risk is the Bayes error probability, while in regression, the minimum risk is the residual variance. We introduce nearest-neighbor-based test statistics to test the hypothesis that $X_S$ is lossless. This test statistic is an estimate of the excess risk $L^*(S)-L^*$. Surprisingly, estimating this excess risk turns out to be a functional estimation problem that does not suffer from the curse of dimensionality in the sense that the convergence rate does not depend on the dimension $d$. For the threshold $a_n=\log n/\sqrt{n}$, the corresponding tests are proved to be consistent under conditions on the distribution of $(X, Y)$ that are significantly milder than in previous work. Also, our threshold is universal (dimension independent), in contrast to earlier methods where for large $d$ the threshold becomes too large to be useful in practice.
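To make the decision rule in the abstract concrete, here is a minimal sketch for the regression case only. The abstract does not state the exact form of the test statistic, so the estimator below is an assumption and not the authors' construction. It uses the identity $L^*(S)-L^* = E[m(X)^2]-E[m_S(X_S)^2]$, where $m$ and $m_S$ denote the regression functions of $Y$ on $X$ and on $X_S$, and estimates each second moment by a first-nearest-neighbor average of $Y_i Y_{NN(i)}$. The hypothesis is accepted when the estimate is at most $a_n=\log n/\sqrt{n}$. All function names and the synthetic data are hypothetical.

```python
import math
import random


def threshold(n):
    """Dimension-independent threshold a_n = log(n) / sqrt(n) from the abstract."""
    return math.log(n) / math.sqrt(n)


def nearest_neighbor_index(xs, i, coords):
    """Index j != i closest to xs[i] in Euclidean distance over `coords` (brute force)."""
    best_j, best_dist = -1, math.inf
    for j, point in enumerate(xs):
        if j == i:
            continue
        dist = sum((xs[i][c] - point[c]) ** 2 for c in coords)
        if dist < best_dist:
            best_j, best_dist = j, dist
    return best_j


def second_moment_1nn(xs, ys, coords):
    """Assumed 1-NN estimate of E[m(X_coords)^2]: the average of Y_i * Y_{NN(i)}."""
    n = len(xs)
    total = sum(ys[i] * ys[nearest_neighbor_index(xs, i, coords)] for i in range(n))
    return total / n


def excess_risk_estimate(xs, ys, subset):
    """Estimate L*(S) - L* as the difference of the two second-moment estimates."""
    all_coords = list(range(len(xs[0])))
    return second_moment_1nn(xs, ys, all_coords) - second_moment_1nn(xs, ys, subset)


def accept_lossless(xs, ys, subset):
    """Accept the hypothesis 'X_S is lossless' if the estimate is at most a_n."""
    return excess_risk_estimate(xs, ys, subset) <= threshold(len(xs))


def make_demo_data(n, seed=0):
    """Hypothetical data: Y depends only on the first of two features."""
    rng = random.Random(seed)
    xs = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]
    ys = [2.0 * x[0] + 0.1 * rng.gauss(0, 1) for x in xs]
    return xs, ys


if __name__ == "__main__":
    xs, ys = make_demo_data(400)
    for subset in ([0], [1]):
        print(subset, excess_risk_estimate(xs, ys, subset), accept_lossless(xs, ys, subset))
```

In this synthetic setup the subset containing only the first feature is lossless by construction, while the subset containing only the second feature has excess risk $E[(2X^{(1)})^2]=4/3$, well above the threshold for $n=400$. The brute-force neighbor search costs $O(n^2)$ distance evaluations and is meant for illustration only.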

Keywords: Classification; Nonparametric regression; Lossless feature selection; Nearest-neighbor estimate; Consistent test; 62G05; 62G10; 62G08
Date: 2025

Downloads: (external link)
http://link.springer.com/10.1007/s11749-024-00958-2 Abstract (text/html)
Access to the full text of the articles in this series is restricted.

Persistent link: https://EconPapers.repec.org/RePEc:spr:testjl:v:34:y:2025:i:1:d:10.1007_s11749-024-00958-2

DOI: 10.1007/s11749-024-00958-2

TEST: An Official Journal of the Spanish Society of Statistics and Operations Research is currently edited by Alfonso Gordaliza and Ana F. Militino