EconPapers
Economics at your fingertips

LASSO and Elastic Net Tend to Over-Select Features

Lu Liu, Junheng Gao, Georgia Beasley and Sin-Ho Jung
Additional contact information
Lu Liu: Department of Biostatistics and Bioinformatics, Duke University, Durham, NC 27708, USA
Junheng Gao: Department of Biostatistics and Bioinformatics, Duke University, Durham, NC 27708, USA
Georgia Beasley: Department of Surgery, Duke University Medical Center, Durham, NC 27710, USA
Sin-Ho Jung: Department of Biostatistics and Bioinformatics, Duke University, Durham, NC 27708, USA

Mathematics, 2023, vol. 11, issue 17, 1-16

Abstract: Machine learning methods are a standard approach to selecting features associated with an outcome and building a prediction model when the number of candidate features is large. LASSO is one of the most popular approaches to this end. LASSO overcomes the high dimensionality of the candidate feature set by imposing an L1-norm penalty, so it selects features with large regression estimates rather than features that are statistically significantly associated with the outcome. As a result, LASSO may select insignificant features while missing significant ones. Furthermore, in our experience, LASSO tends to select too many features. Selecting features that are not associated with the outcome increases the cost of collecting and managing them when the fitted prediction model is used in the future. By combining L1- and L2-norm penalties, elastic net (EN) tends to select even more features than LASSO. The over-selected features that are not associated with the outcome act like white noise, so the fitted prediction model may lose prediction accuracy. In this paper, we propose using standard regression methods, without any penalty, combined with a stepwise variable selection procedure to overcome these issues. Unlike LASSO and EN, this method selects features based on statistical significance. Through extensive simulations, we show that this maximum likelihood estimation-based method selects a very small number of features while maintaining high prediction power, whereas LASSO and EN make a large number of false selections that result in a loss of prediction accuracy. Moreover, regression combined with stepwise variable selection is a standard statistical method, so any biostatistician can use it to analyze high-dimensional data, even without advanced bioinformatics knowledge.
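The contrast the abstract draws can be illustrated with a small simulation. The sketch below is not the paper's actual simulation design (which is not reproduced on this page); it is a minimal, hypothetical example that fits a cross-validated LASSO and a forward stepwise procedure that admits features by p-value on the same synthetic data, then compares how many features each selects. The Bonferroni-style alpha-to-enter (0.05 divided by the number of candidates) is an assumption chosen here to keep the stepwise search from chasing the minimum of many null p-values, not a choice taken from the paper.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p, k = 200, 100, 5          # n samples, p candidate features, k truly active
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:k] = 1.0                 # features 0..4 are the only true signals
y = X @ beta + rng.standard_normal(n)

# LASSO with a cross-validated penalty: count nonzero coefficients.
lasso = LassoCV(cv=5).fit(X, y)
n_lasso = int(np.sum(lasso.coef_ != 0))

# Forward stepwise OLS selection by statistical significance.
# Assumption: Bonferroni-adjusted alpha-to-enter of 0.05 / p.
alpha_enter = 0.05 / p
selected, remaining = [], list(range(p))
while remaining:
    pvals = []
    for j in remaining:
        cols = selected + [j]
        Xs = np.column_stack([np.ones(n), X[:, cols]])
        coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        resid = y - Xs @ coef
        dof = n - Xs.shape[1]
        sigma2 = resid @ resid / dof
        cov = sigma2 * np.linalg.inv(Xs.T @ Xs)
        t = coef[-1] / np.sqrt(cov[-1, -1])     # t-stat of the new feature
        pvals.append(2 * stats.t.sf(abs(t), dof))
    jbest = int(np.argmin(pvals))
    if pvals[jbest] >= alpha_enter:             # no candidate is significant: stop
        break
    selected.append(remaining.pop(jbest))

print(f"LASSO selected {n_lasso} features; stepwise selected {len(selected)}")
```

With a strong signal like this, the stepwise procedure recovers the truly active features and stops soon after, while the CV-tuned LASSO typically retains a larger support that includes noise features, which is the over-selection behavior the abstract describes.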

Keywords: logistic regression; machine learning; prediction model; ROC curve; variable selection
JEL-codes: C
Date: 2023

Downloads: (external link)
https://www.mdpi.com/2227-7390/11/17/3738/pdf (application/pdf)
https://www.mdpi.com/2227-7390/11/17/3738/ (text/html)

Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.


Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:11:y:2023:i:17:p:3738-:d:1229461


Mathematics is currently edited by Ms. Emma He

More articles in Mathematics from MDPI
Bibliographic data for series maintained by MDPI Indexing Manager.

Page updated 2025-03-19
Handle: RePEc:gam:jmathe:v:11:y:2023:i:17:p:3738-:d:1229461