Variable selection techniques after multiple imputation in high-dimensional data
Faisal Maqbool Zahid,
Shahla Faisal and
Christian Heumann
Additional contact information
Faisal Maqbool Zahid: Government College University Faisalabad
Shahla Faisal: Government College University Faisalabad
Christian Heumann: Ludwig-Maximilians-University Munich
Statistical Methods & Applications, 2020, vol. 29, issue 3, No 6, 553-580
Abstract:
High-dimensional data arise in diverse fields of scientific research, and missing values are often encountered in such data. Variable selection plays a key role in high-dimensional data analysis. Like many other statistical techniques, variable selection requires complete cases without any missing values. A variety of variable selection techniques is available for complete data, but comparable techniques for data with missing values are scarce in the literature. Multiple imputation is a popular approach for handling missing values and obtaining completed data. If a variable selection technique is applied independently to each of the multiply imputed datasets, a different model may result for each dataset. It remains unclear in the literature how to implement variable selection techniques on multiply imputed data. In this paper, we propose to select each candidate predictor on the basis of the magnitude of its parameter estimates across all the imputed datasets. A constraint is imposed on the sum of absolute values of these estimates to select or remove the predictor from the model. The proposed method for identifying the informative predictors is compared with other approaches in an extensive simulation study. Performance is compared on the basis of hit rates (the proportion of informative predictors correctly identified) and false alarm rates (the proportion of non-informative predictors flagged as informative) for different numbers of imputed datasets. The proposed technique is simple to implement and performs as well in the high-dimensional case as in low-dimensional settings. It is observed to be a good competitor to the existing approaches across different simulation settings. The performance of the different variable selection techniques is also examined on a real dataset with missing values.
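The selection idea and the two evaluation measures described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not give the exact form of the constraint, so the plain cutoff `threshold`, the function names, and the made-up coefficient values below are all assumptions. The sketch also assumes that the coefficient estimates for each imputed dataset have already been obtained (for example from a penalized fit on standardized predictors).

```python
from typing import List, Sequence, Set


def select_by_abs_sum(coefs: Sequence[Sequence[float]], threshold: float) -> List[int]:
    """Keep predictor j if the sum of |estimate| over all imputed datasets
    exceeds `threshold`. The cutoff rule is an assumption for illustration;
    the abstract only says a constraint is placed on this sum."""
    n_predictors = len(coefs[0])
    abs_sums = [sum(abs(row[j]) for row in coefs) for j in range(n_predictors)]
    return [j for j, s in enumerate(abs_sums) if s > threshold]


def hit_rate(selected: Set[int], informative: Set[int]) -> float:
    """Proportion of informative predictors that were selected."""
    return len(selected & informative) / len(informative)


def false_alarm_rate(selected: Set[int], informative: Set[int], n_predictors: int) -> float:
    """Proportion of non-informative predictors that were selected."""
    noninformative = set(range(n_predictors)) - informative
    return len(selected & noninformative) / len(noninformative)


# Hypothetical estimates: one row per imputed dataset, one column per predictor.
coefs = [
    [1.2, 0.0, 0.4, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [1.1, 0.0, 0.3, 0.0],
]
selected = select_by_abs_sum(coefs, threshold=0.5)

# Hypothetical ground truth, as would be known in a simulation study.
informative = {0, 1}
print(selected)
print(hit_rate(set(selected), informative))
print(false_alarm_rate(set(selected), informative, n_predictors=4))
```

Because the rule pools evidence across all imputed datasets before deciding, each predictor receives a single in-or-out decision, rather than a possibly different decision per imputed dataset.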
Keywords: High-dimensional data; Multiple imputation; LASSO; Rubin’s rules; Variable selection
Date: 2020
Downloads: (external link)
http://link.springer.com/10.1007/s10260-019-00493-7 Abstract (text/html)
Access to the full text of the articles in this series is restricted.
Persistent link: https://EconPapers.repec.org/RePEc:spr:stmapp:v:29:y:2020:i:3:d:10.1007_s10260-019-00493-7
Ordering information: This journal article can be ordered from
http://www.springer. ... cs/journal/10260/PS2
DOI: 10.1007/s10260-019-00493-7
Statistical Methods & Applications is currently edited by Tommaso Proietti
More articles in Statistical Methods & Applications from Springer, Società Italiana di Statistica
Bibliographic data for series maintained by Sonal Shukla and Springer Nature Abstracting and Indexing.