Evaluating Imputation Methods to Improve Prediction Accuracy for an HIV Study in Uganda
Nadia B. Mendoza,
Chii-Dean Lin,
Susan M. Kiene,
Nicolas A. Menzies,
Rhoda K. Wanyenze,
Katherine A. Schmarje,
Rose Naigino,
Michael Ediau,
Seth C. Kalichman and
Barbara A. Bailey
Additional contact information
Nadia B. Mendoza: Department of Mathematics and Statistics, San Diego State University, San Diego, CA 92182, USA
Chii-Dean Lin: Department of Mathematics and Statistics, San Diego State University, San Diego, CA 92182, USA
Susan M. Kiene: Department of Disease Control and Environmental Health, Makerere University School of Public Health, Kampala P.O. Box 7072, Uganda
Nicolas A. Menzies: Department of Global Health and Population, Harvard T. H. Chan School of Public Health, Boston, MA 02115, USA
Rhoda K. Wanyenze: Department of Disease Control and Environmental Health, Makerere University School of Public Health, Kampala P.O. Box 7072, Uganda
Katherine A. Schmarje: Division of Epidemiology and Biostatistics, San Diego State University School of Public Health, San Diego, CA 92182, USA
Rose Naigino: Department of Disease Control and Environmental Health, Makerere University School of Public Health, Kampala P.O. Box 7072, Uganda
Michael Ediau: Division of Epidemiology and Biostatistics, San Diego State University School of Public Health, San Diego, CA 92182, USA
Seth C. Kalichman: Institute for Collaboration on Health, Intervention and Policy, University of Connecticut, Storrs, CT 06269, USA
Barbara A. Bailey: Department of Mathematics and Statistics, San Diego State University, San Diego, CA 92182, USA
Stats, 2024, vol. 7, issue 4, 1-16
Abstract:
Standard statistical analyses often exclude incomplete observations, which can be particularly problematic when predicting rare outcomes, such as HIV positivity. In the linkage to HIV care dataset, there were initially 553 complete HIV-positive cases, with an additional 554 cases added through imputation. The imputation methods amelia, hmisc, mice, and missForest were evaluated. Simulations were conducted across various scenarios using the complete data to guide imputation for the full dataset. A random forest model was used to predict HIV status, and the methods were assessed on imputation precision, overall prediction accuracy, and sensitivity. While missForest produced imputed values closer to the observed ones, this did not translate into better predictive models. Imputation with hmisc and mice led to higher prediction accuracy and sensitivity, with median accuracy increasing from 64% to 76% and median sensitivity rising from 0.4 to 0.75. The fastest imputation methods were hmisc and amelia. Additionally, oversampling the minority class combined with undersampling the majority class did not improve predictions of new HIV-positive cases when only the complete observations were used. However, increasing the minority class information through imputation enhanced sensitivity for predicting cases in this class.
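The two evaluation metrics named in the abstract, and the resampling strategy it mentions, can be illustrated with a minimal standard-library Python sketch. This is not the authors' code: the function names `accuracy_and_sensitivity` and `rebalance`, the toy labels, and the target of 50 rows per class are all assumptions made for illustration. It shows why overall accuracy alone can mislead when the positive class is rare, and what combining oversampling of the minority class with undersampling of the majority class looks like at the level of row indices.

```python
import random
from collections import Counter


def accuracy_and_sensitivity(y_true, y_pred, positive=1):
    """Return (accuracy, sensitivity) for two equal-length label sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    accuracy = correct / len(y_true)
    # Sensitivity is undefined when there are no actual positives.
    sensitivity = true_pos / actual_pos if actual_pos else float("nan")
    return accuracy, sensitivity


def rebalance(labels, n_per_class, seed=0):
    """Return row indices giving n_per_class rows for every class.

    Classes with at least n_per_class rows are undersampled without
    replacement; smaller classes are oversampled with replacement.
    """
    rng = random.Random(seed)
    by_class = {}
    for i, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(i)
    chosen = []
    for idx in by_class.values():
        if len(idx) >= n_per_class:
            chosen.extend(rng.sample(idx, n_per_class))
        else:
            chosen.extend(rng.choices(idx, k=n_per_class))
    return chosen


# Toy data (hypothetical): 90 negatives and 10 positives.
labels = [0] * 90 + [1] * 10
always_negative = [0] * 100

# A classifier that never predicts the rare class looks accurate
# but detects no positive cases.
print(accuracy_and_sensitivity(labels, always_negative))  # (0.9, 0.0)

# Resample to 50 rows per class and check the class counts.
balanced_idx = rebalance(labels, n_per_class=50)
print(Counter(labels[i] for i in balanced_idx))
```

Note that oversampling with replacement only repeats existing minority rows, whereas imputation recovers previously excluded observations; this is one way to read the abstract's contrast between the two approaches.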
Keywords: missing data imputation; random forest classification; amelia; hmisc; mice; missForest; imbalanced outcome
JEL-codes: C1 C10 C11 C14 C15 C16
Date: 2024
Downloads:
https://www.mdpi.com/2571-905X/7/4/82/pdf (application/pdf)
https://www.mdpi.com/2571-905X/7/4/82/ (text/html)
Persistent link: https://EconPapers.repec.org/RePEc:gam:jstats:v:7:y:2024:i:4:p:82-1420:d:1528246
Stats is currently edited by Mrs. Minnie Li
Bibliographic data for series maintained by MDPI Indexing Manager.