Is there a competitive advantage to using multivariate statistical or machine learning methods over the Bross formula in the hdPS framework for bias and variance estimation?

Karim, Mohammad Ehsanul; Lei, Yang

PLOS ONE, 2025, vol. 20, issue 5, 1-19

Abstract: Purpose: This study systematically evaluates and compares traditional statistical methods and machine learning approaches for proxy selection within the high-dimensional propensity score (hdPS) framework, focusing on bias, standard error (SE), and coverage under various exposure and outcome prevalence scenarios. Methods: We conducted a plasmode simulation study using data from the National Health and Nutrition Examination Survey (NHANES) cycles from 2013 to 2018. We compared the kitchen sink model, Bross-based hdPS, Hybrid hdPS, LASSO, Elastic Net, Random Forest, XGBoost, and Genetic Algorithm (GA). Each inverse probability weighted method was assessed on bias, mean squared error (MSE), coverage probability, and SE estimation across three epidemiological scenarios: frequent exposure and outcome, rare exposure and frequent outcome, and frequent exposure and rare outcome. Results: XGBoost consistently performed well in terms of MSE and coverage, making it effective when precision is the priority. However, it exhibited higher bias, particularly in rare exposure scenarios, suggesting it is less suited when minimizing bias is critical. In contrast, GA showed significant limitations, with consistently high bias and MSE, making it the least reliable method. The Bross-based hdPS and Hybrid hdPS methods provided a balanced approach, with low bias and moderate MSE, though coverage varied by scenario. Rare outcome scenarios generally resulted in lower MSE and better precision, while rare exposure scenarios were associated with higher bias and MSE. Notably, traditional statistical approaches such as forward selection and backward elimination performed comparably to more sophisticated machine learning methods in terms of bias and coverage, suggesting that these simpler approaches may be viable alternatives due to their computational efficiency. Conclusion: The results highlight the importance of selecting hdPS methods based on the specific characteristics of the data, such as exposure and outcome prevalence. While advanced machine learning methods such as XGBoost can enhance precision, simpler methods such as forward selection or backward elimination may offer similar performance in terms of bias and coverage with fewer computational demands. Tailoring the choice of method to the epidemiological scenario is essential for optimizing the balance between bias reduction and precision.
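The abstract names the Bross formula without stating it. As an illustration only, the sketch below shows the form commonly used for proxy prioritization in the hdPS literature: each binary proxy is scored by |log(Bias_M)|, where Bias_M = [P_C1(RR_CD - 1) + 1] / [P_C0(RR_CD - 1) + 1], P_C1 and P_C0 are the proxy's prevalence among the exposed and unexposed, and RR_CD is the proxy-outcome relative risk (inverted when below 1). The function name `bross_scores` and the toy data are hypothetical, and the paper's own implementation may differ in its details.

```python
import numpy as np

def bross_scores(proxies, exposure, outcome):
    """Score binary proxies by |log(Bias_M)| (Bross bias multiplier).

    Assumes every proxy has subjects at both levels and that the outcome
    occurs at least once at each level; otherwise the ratios are undefined.
    """
    proxies = np.asarray(proxies, dtype=float)   # shape (n_subjects, n_proxies)
    exposed = np.asarray(exposure, dtype=bool)   # shape (n_subjects,)
    outcome = np.asarray(outcome, dtype=float)   # shape (n_subjects,)

    # Prevalence of each proxy among exposed and unexposed subjects.
    p_c1 = proxies[exposed].mean(axis=0)
    p_c0 = proxies[~exposed].mean(axis=0)

    # Outcome risk among subjects with and without each proxy.
    risk_with = (proxies * outcome[:, None]).sum(axis=0) / proxies.sum(axis=0)
    risk_without = ((1 - proxies) * outcome[:, None]).sum(axis=0) / (1 - proxies).sum(axis=0)

    # Proxy-outcome relative risk, inverted when below 1.
    rr_cd = risk_with / risk_without
    rr_cd = np.where(rr_cd < 1, 1 / rr_cd, rr_cd)

    bias_m = (p_c1 * (rr_cd - 1) + 1) / (p_c0 * (rr_cd - 1) + 1)
    return np.abs(np.log(bias_m))

# Toy data (hypothetical): 8 subjects, 2 candidate proxies.
exposure = [1, 1, 1, 1, 0, 0, 0, 0]
outcome = [1, 1, 0, 1, 1, 0, 0, 0]
proxies = np.array([
    [1, 1], [1, 0], [1, 1], [0, 0],
    [1, 1], [0, 0], [0, 1], [0, 0],
])

scores = bross_scores(proxies, exposure, outcome)
order = np.argsort(-scores).tolist()  # proxy indices, highest score first
print(scores, order)
```

In this toy example the first proxy is associated with both exposure and outcome, so it ranks ahead of the second, which is balanced across exposure groups and unrelated to the outcome. In an hdPS analysis the top-ranked proxies would then enter the propensity score model; the number retained is a tuning choice not specified in the abstract.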

Date: 2025

Downloads: (external link)
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0324639 (text/html)
https://journals.plos.org/plosone/article/file?id= ... 24639&type=printable (application/pdf)

Persistent link: https://EconPapers.repec.org/RePEc:plo:pone00:0324639

DOI: 10.1371/journal.pone.0324639
