On Data-Enriched Logistic Regression

Cheng Zheng, Sayan Dasgupta, Yuxiang Xie, Asad Haris and Ying-Qing Chen
Additional contact information
Cheng Zheng: Department of Biostatistics, University of Nebraska Medical Center, Omaha, NE 68198, USA
Sayan Dasgupta: Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Center, Seattle, WA 98109, USA
Yuxiang Xie: Department of Biostatistics, University of Washington, Seattle, WA 98195, USA
Asad Haris: Department of Biostatistics, University of Washington, Seattle, WA 98195, USA
Ying-Qing Chen: Department of Medicine, Stanford University, Palo Alto, CA 94305, USA

Mathematics, 2025, vol. 13, issue 3, 1-21

Abstract: Biomedical researchers typically investigate the effects of specific exposures on disease risks within a well-defined population. The gold standard for such studies is to design a trial with an appropriately sampled cohort. However, because such trials are costly, the collected samples are often small, making it difficult to estimate the effects of certain exposures accurately. In this paper, we discuss how to leverage information from external “big data” (datasets with significantly larger sample sizes) to improve estimation accuracy at the risk of introducing a small amount of bias. We propose a family of weighted estimators that balance the increase in bias against the reduction in variance when incorporating the big data. We establish a connection between our proposed estimators and the well-known penalized regression estimators. We derive optimal weights using both second-order and higher-order asymptotic expansions. Through extensive simulation studies, we demonstrate that the improvement in mean squared error (MSE) for the regression coefficients can be substantial even with finite sample sizes, and that our weighted method outperforms existing approaches such as penalized regression and the James–Stein estimator. Additionally, we provide a theoretical guarantee that, in general, the proposed estimators never yield an asymptotic MSE larger than that of the maximum likelihood estimator based on the small data alone. Finally, we apply our proposed methods to the Asia Cohort Consortium China cohort data to estimate the relationships between age, BMI, smoking, alcohol use, and mortality.
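
The bias-variance trade-off described in the abstract can be illustrated with a scalar toy case. This is a minimal sketch and not the paper's estimator: it assumes an unbiased small-data estimate with variance `var_small`, an independent big-data estimate with variance `var_big` and bias `bias_big`, and a combined estimate `w * big + (1 - w) * small`. All function and parameter names are hypothetical, and in practice the bias is unknown and would have to be estimated.

```python
def combined_mse(w: float, var_small: float, var_big: float, bias_big: float) -> float:
    """MSE of w * big + (1 - w) * small under the toy assumptions:
    small is unbiased, big has bias `bias_big`, and the two are independent."""
    # Variance part: (1 - w)^2 * var_small + w^2 * var_big
    # Squared-bias part: (w * bias_big)^2
    return (1.0 - w) ** 2 * var_small + w ** 2 * (var_big + bias_big ** 2)


def optimal_weight(var_small: float, var_big: float, bias_big: float) -> float:
    """Weight on the big-data estimate that minimizes combined_mse.

    Obtained by setting the derivative of combined_mse in w to zero.
    Assumes var_small > 0 so the denominator is positive.
    """
    if var_small <= 0:
        raise ValueError("var_small must be positive")
    return var_small / (var_small + var_big + bias_big ** 2)


if __name__ == "__main__":
    # Illustrative (made-up) numbers: a noisy small study and a precise but biased big one.
    v_s, v_b, b = 0.04, 0.001, 0.1
    w_star = optimal_weight(v_s, v_b, b)
    print("optimal weight on big data:", w_star)
    print("MSE, small data only      :", combined_mse(0.0, v_s, v_b, b))
    print("MSE, optimally weighted   :", combined_mse(w_star, v_s, v_b, b))
```

In this toy setting the optimally weighted MSE equals `var_small * (var_big + bias_big**2) / (var_small + var_big + bias_big**2)`, which is never larger than `var_small`. That mirrors, in a much simpler form, the kind of guarantee the abstract states for the asymptotic MSE.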

Keywords: risk prediction; logistic regression; shrinkage estimator; big data
JEL-codes: C
Date: 2025

Downloads: (external link)
https://www.mdpi.com/2227-7390/13/3/441/pdf (application/pdf)
https://www.mdpi.com/2227-7390/13/3/441/ (text/html)

Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:13:y:2025:i:3:p:441-:d:1579109

Mathematics is currently edited by Ms. Emma He

Bibliographic data for series maintained by MDPI Indexing Manager.

 
Page updated 2025-03-22
Handle: RePEc:gam:jmathe:v:13:y:2025:i:3:p:441-:d:1579109