Correcting Misclassification Bias in Regression Models with Variables Generated via Data Mining

Qiao, Mengke; Huang, Ke-Wei

Correcting Misclassification Bias in Regression Models with Variables Generated via Data Mining

Mengke Qiao () and Ke-Wei Huang ()
Additional contact information
Mengke Qiao: International Institute of Finance, School of Management, University of Science and Technology of China, Hefei 230026, China
Ke-Wei Huang: Department of Information Systems and Analytics, National University of Singapore, Singapore 117417

Information Systems Research, 2021, vol. 32, issue 2, 462-480

Abstract: As a result of advances in data mining, more and more empirical studies in the social sciences apply classification algorithms to construct independent or dependent variables for further analysis via standard regression methods. In the classification phase of these studies, researchers need to subjectively choose a classification performance metric for optimization in the standard procedure. No matter which performance metric is chosen, the constructed variable still includes classification error because those variables cannot be classified perfectly. The misclassification of constructed variables will lead to inconsistent regression coefficient estimates in the following phase, which has been documented as a problem of measurement error in the econometrics literature. The pioneering discussions on the issue of estimation inconsistency because of misclassification in these studies have been provided. Our study attempts to investigate systematically the theoretical foundation of this problem when a newly constructed variable is used as the independent or dependent variable in linear and nonlinear regressions. Our theoretical analysis shows that consistent regression estimators can be recovered in all models studied in this paper. The main implication of our theoretical result is that researchers do not need to tune the classification algorithm to minimize the inconsistency of estimated regression coefficients because the inconsistency can be corrected by theoretical formulas, even when the classification accuracy is poor. Instead, we propose that a classification algorithm should be tuned to minimize the standard error of the focal regression coefficient derived based on the corrected formula. As a result, researchers can derive a consistent and most precise estimator in all models studied in this paper.

Keywords: data mining; econometrics; measurement error; misclassification; statistical inference; performance metric (search for similar items in EconPapers)
Date: 2021
References: View references in EconPapers View complete reference list from CitEc
Citations: View citations in EconPapers (4)

Downloads: (external link)
http://dx.doi.org/10.1287/isre.2020.0977 (application/pdf)

Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.

Export reference: BibTeX RIS (EndNote, ProCite, RefMan) HTML/Text

Persistent link: https://EconPapers.repec.org/RePEc:inm:orisre:v:32:y:2021:i:2:p:462-480

Access Statistics for this article

More articles in Information Systems Research from INFORMS Contact information at EDIRC.
Bibliographic data for series maintained by Chris Asher ().