Bias and Efficiency Loss Due to Categorizing an Explanatory Variable
Jeremy M. G. Taylor and Menggang Yu
Journal of Multivariate Analysis, 2002, vol. 83, issue 1, 248-263
Abstract:
It is a common situation in biomedical research that one or more variables are known to be associated with the outcome of interest. Researchers often discretize some variables and fit a regression model using these discretized variables. Although convenient for illustration purposes, such an approach can be biased and lead to loss of efficiency. In this article, we consider the situation of a regression model with two explanatory variables under an assumption of multivariate normality. We investigate the effect of dichotomizing or categorizing one variable on the estimate of the coefficient of the other continuous variable and on prediction from the models. Algebraic expressions are presented for the asymptotic bias and variance of the coefficient of the continuous explanatory variable and for the residual sum of squares for prediction. Some numerical examples are presented in which we find that the bias of the coefficient of the continuous explanatory variable is always smaller for the categorized model than for the dichotomized model. The size of the test of a zero coefficient for the continuous variable depends only on the correlations between the response variable, the discretized variable, and the continuous variable. The size of the test for the categorized model is always smaller than for the dichotomized model; however, both can differ substantially from the nominal level if the correlation between the response and the categorical variable or between the two explanatory variables is high. The (predictive) relative efficiency of the models also depends only on the correlations among the three variables. There is a substantial loss of efficiency due to categorization if the correlation between the categorized variable and the response variable is high. The predictive relative efficiency is always higher for the categorized model. The relative predictive efficiency due to dichotomization depends on the choice of cut points, with the least loss of efficiency being achieved at the median.
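The qualitative findings in the abstract can be illustrated with a small Monte Carlo sketch. This is not the authors' method, which uses algebraic expressions for the asymptotic bias and variance. Everything specific here is an assumption made for illustration: the function names `fit_ols` and `simulate`, the sample size, the correlation `rho = 0.6`, the coefficients `beta1 = 0.5` and `beta2 = 1.0`, unit error variance, a median cut point for the dichotomized model, and quartile groups for the categorized model (the abstract does not state the number of categories).

```python
import numpy as np


def fit_ols(y, columns):
    """Least-squares fit with an intercept.

    Returns the coefficient vector (intercept first) and the
    residual sum of squares.
    """
    X = np.column_stack([np.ones_like(y)] + list(columns))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return coef, float(resid @ resid)


def simulate(n=50_000, rho=0.6, beta1=0.5, beta2=1.0, seed=0):
    """Compare continuous, dichotomized, and categorized use of x2.

    All parameter values are illustrative assumptions, not values
    taken from the paper.
    """
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    x = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    x1, x2 = x[:, 0], x[:, 1]
    y = beta1 * x1 + beta2 * x2 + rng.standard_normal(n)

    # Reference model: x2 kept continuous.
    coef_cont, rss_cont = fit_ols(y, [x1, x2])

    # Dichotomized model: x2 split at its sample median.
    above = (x2 > np.median(x2)).astype(float)
    coef_dich, rss_dich = fit_ols(y, [x1, above])

    # Categorized model: x2 cut into quartile groups, dummy coded
    # with the lowest group as the reference level.
    edges = np.quantile(x2, [0.25, 0.5, 0.75])
    group = np.digitize(x2, edges)  # integer labels 0..3
    dummies = [(group == g).astype(float) for g in (1, 2, 3)]
    coef_cat, rss_cat = fit_ols(y, [x1] + dummies)

    # coef[1] is always the coefficient of the continuous variable x1.
    return {
        "continuous": {"bias_x1": coef_cont[1] - beta1, "rss": rss_cont},
        "dichotomized": {"bias_x1": coef_dich[1] - beta1, "rss": rss_dich},
        "categorized": {"bias_x1": coef_cat[1] - beta1, "rss": rss_cat},
    }


if __name__ == "__main__":
    for name, stats in simulate().items():
        print(f"{name:13s} bias_x1={stats['bias_x1']:+.3f} rss={stats['rss']:.0f}")
```

Under these assumed settings, the estimated coefficient of `x1` should sit close to its true value when `x2` is continuous, drift further when `x2` is cut into quartile groups, and drift furthest when `x2` is dichotomized, with the residual sum of squares ordered the same way. Varying `rho` or `beta2` gives a rough feel for how the correlations drive the bias and the loss of predictive efficiency.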
Keywords: cutpoints; discretization; regression
Date: 2002
Downloads: (external link)
http://www.sciencedirect.com/science/article/pii/S0047-259X(01)92045-7
Full text for ScienceDirect subscribers only
Persistent link: https://EconPapers.repec.org/RePEc:eee:jmvana:v:83:y:2002:i:1:p:248-263
Journal of Multivariate Analysis is currently edited by de Leeuw, J.