# The effect of missing data on covariates in survival analysis

*Irit Aitkin*

Irit Aitkin: Department of Psychology, University of Melbourne

No 6, Australasian Stata Users' Group Meetings 2004 from Stata Users Group

**Abstract:**
We consider survival analysis when data on covariates are missing. More specifically, we examine the factors affecting the duration of breastfeeding in Western Australia. Duration was studied in 556 women delivering at two maternity hospitals in Perth, Australia, over the period September 1992 to April 1993. Of these, 466 women were breastfeeding when they left the hospital. In a previous analysis, the Cox proportional hazards model was fitted to determine the factors affecting duration of breastfeeding. However, because of missing data, smoking, a covariate known to be important, could not be used: including it would have meant losing almost 50% of the available sample. In this analysis, we incorporate the incomplete smoking data omitted from the previous analysis. We deal with the missing covariate data in two ways: the first is maximum likelihood and the second is multiple imputation. Direct maximization of the likelihood with missing data is complicated, and most methods that perform maximum likelihood estimation (for example, the EM algorithm) use some form of data augmentation, which augments the observed data with latent (unobserved) data so that very complicated calculations are replaced by much simpler ones given the "complete data". The distribution of response time for cases with smoking missing is no longer a Cox model but a mixture of two such models, in proportions given by the population proportions of smokers and non-smokers. The likelihood function is therefore different for complete and incomplete cases, and maximizing it is more complicated because it must allow for this difference. We carried out the ML analysis in Stata using GLLAMM (Generalized Linear Latent And Mixed Models) routines (Rabe-Hesketh, Pickles, and Skrondal 2001).
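
The mixture described above can be written out as a sketch. The notation is assumed here for illustration and does not come from the abstract: $\pi$ is the population proportion of smokers, $s \in \{0,1\}$ is smoking status, $x_i$ holds the other covariates with coefficients $\beta$, $\gamma$ is the smoking coefficient, $h_0$ and $H_0$ are the baseline and cumulative baseline hazards, and $\delta_i = 1$ if weaning was observed at time $t_i$ and $0$ if the duration was censored. Under those assumptions, a case with smoking missing would contribute

$$
L_i \;=\; \sum_{s \in \{0,1\}} \pi^{s}(1-\pi)^{1-s}\,
\bigl[h_0(t_i)\,e^{x_i^\top\beta + \gamma s}\bigr]^{\delta_i}\,
\exp\!\bigl\{-H_0(t_i)\,e^{x_i^\top\beta + \gamma s}\bigr\},
$$

whereas a case with smoking observed would contribute only the single term with $s$ equal to the observed value. This is why the likelihood takes a different form for complete and incomplete cases.
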
In the GLLAMM procedure, a latent smoking variable is defined for the cases with smoking missing, and the breastfeeding durations are regressed on the explanatory variables and smoking, using the observed covariate where it is available and the latent variable where it is not. The model for the smoking covariate is a "measurement model" when the covariate is observed and a "structural model" when it is not. We compared ML using GLLAMM with multiple imputation using the program written by J. L. Schafer, mainly for S-Plus/R, which is based on the data augmentation algorithm (Tanner and Wong 1987).
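
The abstract does not say how estimates from the imputed data sets were combined. A common choice after multiple imputation is Rubin's rules, and the Python sketch below illustrates that step only, as an assumption rather than the author's procedure. The function name `pool_rubin` and the input numbers are hypothetical and are not results from the study.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Combine one coefficient across m imputed data sets using Rubin's rules."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and one variance per estimate")
    q_bar = mean(estimates)            # pooled point estimate
    u_bar = mean(variances)            # average within-imputation variance
    b = variance(estimates)            # between-imputation variance (divisor m - 1)
    total = u_bar + (1 + 1 / m) * b    # total variance of the pooled estimate
    return q_bar, total


# Hypothetical log hazard ratios for smoking from m = 3 imputed data sets
# (illustrative values only, not taken from the breastfeeding study).
log_hr = [0.30, 0.42, 0.36]
squared_se = [0.010, 0.012, 0.011]
pooled, total_var = pool_rubin(log_hr, squared_se)
print(pooled, total_var ** 0.5)  # pooled estimate and its standard error
```

The between-imputation term is what carries the extra uncertainty due to the missing smoking values. A single imputation would omit it and so understate the standard error.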

**Persistent link:** http://EconPapers.repec.org/RePEc:boc:osug04:6
