Stata's mishandling of missing data: A problem and two solutions
Kenneth I. MacDonald
Kenneth I. MacDonald: Nuffield College, University of Oxford
United Kingdom Stata Users' Group Meetings 2008 from Stata Users Group
Abstract:
The design decisions made by Stata in handling missing data in relational and logical expressions have, for the user, complex, pernicious, and poorly understood consequences. This presentation intends to substantiate that claim and to present two possible resolutions to the problem. As is well documented and reasonably well known, Stata considers p & q (and p | q) to be true when both p and q are indeterminate. This interpretation is counterintuitive and at odds with the formal-logic definition of these operators. To assert two unknowns is not to assert truth. Nevertheless, introductions to Stata characteristically present this as merely a “feature” and suggest that the obligation imposed on users (us) to explicitly test for missing data is straightforwardly implementable. Simple cases are indeed simple but, it will be argued, do not readily scale up to complex, real-life instances. For example, the one-line Stata command to implement the intention, "generate v = p|q", becomes "generate v = p|q if !mi(p,q)|(p&!mi(p))|(q&!mi(q))", and so forth. Such coding is a problem, not a feature, so solutions should be sought. One solution (really a work-around) introduces my command, validly, which allows expressions such as "validly generate v = p|q" and correctly, without fuss, interprets the logical or relational operators (here returning true if p is true but q indeterminate and indeterminate if p is false but q indeterminate). More generally, validly serves as a “wrapper” for any standard conditional command. So, for example, "validly reg a b c if p|q" is handled correctly. But validly (its code deploys nested calls to cond()) is computationally expensive. The better resolution would be for Stata, in its next release, to redesign its core code so that logical and relational operators would (as algebraic operators currently do) handle missing data appropriately. (Objections to this strategy are examined and deemed to lack force.)
I would like to enlist the informed and active judgment of the participants of the 14th Users Group meeting to help bring this about.
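The behaviour described in the abstract can be illustrated with a small do-file. This is a minimal sketch on a toy dataset: the variable names naive_or, naive_and, guarded_or, and kleene_or are hypothetical, and the cond() expression is only an illustration of the three-valued "or" described above, not the code of validly.

```stata
* Toy data: every combination of true (1), false (0), and missing (.)
clear
input p q
1 1
1 0
1 .
0 1
0 0
0 .
. 1
. 0
. .
end

* Plain operators: a missing value counts as nonzero, hence as "true"
generate naive_or  = p | q
generate naive_and = p & q

* The guarded form quoted in the abstract: the result is left missing
* unless both operands are known or one known operand is true
generate guarded_or = p | q if !mi(p,q) | (p & !mi(p)) | (q & !mi(q))

* Illustrative nested cond() giving the same three-valued "or"
* (an assumption for this sketch, not the implementation of validly)
generate kleene_or = cond((p & !mi(p)) | (q & !mi(q)), 1, cond(mi(p,q), ., 0))

list, separator(0)
```

In the listing, naive_or and naive_and are both 1 in the row where p and q are missing, whereas guarded_or and kleene_or are 1 when p is 1 and q is missing, and missing when p is 0 and q is missing.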
Date: 2008-09-11
Downloads:
http://repec.org/usug2008/KIMacD.presentation.ppt presentation slides (application/x-ms-powerpoint)
Persistent link: https://EconPapers.repec.org/RePEc:boc:usug08:01