Improving the representativeness of a simple random sample: an optimization model and its application to the Continuous Sample of Working Lives
Juan Manuel Pérez-Salamero González,
Marta Regúlez-Castillo and
Carlos Vidal-Melia
Additional contact information
Juan Manuel Pérez-Salamero González: Department of Financial Economics and Actuarial Science, Faculty of Economics, University of Valencia, Valencia. (Spain).
Marta Regúlez-Castillo: Department of Applied Economics III (Econometrics and Statistics), Faculty of Economics and Business, University of the Basque Country UPV/EHU, Bilbao. (Spain).
No 2019-20, Documentos de Trabajo del ICAE from Universidad Complutense de Madrid, Facultad de Ciencias Económicas y Empresariales, Instituto Complutense de Análisis Económico
This paper develops an optimization model for selecting a large subsample that improves the representativeness of a simple random sample previously obtained from a population larger than the population of interest. The problem is formulated as a convex mixed-integer nonlinear program (convex MINLP) and is therefore NP-hard. However, the solution is found by maximizing the “constant of proportionality” (in other words, maximizing the size of the subsample taken from a stratified random sample with proportional allocation) and restricting it to a p-value high enough to achieve a good fit to the population of interest using Pearson’s chi-square goodness-of-fit test. An advantage of the model is that it lets the user choose between a larger subsample with a poorer fit and a smaller subsample with a better fit. The paper also applies the model to a real case: the Continuous Sample of Working Lives (CSWL), a set of anonymized microdata containing information on individuals from Spanish Social Security records. Several waves (2005-2017) are first examined without using the model, and the conclusion is that they are not representative of the target population, which in this case is people receiving a pension income. The model is then applied, and the results show that it is possible to obtain a large dataset from the CSWL that represents the pensioner population far better for each of the waves analysed.
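The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' convex MINLP: it covers only the exact proportional-allocation end of the trade-off, where the constant of proportionality is pushed as high as the scarcest stratum allows, and it omits the p-value constraint that lets the user accept a poorer fit for a larger subsample. The stratum counts and population shares are hypothetical, and the function names (`chi2_sf`, `gof_p_value`, `proportional_subsample`) are invented for this illustration.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with df degrees of freedom.

    Uses the power series of the regularized lower incomplete gamma function.
    Adequate for moderate statistics; not tuned for extreme tails.
    """
    if x <= 0:
        return 1.0
    a, h = df / 2.0, x / 2.0
    term = total = 1.0 / a
    n = 0
    while term > total * 1e-15:
        n += 1
        term *= h / (a + n)
        total += term
    lower = total * math.exp(a * math.log(h) - h - math.lgamma(a))
    return max(0.0, 1.0 - lower)


def gof_p_value(counts, shares):
    """P-value of Pearson's chi-square goodness-of-fit test against population shares."""
    n = sum(counts)
    stat = sum((c - n * s) ** 2 / (n * s) for c, s in zip(counts, shares))
    return chi2_sf(stat, len(counts) - 1)


def proportional_subsample(sample_counts, shares):
    """Largest subsample with proportional allocation that the sample can supply.

    k is the constant of proportionality: the largest value such that
    k * share_i does not exceed the available count in any stratum.
    """
    k = min(c / s for c, s in zip(sample_counts, shares))
    # The small epsilon guards against floating-point results just below an integer.
    return [min(c, math.floor(k * s + 1e-9)) for c, s in zip(sample_counts, shares)]


if __name__ == "__main__":
    shares = [0.5, 0.3, 0.2]   # hypothetical population shares per stratum
    sample = [400, 360, 240]   # hypothetical simple random sample counts
    sub = proportional_subsample(sample, shares)
    print("original sample p-value:", gof_p_value(sample, shares))
    print("subsample counts:", sub, "size:", sum(sub))
    print("subsample p-value:", gof_p_value(sub, shares))
```

In this toy case the first stratum is the binding one, so the subsample keeps all of its units and discards the excess from the other two. Units to drop within a stratum would be chosen at random. Accepting a lower p-value threshold, as the paper's model allows, would permit a larger subsample than this exact-fit solution.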
Keywords: Optimization; Subsampling; Chi-square test; P-value; Continuous Sample of Working Lives.
JEL-codes: C61 C81 C12 H55 J26
Pages: 30 pages
Persistent link: https://EconPapers.repec.org/RePEc:ucm:doicae:1920
Ordering information: This working paper can be ordered from
Facultad de Ciencias Económicas y Empresariales. Pabellón prefabricado, 1ª Planta, ala norte. Campus de Somosaguas, 28223 - POZUELO DE ALARCÓN (MADRID)
https://www.ucm.es/f ... -de-trabajo-del-icae
Bibliographic data for series maintained by Águeda González Abad.