A New Algorithm to Efficiently Match U.S. Census Records and Balance Representativity with Match Quality

Eric S. M. Protzer, Sultan Orazbayev, Andres Gomez-Lievano, Matte Hartog and Frank Neffke
Additional contact information
Eric S. M. Protzer: Center for Global Development
Andres Gomez-Lievano: Center for International Development at Harvard University
Matte Hartog: Center for International Development at Harvard University

No 238, Growth Lab Working Papers from Harvard's Growth Lab

Abstract: We introduce a record linkage algorithm that allows one to (1) efficiently match hundreds of millions of records based not just on demographic characteristics but also on name similarity, (2) make statistical choices regarding the trade-off between match quality and representativity, and (3) automatically generate a ground truth of true and false matches, suitable for training purposes, based on networked family relationships. Given the recent availability of hundreds of millions of digitized census records, this algorithm significantly reduces computational costs to researchers while allowing them to tailor their matching design to the research question at hand (e.g., prioritizing external validity over match quality). Applied to U.S. Census records from 1850 to 1940, the algorithm produces two sets of matches: one designed for representativity and one designed to maximize the number of matched individuals. At the same level of accuracy as commonly used methods, the algorithm tends to achieve higher representativity and a larger pool of matches. The algorithm also allows one to match harder-to-match groups with less bias (e.g., women, whose names tend to change over time due to marriage).
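
To make the first two ideas in the abstract concrete, here is a minimal sketch of record linkage that blocks on demographic characteristics and then scores name similarity. It is not the authors' algorithm: the field names (`id`, `name`, `birth_year`, `birthplace`), the functions `name_similarity` and `link_records`, the parameters `threshold` and `year_tolerance`, and the use of Python's `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration. The family-network ground truth described in point (3) is not shown.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Return a similarity ratio in [0, 1] for two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_records(earlier, later, threshold=0.85, year_tolerance=1):
    """Link records across two censuses (illustrative only, not the paper's method).

    Records are dicts with hypothetical keys: id, name, birth_year, birthplace.
    Returns a list of (earlier_id, later_id, similarity) tuples.
    """
    # Blocking: index the later census by (birthplace, birth_year) so each
    # earlier record is compared only with demographically plausible candidates
    # instead of with every record.
    blocks = defaultdict(list)
    for rec in later:
        blocks[(rec["birthplace"], rec["birth_year"])].append(rec)

    matches = []
    for rec in earlier:
        # Allow a small tolerance in reported birth year.
        candidates = []
        for year in range(rec["birth_year"] - year_tolerance,
                          rec["birth_year"] + year_tolerance + 1):
            candidates.extend(blocks.get((rec["birthplace"], year), []))

        # Score each candidate by name similarity and keep those above the threshold.
        scored = [(name_similarity(rec["name"], c["name"]), c) for c in candidates]
        scored = [(s, c) for s, c in scored if s >= threshold]

        # Accept only unambiguous links: exactly one candidate clears the threshold.
        if len(scored) == 1:
            score, cand = scored[0]
            matches.append((rec["id"], cand["id"], score))
    return matches
```

In this sketch, `threshold` is the knob that stands in for the quality versus representativity trade-off: a high value keeps only near-identical names, while a lower value links more people (including those whose names changed) at the cost of more potential false matches.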

Keywords: U.S. Census; Machine Learning; Network Science
Date: 2024-12

Downloads: (external link)
https://growthlab.hks.harvard.edu/sites/projects.i ... 8-ipums_matching.pdf (application/pdf)

Persistent link: https://EconPapers.repec.org/RePEc:glh:wpfacu:238

Bibliographic data for series maintained by Chuck McKenney.

 
Page updated 2025-03-30
Handle: RePEc:glh:wpfacu:238