dtalink: Faster probabilistic record linking and deduplication methods in Stata for large data files

Keith Kranker

2018 Stata Conference from Stata Users Group

Abstract: Stata users often need to link records from two or more data files, or find duplicates within data files. Probabilistic linking methods are often used when the files do not have reliable or unique identifiers, causing deterministic linking methods (such as Stata's merge or duplicates commands) to fail. For example, one might need to link files that include only inconsistently spelled names, dates of birth with typos or missing data, and addresses that change over time. Probabilistic linkage methods score each potential pair of records on the probability that the two records match, so that pairs with higher overall scores indicate a better match than pairs with lower scores. Two user-written Stata commands for probabilistic linking exist (reclink and reclink2), but they do not scale efficiently. dtalink is a new program that offers streamlined probabilistic linking methods implemented in parallelized Mata code. Significant speed improvements make it practical to apply probabilistic linking methods to large administrative data files (files with many rows or matching variables), and new features offer more flexible scoring and many-to-many matching techniques. The presentation introduces dtalink, discusses useful tips and tricks, and provides an example of linking Medicaid and birth certificate data.
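
The pair-scoring idea described in the abstract can be illustrated with a minimal Python sketch. This is not dtalink's Mata implementation or its syntax. The field names, the weights, the cutoff, and the rule of skipping fields that are missing in either record are all assumptions made for illustration. Each candidate pair gains points when a field agrees and loses points when it disagrees, and pairs scoring at or above a cutoff are kept as potential matches.

```python
from itertools import product

# Hypothetical weights: (points added on agreement, points subtracted on disagreement).
WEIGHTS = {
    "last_name": (7, 4),
    "first_name": (5, 3),
    "dob": (9, 6),
}


def score_pair(a, b, weights=WEIGHTS):
    """Sum agreement and disagreement weights for one pair of records.

    Assumption: a field missing (None) in either record contributes nothing.
    """
    total = 0
    for field, (agree, disagree) in weights.items():
        va, vb = a.get(field), b.get(field)
        if va is None or vb is None:
            continue
        total += agree if va == vb else -disagree
    return total


def link(file_a, file_b, cutoff, weights=WEIGHTS):
    """Score every cross-file pair and keep those at or above the cutoff.

    Returns (index_in_a, index_in_b, score) tuples, highest score first.
    Comparing all pairs is quadratic, so this only suits small inputs.
    """
    kept = []
    for (i, a), (j, b) in product(enumerate(file_a), enumerate(file_b)):
        s = score_pair(a, b, weights)
        if s >= cutoff:
            kept.append((i, j, s))
    return sorted(kept, key=lambda p: -p[2])


if __name__ == "__main__":
    file_a = [
        {"last_name": "smith", "first_name": "ann", "dob": "1990-01-02"},
        {"last_name": "jones", "first_name": "bob", "dob": None},
    ]
    file_b = [
        {"last_name": "smith", "first_name": "anne", "dob": "1990-01-02"},
        {"last_name": "jones", "first_name": "bob", "dob": "1985-05-05"},
    ]
    for i, j, s in link(file_a, file_b, cutoff=10):
        print(i, j, s)
```

Because nothing forces a record to appear in only one kept pair, a loop like this naturally allows many-to-many results. Exact string equality is used here for simplicity, so "ann" and "anne" count as a disagreement, whereas a real linkage tool would typically need some tolerance for spelling variation.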

Date: 2018-08-02

Downloads: (external link)
http://fmwww.bc.edu/repec/scon2018/columbus18_Kranker.pdf

Persistent link: https://EconPapers.repec.org/RePEc:boc:scon18:31

Handle: RePEc:boc:scon18:31