Squeezing More Out of Your Data: Business Record Linkage with Python

John Cuffe and Nathan Goldschlag

Working Papers from U.S. Census Bureau, Center for Economic Studies

Abstract: Integrating data from different sources has become a fundamental component of modern data analytics. Record linkage methods represent an important class of tools for accomplishing such integration. In the absence of common disambiguated identifiers, researchers often must resort to "fuzzy" matching, which allows imprecision in the characteristics used to identify common entities across different datasets. While the record linkage literature has identified numerous individually useful fuzzy matching techniques, there is little consensus on a way to integrate those techniques within a single framework. To this end, we introduce the Multiple Algorithm Matching for Better Analytics (MAMBA), an easy-to-use, flexible, scalable, and transparent software platform for business record linkage applications using Census microdata. MAMBA leverages multiple string comparators to assess the similarity of records using a machine learning algorithm to disambiguate matches. This software represents a transparent tool for researchers seeking to link external business data to the Census Business Register files.
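
The idea summarized in the abstract, several string comparators feeding a learned match decision, can be illustrated with a small sketch. This is not MAMBA's code: the three comparators, the function names, the labeled pairs, and the threshold rule are all assumptions made for illustration, and the learned threshold is a deliberately simple stand-in for the machine learning step, which the abstract does not specify.

```python
import re
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase and collapse runs of non-alphanumerics into one space."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def char_ratio(a, b):
    """Character-level similarity from the standard library."""
    return SequenceMatcher(None, a, b).ratio()


def token_jaccard(a, b):
    """Share of distinct words the two names have in common."""
    ta, tb = set(a.split()), set(b.split())
    if not (ta | tb):
        return 1.0  # both names are empty
    return len(ta & tb) / len(ta | tb)


def bigram_dice(a, b):
    """Dice coefficient over sets of character bigrams."""
    if a == b:
        return 1.0
    ga = {a[i:i + 2] for i in range(len(a) - 1)}
    gb = {b[i:i + 2] for i in range(len(b) - 1)}
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def features(a, b):
    """One similarity score per comparator, each in [0, 1]."""
    a, b = normalize(a), normalize(b)
    return [char_ratio(a, b), token_jaccard(a, b), bigram_dice(a, b)]


def score(a, b):
    """Unweighted mean of the comparator scores."""
    f = features(a, b)
    return sum(f) / len(f)


def fit_threshold(labeled_pairs):
    """Midpoint between the weakest labeled match and strongest non-match."""
    match = [score(a, b) for a, b, y in labeled_pairs if y]
    non_match = [score(a, b) for a, b, y in labeled_pairs if not y]
    return (min(match) + max(non_match)) / 2


def is_match(a, b, threshold):
    return score(a, b) >= threshold


# Hypothetical hand-labeled business-name pairs (True = same entity).
training = [
    ("Acme Manufacturing Inc.", "ACME Manufacturing Incorporated", True),
    ("Smith & Sons Bakery", "Smith and Sons Bakery", True),
    ("Acme Manufacturing Inc.", "Apex Marketing LLC", False),
    ("Smith & Sons Bakery", "Johnson Hardware", False),
]
threshold = fit_threshold(training)

print(features("Acme Mfg Inc", "Acme Manufacturing Inc."))
print(is_match("Acme Mfg Inc", "Acme Manufacturing Inc.", threshold))
```

In practice the feature vector returned by `features` would more plausibly be passed to a trained classifier rather than averaged, so that the weight given to each comparator is learned from the labeled pairs.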

Pages: 37
Date: 2018-11
New Economics Papers: this item is included in nep-big and nep-cmp
Citations: View citations in EconPapers (6)

Downloads: (external link)
https://www2.census.gov/ces/wp/2018/CES-WP-18-46.pdf First version, 2018 (application/pdf)

Persistent link: https://EconPapers.repec.org/RePEc:cen:wpaper:18-46

Bibliographic data for series maintained by Dawn Anderson.

 
Page updated 2025-03-30
Handle: RePEc:cen:wpaper:18-46