Digitizing historical balance sheet data: A practitioner’s guide

Sergio Correia and Stephan Luck

Explorations in Economic History, 2023, vol. 87, issue C

Abstract: This paper discusses how to successfully digitize large-scale historical micro-data by augmenting optical character recognition (OCR) engines with pre- and post-processing methods. Although OCR software has improved dramatically in recent years thanks to advances in machine learning, off-the-shelf OCR applications still have high error rates, which limits their usefulness for accurately extracting structured information. Complementing OCR with additional methods can, however, dramatically increase its success rate, making it a powerful and cost-efficient tool for economic historians. This paper showcases these methods and explains why they are useful. We apply them to two large balance sheet datasets and introduce quipucamayoc, a Python package containing these methods in a unified framework.

Keywords: OCR; Data extraction; Balance sheets
JEL-codes: C81 C88 N80
Date: 2023

Downloads: (external link)
http://www.sciencedirect.com/science/article/pii/S0014498322000535
Full text for ScienceDirect subscribers only

Related works:
Working Paper: Digitizing Historical Balance Sheet Data: A Practitioner's Guide (2022)

Persistent link: https://EconPapers.repec.org/RePEc:eee:exehis:v:87:y:2023:i:c:s0014498322000535

DOI: 10.1016/j.eeh.2022.101475

Explorations in Economic History is currently edited by R.H. Steckel

More articles in Explorations in Economic History from Elsevier
Bibliographic data for series maintained by Catherine Liu.

 
Page updated 2025-04-07
Handle: RePEc:eee:exehis:v:87:y:2023:i:c:s0014498322000535