Automated historical census digitization using image augmentation and transformer-based methods

Leonardo Costa Ribeiro, Jonatan Andersson, William Skoglund, Jakob Molinder and Martin Önnerfors
Additional contact information
Leonardo Costa Ribeiro: Federal University of Minas Gerais
Jonatan Andersson: Uppsala University
William Skoglund: Lund University
Jakob Molinder: Uppsala University
Martin Önnerfors: Uppsala University

No 298, Working Papers from European Historical Economics Society (EHES)

Abstract: A large literature in economic history uses digitized census data to study individual-level outcomes in history. Although many census records have been digitized manually, the process is extremely labor-intensive, and substantial material remains unprocessed in archives. Recent advances in machine learning offer the potential to automate a large part of this work. We demonstrate an end-to-end digitization pipeline based on the transformer-based Donut model, trained on hand-annotated data and enhanced with image augmentation, to extract information from the 1955 Stockholm tax and census records. The resulting output attains high accuracy across multiple evaluation metrics.

Keywords: Digitization; Census; OCR; Transformers
JEL-codes: N01
Date: 2026-02
New Economics Papers: this item is included in nep-big and nep-his

Downloads: (external link)
https://ehes.org/wp/EHES_298.pdf (application/pdf)

Persistent link: https://EconPapers.repec.org/RePEc:hes:wpaper:0298

Bibliographic data for series maintained by Christian Vedel.

 
Page updated 2026-04-05
Handle: RePEc:hes:wpaper:0298