Automated historical census digitization using image augmentation and transformer-based methods
Leonardo Costa Ribeiro,
Jonatan Andersson,
William Skoglund,
Jakob Molinder and
Martin Önnerfors
Additional contact information
Leonardo Costa Ribeiro: Federal University of Minas Gerais
Jonatan Andersson: Uppsala University
William Skoglund: Lund University
Jakob Molinder: Uppsala University
Martin Önnerfors: Uppsala University
No 298, Working Papers from European Historical Economics Society (EHES)
Abstract:
A large literature in economic history uses digitized census data to study individual-level outcomes in history. Although many census records have been digitized manually, the process is extremely labor-intensive, and substantial material remains unprocessed in archives. Recent advances in machine learning offer the potential to automate a large part of this work. We demonstrate an end-to-end digitization pipeline based on the transformer-based Donut model, trained on hand-annotated data and enhanced with image augmentation, to extract information from the 1955 Stockholm tax and census records. The resulting output attains high accuracy across multiple evaluation metrics.
Keywords: Digitization; Census; OCR; Transformers
JEL-codes: N01
Date: 2026-02
New Economics Papers: this item is included in nep-big and nep-his
Downloads:
https://ehes.org/wp/EHES_298.pdf (application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:hes:wpaper:0298
Bibliographic data for series maintained by Christian Vedel.