Textizing Statistical Tables using OCR at Scale
Yutaka Arimoto
Economic Review, 2022, vol. 73, issue 1, 15-28
Abstract:
This study describes the requirements and methods for textizing statistical tables using OCR (optical character recognition) at scale. A major challenge in textizing statistical tables with OCR is analyzing the table layout with high accuracy. I develop a Python toolkit, ocrstats, which supports the task by providing batch processing, automation of routine processes, use of external OCR, and table layout analysis with practical accuracy. In addition, I share practical tips learned from textizing the Japan Imperial Statistical Yearbook using ocrstats.
JEL-codes: Y1
Date: 2022
Downloads:
https://hermes-ir.lib.hit-u.ac.jp/hermes/ir/re/72558/keizaikenkyu07301015.pdf
Persistent link: https://EconPapers.repec.org/RePEc:hit:ecorev:v:73:y:2022:i:1:p:15-28
DOI: 10.15057/72558
More articles in Economic Review from Hitotsubashi University
Bibliographic data for series maintained by Digital Resources Section, Hitotsubashi University Library.