Textizing statistical tables using OCR at scale (OCRを利用した統計表の体系的なテキストデータ化)

Yutaka Arimoto (有本 寛)

No 2021-03, CEI Working Paper Series from Center for Economic Institutions, Institute of Economic Research, Hitotsubashi University

Abstract: This paper describes the requirements and methods for converting statistical tables into text data with OCR, systematically and at scale. The main challenge in textizing statistical tables with OCR is analyzing the table layout with high accuracy. I am developing a Python toolkit, ocrstats, that improves work efficiency by providing batch processing, automation of routine steps, use of external OCR, and table layout analysis with practical accuracy. I also share practical know-how gained from textizing the Japan Imperial Statistical Yearbook (『日本帝国統計年鑑』) with ocrstats, and explain how to link variables across years when constructing panel data.

Pages: 18
Date: 2021-07
Note: July 21, 2021

Downloads: (external link)
https://hit-u.repo.nii.ac.jp/record/2056456/files/wp2021-03.pdf

Persistent link: https://EconPapers.repec.org/RePEc:hit:hitcei:2021-03

Bibliographic data for series maintained by Reiko Suzuki.