OCR with Tesseract, Amazon Textract, and Google Document AI: a benchmarking experiment
Thomas Hegghammer ()
Additional contact information
Thomas Hegghammer: Norwegian Defence Research Establishment (FFI)
Journal of Computational Social Science, 2022, vol. 5, issue 1, No 36, 882 pages
Abstract:
Abstract Optical Character Recognition (OCR) can open up understudied historical documents to computational analysis, but the accuracy of OCR software varies. This article reports a benchmarking experiment comparing the performance of Tesseract, Amazon Textract, and Google Document AI on images of English and Arabic text. English-language book scans (n = 322) and Arabic-language article scans (n = 100) were replicated 43 times with different types of artificial noise for a corpus of 18,568 documents, generating 51,304 process requests. Document AI delivered the best results, and the server-based processors (Textract and Document AI) performed substantially better than Tesseract, especially on noisy documents. Accuracy for English was considerably higher than for Arabic. Specifying the relative performance of three leading OCR products and the differential effects of commonly found noise types can help scholars identify better OCR solutions for their research needs. The test materials have been preserved in the openly available “Noisy OCR Dataset” (NOD) for reuse in future benchmarking studies.
Keywords: OCR; Cloud computing; Benchmarking (search for similar items in EconPapers)
Date: 2022
References: View references in EconPapers View complete reference list from CitEc
Citations: View citations in EconPapers (3)
Downloads: (external link)
http://link.springer.com/10.1007/s42001-021-00149-1 Abstract (text/html)
Access to the full text of the articles in this series is restricted.
Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.
Export reference: BibTeX
RIS (EndNote, ProCite, RefMan)
HTML/Text
Persistent link: https://EconPapers.repec.org/RePEc:spr:jcsosc:v:5:y:2022:i:1:d:10.1007_s42001-021-00149-1
Ordering information: This journal article can be ordered from
http://www.springer. ... iences/journal/42001
DOI: 10.1007/s42001-021-00149-1
Access Statistics for this article
Journal of Computational Social Science is currently edited by Takashi Kamihigashi
More articles in Journal of Computational Social Science from Springer
Bibliographic data for series maintained by Sonal Shukla () and Springer Nature Abstracting and Indexing ().