Multi-Layout Invoice Document Dataset (MIDD): A Dataset for Named Entity Recognition
Dipali Baviskar,
Swati Ahirrao and
Ketan Kotecha
Additional contact information
Dipali Baviskar: Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune 412115, India
Swati Ahirrao: Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune 412115, India
Ketan Kotecha: Symbiosis Centre for Applied Artificial Intelligence, Symbiosis International (Deemed University), Pune 412115, India
Data, 2021, vol. 6, issue 7, 1-10
Abstract:
The day-to-day working of an organization produces a massive volume of unstructured data in the form of invoices, legal contracts, mortgage processing forms, and many more. Organizations can utilize the insights concealed in such unstructured documents for their operational benefit. However, analyzing and extracting insights from such numerous and complex unstructured documents is a tedious task. Hence, the research in this area is encouraging the development of novel frameworks and tools that can automate the key information extraction from unstructured documents. However, the availability of standard, best-quality, and annotated unstructured document datasets is a serious challenge for accomplishing the goal of extracting key information from unstructured documents. This work expedites the researcher’s task by providing a high-quality, highly diverse, multi-layout, and annotated invoice documents dataset for extracting key information from unstructured documents. Researchers can use the proposed dataset for layout-independent unstructured invoice document processing and to develop an artificial intelligence (AI)-based tool to identify and extract named entities in the invoice documents. Our dataset includes 630 invoice document PDFs with four different layouts collected from diverse suppliers. As far as we know, our invoice dataset is the only openly available dataset comprising high-quality, highly diverse, multi-layout, and annotated invoice documents.
Keywords: artificial intelligence (AI); information extraction; Named Entity Recognition (NER); unstructured data (search for similar items in EconPapers)
JEL-codes: C8 C80 C81 C82 C83 (search for similar items in EconPapers)
Date: 2021
References: View complete reference list from CitEc
Citations:
Downloads: (external link)
https://www.mdpi.com/2306-5729/6/7/78/pdf (application/pdf)
https://www.mdpi.com/2306-5729/6/7/78/ (text/html)
Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.
Export reference: BibTeX
RIS (EndNote, ProCite, RefMan)
HTML/Text
Persistent link: https://EconPapers.repec.org/RePEc:gam:jdataj:v:6:y:2021:i:7:p:78-:d:597539
Access Statistics for this article
Data is currently edited by Ms. Cecilia Yang
More articles in Data from MDPI
Bibliographic data for series maintained by MDPI Indexing Manager ().