Creating Data from Unstructured Text with Context Rule Assisted Machine Learning (CRAML)
Stephen Meisenbacher and Peter Norlander
No 1214, GLO Discussion Paper Series from Global Labor Organization (GLO)
Abstract:
Popular approaches to building data from unstructured text come with limitations in scalability, interpretability, replicability, and real-world applicability. These can be overcome with Context Rule Assisted Machine Learning (CRAML), a method and no-code suite of software tools that builds structured, labeled datasets that are accurate and reproducible. CRAML enables domain experts to access uncommon constructs within a document corpus in a low-resource, transparent, and flexible manner. CRAML produces document-level datasets for quantitative research and makes qualitative classification schemes scalable over large volumes of text. We demonstrate that the method is useful for bibliographic analysis, transparent analysis of proprietary data, and expert classification of any documents with any scheme. To demonstrate this process for building data from text with machine learning, we publish open-source resources: the software, a new public document corpus, and a replicable analysis to build an interpretable classifier of suspected "no poach" clauses in franchise documents.
Keywords: machine learning; natural language processing; text classification; big data
JEL-codes: B41 C38 C81 C88 J08 J41 J42 J47 J53 Z13
Date: 2022
New Economics Papers: this item is included in nep-big and nep-cmp
Downloads: (external link)
https://www.econstor.eu/bitstream/10419/267553/1/GLO-DP-1214.pdf (application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:zbw:glodps:1214
Bibliographic data for series maintained by ZBW - Leibniz Information Centre for Economics.