Extracting O*NET Features from the NLx Corpus to Build Public Use Aggregate Labor Market Data
Stephen Meisenbacher, Svetlozar Nestorov and Peter Norlander
MPRA Paper from University Library of Munich, Germany
Abstract:
Data from online job postings are difficult to access and are not built in a standard or transparent manner. Data included in the standard taxonomy and occupational information database (O*NET) are updated infrequently and based on small survey samples. We adopt O*NET as a framework for building natural language processing tools that extract structured information from job postings. We publish the Job Ad Analysis Toolkit (JAAT), a collection of open-source tools built for this purpose, and demonstrate its reliability and accuracy in out-of-sample and LLM-as-judge testing. We extract more than 10 billion data points from more than 155 million online job ads provided by the National Labor Exchange (NLx) Research Hub, including O*NET tasks, occupation codes, tools, and technologies, as well as wages, skills, industry, and other features. We describe the construction of a dataset of occupation-, state-, and industry-level features aggregated by monthly active jobs from 2015 to 2025. We illustrate the potential for research and future uses in education and workforce development.
Keywords: Labor Market Information; Online Job Vacancies; NLP methods; ML; data transparency
JEL-codes: J23 J24 J63
Date: 2025-10-01
New Economics Papers: this item is included in nep-inv and nep-lma
Downloads: (external link)
https://mpra.ub.uni-muenchen.de/126336/1/MPRA_paper_126336.pdf original version (application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:pra:mprapa:126336