From Job Titles to ISCO Codes: Enhancing Occupational Classification With RAG-based LLMs
Ruben L. Bach,
Christopher Klamm,
Stefanie Heyne,
Irena Kogan,
Olga Kononykhina and
Jana Jarck
No ge56f_v1, SocArXiv from Center for Open Science
Abstract:
Accurate occupational classification from open-ended survey responses is vital for research in sociology, economics, and political science, yet manual coding remains resource-intensive and difficult to scale. We propose a novel pipeline that uses large language models (LLMs) with retrieval-augmented generation (RAG) to automate the assignment of International Standard Classification of Occupations (ISCO) codes. Drawing on survey data from a sample of recently arrived Afghan and Syrian refugees in Germany, we preprocess noisy occupational descriptions using LLMs and apply vector-based similarity search to retrieve candidate ISCO codes. The final classification is selected by LLMs, constrained to the retrieved candidates and accompanied by interpretable justifications. We evaluate the system’s performance against expert-coded labels, demonstrating high agreement and robustness across languages. Our findings suggest that RAG-powered LLMs can substantially improve the accuracy, scalability, and accessibility of occupational classification, with particular benefits for multilingual and resource-constrained research settings. In addition, we describe a prototypical pipeline that other researchers can readily adapt for applying LLMs to similar classification tasks, facilitating transparency, reproducibility, and broader adoption.
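The three stages named in the abstract (preprocessing, candidate retrieval by vector similarity, and a final choice constrained to the retrieved candidates) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the character-trigram "embedding" is a toy stand-in for a real embedding model, `normalize` stands in for the LLM preprocessing step, and `select_code` only shows the constraint check that would wrap an actual LLM call. All function names are hypothetical, and the dictionary holds only a small illustrative subset of ISCO-08 unit groups.

```python
import math
import re
from collections import Counter

# Small illustrative subset of ISCO-08 unit groups (code -> title).
ISCO_TITLES = {
    "2221": "Nursing professionals",
    "2341": "Primary school teachers",
    "5120": "Cooks",
    "7112": "Bricklayers and related workers",
    "7231": "Motor vehicle mechanics and repairers",
    "8322": "Car, taxi and van drivers",
}


def normalize(text):
    """Stand-in for LLM preprocessing: lowercase and keep letters only."""
    return " ".join(re.findall(r"[a-z]+", text.lower()))


def embed(text):
    """Toy embedding (assumption): counts of character trigrams."""
    padded = f" {normalize(text)} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(description, k=3):
    """Return the k ISCO codes whose titles are most similar to the input."""
    query = embed(description)
    scored = [(cosine(query, embed(title)), code)
              for code, title in ISCO_TITLES.items()]
    scored.sort(reverse=True)
    return [code for _, code in scored[:k]]


def select_code(candidates, llm_choice):
    """Accept the model's pick only if it is one of the retrieved candidates.

    `llm_choice` would come from an LLM prompted with the candidate list;
    falling back to the top-ranked candidate is an assumption of this sketch.
    """
    return llm_choice if llm_choice in candidates else candidates[0]
```

For example, `retrieve("Taxi driver!!")` ranks code 8322 first, and `select_code` then rejects any answer that falls outside the retrieved list. Restricting the final choice to the retrieved candidates keeps the model from returning codes that do not exist in the classification.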
Date: 2025-09-24
New Economics Papers: this item is included in nep-ain, nep-cmp and nep-inv
Downloads: https://osf.io/download/68d3baf62b733bc779e8cfee/
Persistent link: https://EconPapers.repec.org/RePEc:osf:socarx:ge56f_v1
DOI: 10.31219/osf.io/ge56f_v1