Between the Embedding and the Prompt: Systematic Design Effects in LLM-based Occupation Coding
Olga Kononykhina, Anna-Carolina Haensch and Frauke Kreuter
No g6wjy_v1, SocArXiv from Center for Open Science
Abstract:
Assigning free-text job descriptions to standardised taxonomies is a persistent bottleneck in survey research and official statistics. Large language models (LLMs) offer a promising path toward automation, but each step in the pipeline involves both model architecture choices and measurement choices about how an occupation should be represented. Through 119 experiments on German survey data, we systematically vary the textual representation of occupational categories, embedding models, LLMs, and prompt design. Category representation changes retrieval accuracy by 8–23 percentage points and classification accuracy by 11–21 percentage points. The effects of prompt role and abstention behaviour are model-specific and must be validated before deployment. The dominant source of variance, however, sits outside model and measurement choices: how respondents describe their work matters more than any model or design choice (ICC = 0.76).
Date: 2026-05-19
Downloads: https://osf.io/download/6a0a36cfa587aeabd01465ae/
Persistent link: https://EconPapers.repec.org/RePEc:osf:socarx:g6wjy_v1
DOI: 10.31219/osf.io/g6wjy_v1