Between the Embedding and the Prompt: Systematic Design Effects in LLM-based Occupation Coding

Kononykhina, Olga; Haensch, Anna-Carolina; Kreuter, Frauke

No g6wjy_v1, SocArXiv from Center for Open Science

Abstract: Assigning free-text job descriptions to standardised taxonomies is a persistent bottleneck in survey research and official statistics. Large language models (LLMs) offer a promising path toward automation, but each step in the pipeline involves both model architecture choices and measurement choices about how an occupation should be represented. Through 119 experiments on German survey data, we systematically vary the textual representation of occupational categories, embedding models, LLMs, and prompt design. Category representation changes retrieval accuracy by 8–23 percentage points and classification accuracy by 11–21 percentage points. Prompt role and abstention behaviour are model-specific and must be validated before deployment. The dominant source of variance, however, sits outside model and measurement choices: how respondents describe their work matters more than any model or design choice (ICC = 0.76).

Date: 2026-05-19
New Economics Papers: this item is included in nep-ain and nep-exp

Downloads: (external link)
https://osf.io/download/6a0a36cfa587aeabd01465ae/

Persistent link: https://EconPapers.repec.org/RePEc:osf:socarx:g6wjy_v1

DOI: 10.31219/osf.io/g6wjy_v1
