High Agreement, Different Stories: How LLM Classifiers Reshape Demographic Patterns in Survey Data

Soria, Chris

No 85kyd_v1, SocArXiv from Center for Open Science

Abstract: What we learn from open-ended survey data depends on who—or what—does the coding. Large Language Models (LLMs) promise to democratize qualitative analysis, but do high agreement rates translate into equivalent thematic findings? This study compares eight LLMs to human annotators on a multilabel coding task using 3,200 responses from the UC Berkeley Social Networks Study, comprising over 19,000 coding decisions. Although LLM-human reliability does not match human-human reliability overall, LLMs approach human performance on simpler tasks and can serve as useful additional coders for generating consensus labels. Compared to a gold-standard human consensus, models achieve 82–97% per-category agreement, but macro F1 is lower and response-level similarity is lower still: even the best model reproduces the full human label set for fewer than 60% of responses. Yet high agreement masks thematic divergence. Models systematically over-identify themes, assigning 67% more categories per response, especially for categories requiring greater interpretive judgment. Models also show lower agreement for some demographic groups. These gaps are partly explained by response characteristics such as length, clarity, and atypicality, and some persist after controls, with implications for studies of populations whose response styles diverge from the corpus average. At the sample level, models largely preserve the overall thematic narrative: human and model category rankings correlate strongly (pooled Spearman's ρ=0.75), and top-performing models achieve approximately 80% directional agreement on demographic patterns. Concrete behavioral questions, such as reasons for moving or strategies for making friends, show especially strong alignment. Yet systematic over-classification can still shift narratives about how specific groups behave, leading researchers to report patterns that the human gold standard does not support.

Date: 2026-06-03
New Economics Papers: this item is included in nep-ain and nep-cmp

Downloads: (external link)
https://osf.io/download/6a1f995f75da36c10dffb87e/

Persistent link: https://EconPapers.repec.org/RePEc:osf:socarx:85kyd_v1

DOI: 10.31219/osf.io/85kyd_v1
