A large language model framework for sample-free population synthesis
Michael Jones, Richard Dawson and Jon Mills
PLOS ONE, 2026, vol. 21, issue 6, 1-1
Abstract:
Synthetic populations provide the demographic foundations for agent-based models in transport, public health, disaster management and other sectors, enabling credible representations of individual characteristics and behaviours. Many established synthesis methods rely on census microdata; however, such data are infrequently collected, privacy-restricted, and usually available only as small public-use samples at coarse geographic scales. This paper introduces a sample-free framework that uses a large language model (LLM) to generate complete, household-structured populations directly from aggregate demographic data. The framework is LLM-agnostic and follows a multi-step process: objective definition, input preparation, LLM selection, and synthetic household generation. No model fine-tuning is required, meaning that data requirements are low and the framework is easily accessible. Population synthesis is formulated as an iterative prompting process in which an LLM generates households guided by the discrepancies between synthetic and target distributions. The model draws on prior knowledge encoded during pre-training to propose plausible attribute combinations, resulting in both statistical alignment and structural feasibility. In a global evaluation covering 109 countries, the framework achieved very close alignment on simpler marginals such as gender (SRMSE: 0.003) and household size (SRMSE: 0.026), while more structurally complex attributes such as household composition (SRMSE: 0.062) and age (SRMSE: 0.128) were also reproduced with good accuracy. These results were supported by detailed case studies in Newcastle upon Tyne (UK) and Dar es Salaam (Tanzania). The principal contribution of the framework is to enable the construction of coherent household-structured populations when detailed microdata are unavailable, expanding the applicability of agent-based modelling in data-constrained settings.
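The discrepancy-guided loop and the SRMSE figures quoted in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the abstract does not give the SRMSE formula, so the sketch uses one common definition (root mean squared error between category shares, divided by the mean target share); the names `srmse`, `share_gaps` and `synthesise` are hypothetical; `generate_batch` is a placeholder for an LLM call; and each household is reduced to a single categorical attribute.

```python
import math
from collections import Counter


def _shares(counts, categories):
    """Convert counts to shares over a fixed category list (all zeros if empty)."""
    total = sum(counts.values())
    return [counts.get(c, 0) / total if total else 0.0 for c in categories]


def srmse(synthetic_counts, target_counts):
    """Assumed SRMSE definition: RMSE of category shares / mean target share."""
    categories = list(dict.fromkeys([*target_counts, *synthetic_counts]))
    syn = _shares(synthetic_counts, categories)
    tgt = _shares(target_counts, categories)
    n = len(categories)
    rmse = math.sqrt(sum((s - t) ** 2 for s, t in zip(syn, tgt)) / n)
    return rmse / (sum(tgt) / n)


def share_gaps(synthetic_counts, target_counts):
    """Target share minus synthetic share; positive means under-represented."""
    categories = list(dict.fromkeys([*target_counts, *synthetic_counts]))
    syn = _shares(synthetic_counts, categories)
    tgt = _shares(target_counts, categories)
    return {c: t - s for c, s, t in zip(categories, syn, tgt)}


def synthesise(target_counts, generate_batch, n_households, batch_size=10):
    """Iteratively request households until n_households have been generated.

    generate_batch(gaps, k) is a hypothetical stand-in for an LLM call: it
    receives the current discrepancies and returns k category labels.
    """
    synthetic = Counter()
    while sum(synthetic.values()) < n_households:
        gaps = share_gaps(synthetic, target_counts)
        k = min(batch_size, n_households - sum(synthetic.values()))
        synthetic.update(generate_batch(gaps, k))
    return synthetic
```

In a real setting, `generate_batch` would render the gaps into a prompt and parse the model's reply into structured households; a deterministic stub that always returns the most under-represented category is enough to exercise the loop and check that the SRMSE falls as households accumulate.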
Date: 2026
Downloads: (external link)
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0341704 (text/html)
https://journals.plos.org/plosone/article/file?id= ... 41704&type=printable (application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:plo:pone00:0341704
DOI: 10.1371/journal.pone.0341704