Database Report: Twin-2K-500: A Data Set for Building Digital Twins of over 2,000 People Based on Their Answers to over 500 Questions
Olivier Toubia,
George Z. Gui,
Tianyi Peng,
Daniel J. Merlau,
Ang Li and
Haozhe Chen
Additional contact information
Olivier Toubia: Marketing Division, Columbia Business School, Columbia University, New York, New York 10027
George Z. Gui: Marketing Division, Columbia Business School, Columbia University, New York, New York 10027
Tianyi Peng: Decision, Risk & Operations Division, Columbia Business School, Columbia University, New York, New York 10027
Daniel J. Merlau: Marketing Division, Columbia Business School, Columbia University, New York, New York 10027
Ang Li: Department of Computer Science, Columbia University, New York, New York 10025
Haozhe Chen: Department of Computer Science, Columbia University, New York, New York 10025
Marketing Science, 2025, vol. 44, issue 6, 1446-1455
Abstract:
Large language model (LLM)-based digital twin simulation, where LLMs are used to emulate individual human behavior, holds great promise for research in business, artificial intelligence, social science, and digital experimentation. However, progress in this area has been hindered by the scarcity of real individual-level data sets that are both large and publicly available. To address this gap, we introduce a large-scale public data set designed to capture a rich and holistic view of individual human behavior. We survey a representative sample of N = 2,058 participants (average 2.42 hours per person) in the United States across four waves with more than 500 questions in total, covering a comprehensive battery of demographic, psychological, economic, personality, and cognitive measures, as well as replications of behavioral economics experiments and a pricing survey. The final wave repeats tasks from earlier waves to establish a test-retest accuracy baseline. Initial analyses suggest the data are of high quality and show promise for constructing digital twins that predict human behavior well at the individual and aggregate levels. Beyond LLM applications, due to its unique breadth and scale, the data set also enables broad social science and business research, including studies of cross-construct correlations and heterogeneous treatment effects.
Keywords: generative AI; computational social science; digital twins; LLM-based persona simulation
Date: 2025
Download: http://dx.doi.org/10.1287/mksc.2025.0262 (application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:inm:ormksc:v:44:y:2025:i:6:p:1446-1455