A Manager and an AI Walk into a Bar: Does ChatGPT Make Biased Decisions Like We Do?
Yang Chen,
Samuel N. Kirshner,
Anton Ovchinnikov,
Meena Andiappan and
Tracy Jenkin
Additional contact information
Yang Chen: Ivey Business School, Western University, London, Ontario N6G 0N1, Canada
Samuel N. Kirshner: University of New South Wales Business School, University of New South Wales, Sydney, New South Wales 2052, Australia
Anton Ovchinnikov: Smith School of Business, Queen’s University, Kingston, Ontario K7L 3N6, Canada; and INSEAD, 77300 Fontainebleau, France
Meena Andiappan: DeGroote School of Business, McMaster University, Hamilton, Ontario L8S 4M4, Canada; and Institute of Health Policy, Management and Evaluation, University of Toronto, Toronto, Ontario M5T 3M6, Canada
Tracy Jenkin: Smith School of Business, Queen’s University, Kingston, Ontario K7L 3N6, Canada; and Vector Institute, Toronto, Ontario M5G 0C6, Canada
Manufacturing & Service Operations Management, 2025, vol. 27, issue 2, 354-368
Abstract:
Problem definition: Large language models (LLMs) are increasingly being used in business and consumer decision-making processes. Because LLMs learn from human data and feedback, both of which can be biased, determining whether LLMs exhibit human-like behavioral decision biases (e.g., base-rate neglect, risk aversion, and confirmation bias) is crucial before embedding LLMs in decision-making contexts and workflows. To this end, we examine 18 common human biases that are important in operations management (OM) using the dominant LLM, ChatGPT.
Methodology/results: We perform experiments in which GPT-3.5 and GPT-4 act as participants, testing these biases with vignettes adapted from the literature (“standard context”) and with variants reframed in inventory and general OM contexts. In almost half of the experiments, Generative Pre-trained Transformer (GPT) mirrors human biases; it diverges from prototypical human responses in the rest. GPT models are also notably consistent between the standard and OM-specific experiments, as well as across temporal versions of the GPT-3.5 model. Comparing GPT-3.5 and GPT-4 reveals a dual-edged progression of GPT’s decision making: GPT-4 is more accurate on problems with well-defined mathematical solutions but displays stronger behavioral biases on preference-based problems.
Managerial implications: First, managers will obtain the greatest benefit from deploying GPT in workflows that leverage established formulas. Second, GPT’s high response consistency across the standard, inventory, and non-inventory operational contexts offers optimism that LLMs can provide reliable support even when the details of the decision and problem context change. Third, although choosing between models such as GPT-3.5 and GPT-4 involves a cost-performance trade-off, our results suggest that managers should invest in higher-performing models, particularly for problems with objective solutions.
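To illustrate the style of experiment the abstract describes, below is a minimal sketch of posing a bias vignette (here, the classic cab problem used to elicit base-rate neglect) to a GPT model via the OpenAI Python client. The prompt wording, model name, and sampling settings are illustrative assumptions, not the authors' actual materials or protocol.

# Illustrative sketch only: poses a base-rate-neglect vignette to a GPT model,
# mimicking the "LLM as experiment participant" setup described in the abstract.
# The prompt text, model choice, and temperature are assumptions for demonstration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

VIGNETTE = (
    "A cab was involved in a hit-and-run accident at night. 85% of cabs in "
    "the city are Green and 15% are Blue. A witness identified the cab as "
    "Blue; the witness is correct 80% of the time. What is the probability "
    "that the cab involved was Blue? Answer with a single percentage."
)

response = client.chat.completions.create(
    model="gpt-4",      # the paper compares GPT-3.5 and GPT-4
    temperature=1.0,    # assumed; repeated sampling would gauge response consistency
    messages=[{"role": "user", "content": VIGNETTE}],
)

print(response.choices[0].message.content)
# Bayes' rule gives roughly 41%; answers near 80% would indicate base-rate neglect.

Rerunning such a prompt many times, and with the vignette reframed in an inventory or general OM context, is one way to probe the bias and consistency questions the paper studies.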
Keywords: large language models; decision biases; ChatGPT; behavioral operations management
Date: 2025
Downloads: http://dx.doi.org/10.1287/msom.2023.0279 (application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:inm:ormsom:v:27:y:2025:i:2:p:354-368
More articles in Manufacturing & Service Operations Management from INFORMS.
Bibliographic data for series maintained by Chris Asher.