Large language models encode clinical knowledge
Karan Singhal,
Shekoofeh Azizi,
Tao Tu,
S. Sara Mahdavi,
Jason Wei,
Hyung Won Chung,
Nathan Scales,
Ajay Tanwani,
Heather Cole-Lewis,
Stephen Pfohl,
Perry Payne,
Martin Seneviratne,
Paul Gamble,
Chris Kelly,
Abubakr Babiker,
Nathanael Schärli,
Aakanksha Chowdhery,
Philip Mansfield,
Dina Demner-Fushman,
Blaise Agüera y Arcas,
Dale Webster,
Greg S. Corrado,
Yossi Matias,
Katherine Chou,
Juraj Gottweis,
Nenad Tomasev,
Yun Liu,
Alvin Rajkomar,
Joelle Barral,
Christopher Semturs,
Alan Karthikesalingam and
Vivek Natarajan
Author affiliations
Karan Singhal: Google Research
Shekoofeh Azizi: Google Research
Tao Tu: Google Research
S. Sara Mahdavi: Google Research
Jason Wei: Google Research
Hyung Won Chung: Google Research
Nathan Scales: Google Research
Ajay Tanwani: Google Research
Heather Cole-Lewis: Google Research
Stephen Pfohl: Google Research
Perry Payne: Google Research
Martin Seneviratne: Google Research
Paul Gamble: Google Research
Chris Kelly: Google Research
Abubakr Babiker: Google Research
Nathanael Schärli: Google Research
Aakanksha Chowdhery: Google Research
Philip Mansfield: Google Research
Dina Demner-Fushman: National Library of Medicine
Blaise Agüera y Arcas: Google Research
Dale Webster: Google Research
Greg S. Corrado: Google Research
Yossi Matias: Google Research
Katherine Chou: Google Research
Juraj Gottweis: Google Research
Nenad Tomasev: DeepMind
Yun Liu: Google Research
Alvin Rajkomar: Google Research
Joelle Barral: Google Research
Christopher Semturs: Google Research
Alan Karthikesalingam: Google Research
Vivek Natarajan: Google Research
Nature, 2023, vol. 620, issue 7972, 172-180
Abstract:
Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a benchmark combining six existing medical question answering datasets spanning professional medicine, research and consumer queries and a new dataset of medical questions searched online, HealthSearchQA. We propose a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias. In addition, we evaluate the Pathways Language Model [1] (PaLM, a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM [2], on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA [3], MedMCQA [4], PubMedQA [5] and Measuring Massive Multitask Language Understanding (MMLU) clinical topics [6]), including 67.6% accuracy on MedQA (US Medical Licensing Exam-style questions), surpassing the prior state of the art by more than 17%. However, human evaluation reveals key gaps. To resolve this, we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, knowledge recall and reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal limitations of today's models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLMs for clinical applications.
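The abstract's key methodological contribution, instruction prompt tuning, builds on soft prompt tuning: a small set of learned prompt embeddings is prepended to the model's input while all of the pretrained LLM weights stay frozen, which is what makes the approach parameter-efficient and trainable from only a few clinician-curated exemplars. The sketch below illustrates that general idea only; it is not the paper's PaLM-based implementation, and the class name, the toy stand-in "LM" and all parameter choices are hypothetical placeholders chosen to keep the example self-contained and runnable.

```python
# Minimal conceptual sketch of soft prompt tuning (the family of
# parameter-efficient methods that instruction prompt tuning extends).
# NOT the paper's implementation: Med-PaLM was built on the 540B PaLM model;
# here a toy linear "LM" stands in so the example runs end to end.
import torch
import torch.nn as nn

class SoftPromptedLM(nn.Module):
    def __init__(self, base_lm: nn.Module, embed: nn.Embedding, num_prompt_tokens: int = 20):
        super().__init__()
        self.base_lm = base_lm  # frozen pretrained LM body (stand-in here)
        self.embed = embed      # frozen token-embedding table
        d_model = embed.embedding_dim
        # Only these prompt vectors receive gradients; everything else is frozen.
        self.soft_prompt = nn.Parameter(torch.randn(num_prompt_tokens, d_model) * 0.02)
        for p in self.base_lm.parameters():
            p.requires_grad = False
        for p in self.embed.parameters():
            p.requires_grad = False

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # Prepend the learned prompt embeddings to the frozen token embeddings.
        tok = self.embed(input_ids)                                  # (B, T, D)
        prompt = self.soft_prompt.unsqueeze(0).expand(tok.size(0), -1, -1)
        return self.base_lm(torch.cat([prompt, tok], dim=1))        # (B, P+T, ...)

# Toy usage: a stand-in "LM" that projects embeddings back to vocabulary logits.
vocab, d = 1000, 64
model = SoftPromptedLM(nn.Linear(d, vocab), nn.Embedding(vocab, d), num_prompt_tokens=4)
logits = model(torch.randint(0, vocab, (2, 10)))  # shape (2, 14, 1000)
```

Because only the soft prompt is optimized, a single frozen base model can be adapted to a new domain with a tiny number of trainable parameters; in the paper this adaptation used a handful of exemplars and instructions from clinicians to produce Med-PaLM from Flan-PaLM.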
Date: 2023
Citations: 10 (as indexed in EconPapers)
Downloads: https://www.nature.com/articles/s41586-023-06291-2 (abstract, text/html)
Access to the full text of the articles in this series is restricted.
Persistent link: https://EconPapers.repec.org/RePEc:nat:nature:v:620:y:2023:i:7972:d:10.1038_s41586-023-06291-2
Ordering information: This journal article can be ordered from https://www.nature.com/
DOI: 10.1038/s41586-023-06291-2
Nature is currently edited by Magdalena Skipper
Bibliographic data for this series are maintained by Sonal Shukla and Springer Nature Abstracting and Indexing.