Correctness Coverage Evaluation for Medical Multiple-Choice Question Answering Based on the Enhanced Conformal Prediction Framework
Yusong Ke,
Hongru Lin,
Yuting Ruan,
Junya Tang and
Li Li
Additional contact information
Yusong Ke: School of Electronic and Information Engineering, Tongji University, Shanghai 201804, China
Hongru Lin: School of Electrical and Information Engineering, Hunan University, Changsha 410082, China
Yuting Ruan: School of Economics and Management, Fuzhou University, Fuzhou 350100, China
Junya Tang: School of Computer Science and Technology, Tongji University, Shanghai 200070, China
Li Li: School of Electronic and Information Engineering, Tongji University, Shanghai 201804, China
Mathematics, 2025, vol. 13, issue 9, 1-17
Abstract:
Large language models (LLMs) are increasingly adopted in medical question answering (QA) scenarios. However, LLMs have been shown to generate hallucinations and nonfactual content, undermining their trustworthiness in high-stakes medical tasks. Conformal prediction (CP) is a robust machine-learning framework that offers statistically rigorous guarantees of marginal (average) coverage for prediction sets, but its applicability to medical QA remains underexplored. To address this gap, this study proposes an enhanced CP framework for medical multiple-choice question answering (MCQA) tasks. The enhanced framework ties the nonconformity score to the frequency score of the correct option: leveraging self-consistency theory, it generates multiple outputs for the same medical query and computes a frequency score for each option, thereby avoiding the need for access to the model’s internal information. Furthermore, a risk control framework is incorporated into the enhanced CP framework to manage task-specific metrics through a monotonically decreasing loss function. The enhanced CP framework is evaluated on three popular MCQA datasets using off-the-shelf LLMs. Empirical results demonstrate that the framework achieves user-specified average (or marginal) error rates on the test set, and that the test set’s average prediction set size (APSS) decreases as the risk level increases, suggesting that APSS is a promising metric for evaluating the uncertainty of LLMs.
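The frequency-based procedure described in the abstract can be sketched as standard split conformal prediction over sampled-answer frequencies. The sketch below is a minimal illustration, not the authors' code: the calibration frequencies, the error rate alpha = 0.2, the option labels, and the helper names are all invented for demonstration.

```python
import math

def nonconformity(freq: float) -> float:
    # Score is low when the option appears often among the model's sampled
    # answers (high self-consistency frequency), high when it rarely appears.
    return 1.0 - freq

def conformal_quantile(cal_scores, alpha):
    # Finite-sample-corrected (1 - alpha) empirical quantile of the
    # calibration scores, as in standard split conformal prediction.
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1.0 - alpha))
    return sorted(cal_scores)[min(k, n) - 1]

def prediction_set(option_freqs, q_hat):
    # Keep every option whose nonconformity score does not exceed q_hat.
    return {opt for opt, f in option_freqs.items() if nonconformity(f) <= q_hat}

# Toy calibration data: frequency of the *correct* option among the sampled
# answers for each calibration question (values are illustrative).
cal_freqs = [0.9, 0.8, 0.75, 0.6, 0.95, 0.5, 0.85, 0.7, 0.65, 0.9]
cal_scores = [nonconformity(f) for f in cal_freqs]

alpha = 0.2  # user-specified marginal error rate
q_hat = conformal_quantile(cal_scores, alpha)

# New question: per-option frequencies estimated from repeated sampling.
test_freqs = {"A": 0.7, "B": 0.2, "C": 0.08, "D": 0.02}
print(prediction_set(test_freqs, q_hat))
```

Under exchangeability of calibration and test questions, sets built this way contain the correct option with probability at least 1 - alpha on average; larger alpha yields a higher permitted error rate and hence smaller sets, matching the APSS behavior reported in the abstract.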
Keywords: large language models; conformal prediction; medical multiple-choice question answering; average prediction set size
JEL-codes: C
Date: 2025
Downloads:
https://www.mdpi.com/2227-7390/13/9/1538/pdf (application/pdf)
https://www.mdpi.com/2227-7390/13/9/1538/ (text/html)
Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:13:y:2025:i:9:p:1538-:d:1650879
Mathematics is currently edited by Ms. Emma He
More articles in Mathematics from MDPI