Benchmarking Large Language Models on CMExam - A Comprehensive Chinese Medical Exam Dataset

  • Junling Liu
  • Peilin Zhou
  • Yining Hua
  • Dading Chong
  • Zhongyu Tian
  • Andrew Liu
  • Helin Wang
  • Chenyu You
  • Zhenhua Guo
  • Lei Zhu
  • Michael Lingzhi Li

  • Alibaba Group Holding Ltd.
  • Hong Kong University of Science and Technology
  • Harvard University
  • Boston Children's Hospital
  • Peking University
  • Second Affiliated Hospital of Zhejiang University School of Medicine
  • Johns Hopkins University
  • Tianyi Traffic Technology
  • Ant Group

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

59 Scopus citations

Abstract

Recent advancements in large language models (LLMs) have transformed the field of question answering (QA). However, evaluating LLMs in the medical field is challenging due to the lack of standardized and comprehensive datasets. To address this gap, we introduce CMExam, sourced from the Chinese National Medical Licensing Examination. CMExam consists of 60K+ multiple-choice questions for standardized and objective evaluations, as well as solution explanations for model reasoning evaluation in an open-ended manner. For in-depth analyses of LLMs, we invited medical professionals to label five additional question-wise annotations, including disease groups, clinical departments, medical disciplines, areas of competency, and question difficulty levels. Alongside the dataset, we further conducted thorough experiments with representative LLMs and QA algorithms on CMExam. The results show that GPT-4 had the best accuracy of 61.6% and a weighted F1 score of 0.617. These results highlight a great disparity when compared to human accuracy, which stood at 71.6%. For explanation tasks, while LLMs could generate relevant reasoning and demonstrate improved performance after finetuning, they fall short of a desired standard, indicating ample room for improvement. To the best of our knowledge, CMExam is the first Chinese medical exam dataset to provide comprehensive medical annotations. The experiments and findings of LLM evaluation also provide valuable insights into the challenges and potential solutions in developing Chinese medical QA systems and LLM evaluation pipelines.

Original language: English
Title of host publication: Advances in Neural Information Processing Systems 36 - 37th Conference on Neural Information Processing Systems, NeurIPS 2023
Editors: A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, S. Levine
Publisher: Neural Information Processing Systems Foundation
ISBN (Electronic): 9781713899921
State: Published - 2023
Event: 37th Conference on Neural Information Processing Systems, NeurIPS 2023 - New Orleans, United States
Duration: Dec 10 2023 to Dec 16 2023

Publication series

Name: Advances in Neural Information Processing Systems
Volume: 36
ISSN (Print): 1049-5258

Conference

Conference: 37th Conference on Neural Information Processing Systems, NeurIPS 2023
Country/Territory: United States
City: New Orleans
Period: 12/10/23 to 12/16/23