Performance of Large Language Models on Family Medicine Licensing Exams
Mahmoud Omar, Kareem Hijazi, Girish N Nadkarni, Eyal Klang
Keywords: large language models (LLMs), artificial intelligence (AI), primary care, licensing exams, prompt engineering.
Background and aim:
Large language models (LLMs) have shown promise on specialized medical exams but remain less explored in family medicine and primary care. This study evaluated eight state-of-the-art LLMs on the official Israeli primary care licensing exam, focusing on prompt design and explanation quality.
Methods:
Two hundred multiple-choice questions were tested using simple prompts and few-shot Chain-of-Thought prompts (prompts that include worked examples illustrating step-by-step reasoning). Performance differences were assessed with Cochran's Q and pairwise McNemar tests. A stress test of the top performer (OpenAI's o1-preview) examined 30 selected questions, with two physicians scoring explanations for accuracy, logic, and hallucinations (extra or fabricated information not supported by the question).
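The abstract does not reproduce the study's prompts or analysis code; the sketch below is only a hypothetical illustration of the two ingredients named above, a few-shot Chain-of-Thought prompt template and the Cochran's Q / pairwise McNemar tests. The prompt wording, data, and variable names are invented; the statistical functions are from statsmodels.

```python
# Hypothetical sketch: a few-shot Chain-of-Thought prompt template and the
# reported significance tests. Prompt text and data are placeholders, not
# the study's materials.
import numpy as np
from statsmodels.stats.contingency_tables import cochrans_q, mcnemar

FEW_SHOT_COT_TEMPLATE = """\
Example question: A 58-year-old man with type 2 diabetes asks about ...
Reasoning: First consider ..., then rule out ... Answer: B

Answer the next question. Think step by step, then give one letter.
Question: {question}
Options: {options}
"""

# correct[i, j] = 1 if model j answered question i correctly (fabricated)
rng = np.random.default_rng(0)
correct = rng.integers(0, 2, size=(200, 8))

# Cochran's Q: do the eight models differ across the same 200 questions?
print(cochrans_q(correct))

# Pairwise McNemar for two models: 2x2 table of paired (dis)agreements
a, b = correct[:, 0].astype(bool), correct[:, 1].astype(bool)
table = [[int(np.sum(a & b)), int(np.sum(a & ~b))],
         [int(np.sum(~a & b)), int(np.sum(~a & ~b))]]
print(mcnemar(table, exact=True))
```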
Results:
Five models exceeded the 65% passing threshold under simple prompts; seven did so with few-shot prompts. o1-preview reached 85.5%. In the stress test, explanations were generally coherent and accurate, with 5 of 120 flagged for hallucinations. Inter-rater agreement on explanation scoring was high (weighted kappa 0.773; ICC 0.776).
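As a point of reference for the agreement statistics, here is a minimal sketch (with fabricated scores, not study data) of how a weighted kappa and an ICC can be computed: `cohen_kappa_score` is from scikit-learn and `intraclass_corr` from pingouin, and the linear weighting scheme is an assumption, since the abstract does not state which weighting was used.

```python
# Minimal sketch of inter-rater agreement on ordinal explanation scores.
# The ratings below are fabricated placeholders, not study data.
import pandas as pd
from sklearn.metrics import cohen_kappa_score
from pingouin import intraclass_corr

rater1 = [5, 4, 5, 3, 4, 5, 2, 5, 4, 3]
rater2 = [5, 4, 4, 3, 5, 5, 2, 4, 4, 3]

# Weighted kappa penalizes larger disagreements on the ordinal scale
print(cohen_kappa_score(rater1, rater2, weights="linear"))

# ICC via pingouin expects a long-format table of (item, rater, score)
df = pd.DataFrame({
    "item": list(range(10)) * 2,
    "rater": ["r1"] * 10 + ["r2"] * 10,
    "score": rater1 + rater2,
})
print(intraclass_corr(data=df, targets="item", raters="rater", ratings="score"))
```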
Conclusions:
Most tested models performed well on an official family medicine exam, especially with structured prompts. Nonetheless, multiple-choice formats cannot assess broader clinical competencies such as physical examination and patient rapport. Future efforts should refine these models to eliminate hallucinations, test for socio-demographic biases, and ensure alignment with real-world demands.
Relevance to family medicine:
LLMs can meet passing standards on a family medicine licensing exam and provide coherent explanations, suggesting potential as clinical decision-support tools for family physicians, while highlighting the need to manage hallucinations and biases.

