Large language models (LLMs) have demonstrated impressive capabilities, but the
bar for clinical applications is high. Attempts to assess the clinical knowledge of
models typically rely on automated evaluations based on limited benchmarks. Here,
to address these limitations, we present MultiMedQA, a benchmark combining six
existing medical question answering datasets spanning professional medicine,
research and consumer queries, and a new dataset of medical questions searched
online, HealthSearchQA. We propose a human evaluation framework for model
answers along multiple axes including factuality, comprehension, reasoning, possible
harm and bias. In addition, we evaluate Pathways Language Model¹ (PaLM, a 540-billion-parameter LLM) and its instruction-tuned variant, Flan-PaLM², on MultiMedQA. Using
a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy
on every MultiMedQA multiple-choice dataset (MedQA³, MedMCQA⁴, PubMedQA⁵ and Measuring Massive Multitask Language Understanding (MMLU) clinical topics⁶), including 67.6% accuracy on MedQA (US Medical Licensing Exam-style questions),
surpassing the prior state of the art by more than 17%. However, human evaluation reveals key gaps. To address these gaps, we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly but remains inferior to clinicians.
We show that comprehension, knowledge recall and reasoning improve with model
scale and instruction prompt tuning, suggesting the potential utility of LLMs in
medicine. Our human evaluations reveal limitations of today’s models, reinforcing
the importance of both evaluation frameworks and method development in creating
safe, helpful LLMs for clinical applications.
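Instruction prompt tuning, as described above, keeps the pretrained model frozen and learns only a small set of soft prompt embeddings from a handful of exemplars. The sketch below illustrates the general idea in PyTorch on a toy frozen model; the names here (TinyLM, soft_prompt, N_PROMPT, step) are illustrative assumptions, not the paper's actual implementation or the PaLM architecture.

```python
import torch
import torch.nn as nn

# Toy stand-in for a frozen pretrained LM (no attention; just embed -> body -> vocab head).
# All sizes and names are illustrative, not taken from the paper.
VOCAB, D_MODEL, N_PROMPT = 100, 32, 5

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        self.body = nn.Linear(D_MODEL, D_MODEL)
        self.head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, inputs_embeds):
        return self.head(torch.tanh(self.body(inputs_embeds)))

lm = TinyLM()
for p in lm.parameters():        # freeze every LM weight; only the prompt is trained
    p.requires_grad_(False)

# The learnable "instruction prompt": N_PROMPT soft token embeddings prepended to inputs.
soft_prompt = nn.Parameter(torch.randn(N_PROMPT, D_MODEL) * 0.02)
opt = torch.optim.Adam([soft_prompt], lr=1e-3)

def step(input_ids, target_ids):
    tok = lm.embed(input_ids)                               # (B, T, D)
    prompt = soft_prompt.unsqueeze(0).expand(tok.size(0), -1, -1)
    logits = lm(torch.cat([prompt, tok], dim=1))            # (B, N_PROMPT+T, VOCAB)
    # Score only the positions that correspond to the real input tokens.
    loss = nn.functional.cross_entropy(
        logits[:, N_PROMPT:].reshape(-1, VOCAB), target_ids.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# A handful of random (input, target) pairs stands in for clinician-written exemplars.
x = torch.randint(0, VOCAB, (4, 10))
y = torch.randint(0, VOCAB, (4, 10))
for _ in range(3):
    print(step(x, y))
```

Because gradients reach only the soft prompt, the number of trained parameters is N_PROMPT × D_MODEL rather than the full model size, which is what makes the approach parameter-efficient when scaled to a model like PaLM.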