Testing MLLMs in the Medical Domain
Published on arXiv, 2024
This paper identifies two key gaps in existing medical benchmark research. In text-based medical question answering (QA), open-ended QA (OpenQA) has not been thoroughly explored, and current evaluations rely heavily on machine-translation metrics. In medical visual QA (VQA), prior studies have focused predominantly on knowledge-based QA or pattern recognition, often derived from literature or textbooks, and lack benchmarks that account for diagnosis grounded in a patient's background. To address these gaps, we propose PatientQA, a Chinese medical benchmark with two components: an OpenQA segment that uses a step-by-step evaluation scheme to assess LLM responses, and a multiple-choice VQA segment centered on patient diagnosis. Experiments on over 10 LLMs reveal that many models struggle to diagnose patients from their medical backgrounds. We believe our benchmarks and findings can guide the future development of multimodal medical QA systems, particularly by improving evaluation methods and addressing the complexities of real medical contexts.
Recommended citation: Z. Chen. (2025). "medicalVQA." arXiv.
Download Paper | Download Slides | Download Bibtex
