The Limitations of AI in Medical Diagnosis: A Study Uncovers Alarming Drops in Performance

Large language models (LLMs) and large multimodal models (LMMs) are increasingly being used in medical settings. However, a recent study conducted by researchers at the University of California, Santa Cruz and Carnegie Mellon University has revealed that these models are not yet reliable in high-stakes, real-world scenarios, particularly in medical diagnosis.

In the study, the researchers created a new dataset called ProbMed, comprising 6,303 X-ray, MRI, and CT scan images spanning various organs and regions of the body. They then asked state-of-the-art models, including GPT-4V and Gemini Pro, a series of diagnostic questions about the images.
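To make the setup concrete, here is a minimal sketch of this kind of image-plus-question probing. It is an illustration only: `ProbeItem` and `query_model` are hypothetical names standing in for the dataset records and the API call to a model such as GPT-4V or Gemini Pro, not code from the study.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ProbeItem:
    """One ProbMed-style probe: a scan plus a yes/no diagnostic question."""
    image_path: str    # X-ray, MRI, or CT image file
    question: str      # e.g. "Is there a pleural effusion in this image?"
    answer: str        # ground truth, "yes" or "no"


def accuracy(items: List[ProbeItem],
             query_model: Callable[[str, str], str]) -> float:
    """Fraction of probes the model answers correctly.

    `query_model(image_path, question)` is a hypothetical wrapper around
    whatever multimodal API is being tested; it should return "yes" or "no".
    """
    correct = sum(
        query_model(item.image_path, item.question).strip().lower() == item.answer
        for item in items
    )
    return correct / len(items) if items else 0.0
```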

The results were alarming. Even the most advanced models performed no better than random guessing when asked to identify conditions and their positions. Introducing even slight perturbations to the questions further reduced accuracy: on average, accuracy dropped by 42% across the tested models.
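The effect of such perturbations can be illustrated with a paired scoring rule, sketched below under assumptions rather than taken from the paper: each original question is matched with a perturbed variant asking about a condition or position that is not actually present, and a model is credited only when it answers both correctly. `query_model` is the same hypothetical wrapper as in the sketch above; the 42% figure refers to the average degradation the study reports, not to this code.

```python
from typing import Callable, Iterable, Tuple


def paired_accuracy(pairs: Iterable[Tuple[str, str, str]],
                    query_model: Callable[[str, str], str]) -> float:
    """Accuracy when each probe is paired with a perturbed variant.

    Each element of `pairs` is (image_path, original_q, perturbed_q), where the
    expected answer is "yes" for original_q and "no" for the perturbed variant.
    A model scores only if it answers both questions correctly.
    """
    correct = total = 0
    for image_path, original_q, perturbed_q in pairs:
        total += 1
        ok_original = query_model(image_path, original_q).strip().lower() == "yes"
        ok_perturbed = query_model(image_path, perturbed_q).strip().lower() == "no"
        if ok_original and ok_perturbed:
            correct += 1
    return correct / total if total else 0.0
```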

The researchers found that LMMs excel at recognizing image modality and organs but struggle when asked more specialized diagnostic questions. GPT-4V and Gemini Pro, in particular, had low accuracy in identifying conditions and findings. GPT-4V tended to reject challenging questions and deny ground-truth conditions, while Gemini Pro was prone to accepting false conditions and positions.

A specialized model like CheXagent, which is trained exclusively on chest X-rays, showed better accuracy in determining abnormalities and conditions. However, it struggled with general tasks such as identifying organs.

These findings have raised concerns within the research and medical communities. Many experts agree that AI is not yet ready to support medical diagnosis and that current LLMs are far from applicable in critical fields like medicine. The study emphasizes the need for more robust evaluation methodologies to ensure the reliability and accuracy of LMMs in real-world medical applications.

In conclusion, while LLMs and LMMs hold promise for medical diagnosis, they are currently not reliable enough to be used in high-stakes, real-world scenarios. Further research and development are needed to improve the accuracy and reliability of these models before they can be safely deployed in critical fields like medicine.