
The Hidden Blindness of “Multi-Modal” Language Models: Why AI Can’t Really See

The latest language models, including GPT-4o and Gemini 1.5 Pro, are advertised as "multi-modal," capable of understanding images and audio in addition to text. However, a recent study has found that these models do not truly see the way humans do, and in some cases they may not see at all.

The marketing and benchmarks published by these companies imply that the models have vision capabilities and can analyze images and videos, claiming they can perform tasks such as solving homework problems or watching a game. While these claims are carefully phrased, they still suggest that the models have some form of visual understanding. In reality, the models rely on matching patterns in input data to patterns in their training data, much as they do when performing math or generating stories. As a result, these models fail at tasks that seem trivial to humans, such as picking a random number.

Researchers from Auburn University and the University of Alberta conducted a systematic study to assess the visual understanding of current AI models. They presented the largest multimodal models with simple visual tasks that a first-grader could easily complete: determining whether two shapes overlap, counting the number of pentagons in a picture, or identifying which letter in a word is circled. Despite the simplicity of these tasks, the AI models struggled with them badly.

The researchers found that the models performed poorly on tasks like determining whether shapes overlap. When two circles were close together or slightly overlapping, the models could only answer correctly a fraction of the time. Similarly, when asked to count interlocking circles in an image, the models achieved perfect accuracy with five rings but failed as soon as a sixth was added, likely because the five-ring pattern of the Olympic logo appears prominently in training data. These inconsistencies show that the models' performance does not match what we would expect from genuine visual understanding.
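The test images described in the study are simple enough to reproduce programmatically. The sketch below, assuming Python with the Pillow library, shows how images of this kind might be generated; the dimensions, colors, and parameters are illustrative and not the researchers' actual code.

```python
# Illustrative sketch only: draws simple test images similar in spirit to those
# described in the study (two circles at a controlled gap, and a row of
# interlocking rings). Parameters are assumptions, not the study's settings.
from PIL import Image, ImageDraw

def two_circles(gap: int, radius: int = 60, size=(400, 200)) -> Image.Image:
    """Draw two circles whose edges are `gap` pixels apart (negative gap = overlap)."""
    img = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(img)
    cy = size[1] // 2
    left_cx = size[0] // 2 - radius - gap // 2
    right_cx = size[0] // 2 + radius + gap // 2
    for cx in (left_cx, right_cx):
        draw.ellipse([cx - radius, cy - radius, cx + radius, cy + radius],
                     outline="black", width=4)
    return img

def interlocking_rings(n: int, radius: int = 50, overlap: int = 20) -> Image.Image:
    """Draw `n` rings in a horizontal row, each overlapping its neighbor."""
    step = 2 * radius - overlap
    img = Image.new("RGB", (step * n + 2 * radius, 4 * radius), "white")
    draw = ImageDraw.Draw(img)
    cy = img.height // 2
    for i in range(n):
        cx = radius + 10 + i * step
        draw.ellipse([cx - radius, cy - radius, cx + radius, cy + radius],
                     outline="black", width=4)
    return img

# A model would then be asked questions like "Do the two circles overlap?"
# or "How many circles are in this image?"
two_circles(gap=-10).save("circles_overlap.png")
interlocking_rings(6).save("six_rings.png")
```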

The researchers concluded that these models do not truly "see" as humans do. Their inability to perform elementary reasoning about images highlights their limitations in specific areas. It is important to note, however, that these models excel at other tasks, such as interpreting human actions and expressions or identifying everyday objects in photos; within their intended use cases, they are likely to be highly accurate.

While the marketing materials from AI companies may create the impression that these models possess comprehensive visual capabilities, research like this study is necessary to shed light on their actual limitations. Even though these models can accurately identify certain visual cues or patterns, they lack a true understanding of what they are "looking" at. Their responses are based on abstract, approximate information extracted from images, much like a person who has had an image described to them but cannot see it themselves.

In conclusion, while these “visual” AI models have their limitations, they still hold immense value in various applications. They may not possess the same level of visual understanding as humans, but they excel in specific tasks within their intended scope. Understanding the true capabilities of these models requires a nuanced perspective that goes beyond the marketing claims and delves into the intricacies of their training and decision-making processes.
