Google’s flagship generative AI models, Gemini 1.5 Pro and 1.5 Flash, have been touted as being able to process and analyze vast amounts of data. However, recent research suggests that these models are not as effective as claimed. Two separate studies found that Gemini 1.5 Pro and 1.5 Flash struggled to answer questions about large amounts of data correctly; in document-based tests, the models gave the right answer only 40% to 50% of the time.
According to Marzena Karpinska, a postdoc at UMass Amherst and co-author of one of the studies, the models can technically process long contexts but do not actually understand the content. A model’s context window refers to the input data it considers before generating output. Gemini’s context window can hold up to 2 million tokens, the equivalent of roughly 1.4 million words or two hours of video.
In one study, the researchers asked Gemini 1.5 Pro and 1.5 Flash to evaluate true/false statements about fiction books. The models had ingested the relevant book and had to determine the accuracy of the statements and explain their reasoning. The results showed that neither model achieved question-answering accuracy higher than random chance.
Another study tested the ability of Gemini 1.5 Flash to reason over video. The model was shown a slideshow of images and had to answer questions about their content. Flash’s accuracy was relatively low: it transcribed handwritten digits from the slideshow correctly only around 50% of the time.
These studies highlight the limitations of Google’s Gemini models and raise questions about the company’s claims regarding their capabilities. While the studies did not test the models with the full 2-million-token context, they suggest that the models may not be as effective as advertised.
Generative AI is facing increased scrutiny as businesses and investors become frustrated with its limitations. A recent survey found that many C-suite executives do not expect generative AI to bring substantial productivity gains and have concerns about potential mistakes and data compromises. Additionally, generative AI dealmaking in the earliest stages has declined significantly.
Google’s focus on Gemini’s context window as a differentiator may have been premature. The ability to process long documents and understand them is still an ongoing challenge in the field of generative AI. Better benchmarks and third-party critique are needed to address hyped-up claims and provide a more realistic evaluation of these models’ capabilities.
In short, the research suggests that Google’s Gemini models may not live up to their promise of processing and analyzing large amounts of data effectively, underscoring once more the need for better benchmarks and independent evaluation in the field of generative AI.