Addressing the Problem of Factual Hallucinations with DataGemma Models
Google has introduced DataGemma, a pair of open models designed to tackle the issue of hallucinations in large language models (LLMs). LLMs are prone to producing inaccurate answers, particularly for queries involving statistical data. The DataGemma models, available on Hugging Face for academic and research purposes, aim to enhance factual accuracy by grounding responses in real-world data from the Google-created Data Commons platform.
Understanding the Challenge of Factual Hallucinations
Despite the advancements in LLMs, hallucinations remain a significant problem. These models have revolutionized various applications, but they often struggle with numerical and statistical data, leading to inaccuracies. The causes of these hallucinations include the probabilistic nature of LLM generations and the lack of sufficient factual coverage in training data.
Traditional grounding approaches have not effectively addressed the issue, especially for statistical queries. Public statistical data is distributed in different schemas and formats, requiring proper context for accurate interpretation.
Two Approaches to Enhance Factual Accuracy
To overcome these challenges, Google researchers combined the Gemma family of language models with Data Commons, a vast repository of normalized public statistical data. They employed two distinct approaches: Retrieval Interleaved Generation (RIG) and Retrieval Augmented Generation (RAG).
In RIG, the fine-tuned LLM interleaves natural language Data Commons queries into its generation alongside its own statistical estimates. Each query is converted into a structured data query and run against Data Commons, and the retrieved statistic is used to check and correct the model's figure, with a citation back to the source. This approach improves factual accuracy by verifying the LLM's numbers against authoritative data at the point where they appear.
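The RIG correction step can be sketched as follows. This is a minimal illustration, not the actual DataGemma pipeline: the marker syntax, the `fetch_statistic` helper, and the in-memory table standing in for Data Commons are all assumptions made for the example.

```python
import re

# Stand-in for Data Commons: a tiny in-memory table of statistics.
# The real system converts the natural language query into a structured
# Data Commons query; here we just do a dictionary lookup.
MOCK_DATA_COMMONS = {
    "population of California 2022": "39.0 million",
}


def fetch_statistic(natural_language_query):
    """Pretend to resolve a natural language query against Data Commons."""
    return MOCK_DATA_COMMONS.get(natural_language_query)


def apply_rig(generation):
    """Replace each interleaved [DC(query) -> guess] span with the
    retrieved statistic, keeping the model's own guess as a fallback
    when retrieval finds nothing."""
    def substitute(match):
        query, guess = match.group(1), match.group(2)
        retrieved = fetch_statistic(query)
        return retrieved if retrieved is not None else guess

    return re.sub(r"\[DC\((.*?)\)\s*->\s*(.*?)\]", substitute, generation)


# The fine-tuned model emits a draft with an interleaved query marker;
# RIG swaps the model's estimate for the retrieved value.
draft = "California had roughly [DC(population of California 2022) -> 35 million] residents."
print(apply_rig(draft))
```

The key design point is that the model's original estimate survives as a fallback, so a failed retrieval degrades to ordinary LLM behavior rather than an empty answer.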
RAG, on the other hand, starts from the user's statistical question: relevant variables are extracted and turned into a natural language query for Data Commons, which is then executed against the database to fetch the appropriate statistics or tables. The retrieved values, along with the original user query, are passed to a long-context LLM, which produces the final grounded answer.
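The RAG flow described above can be sketched in a few lines. Again this is an illustrative assumption, not the real implementation: the topic-matching retrieval, the `MOCK_TABLES` data, and the prompt format are invented for the example, and the real system queries Data Commons and a long-context Gemini model.

```python
# Stand-in for Data Commons tables keyed by topic.
MOCK_TABLES = {
    "unemployment rate": {"2021": "5.4%", "2022": "3.6%"},
}


def retrieve_tables(user_query):
    """Pull every mock table whose topic appears in the query.
    The real pipeline extracts variables and issues structured queries."""
    query = user_query.lower()
    return {topic: table for topic, table in MOCK_TABLES.items()
            if topic in query}


def build_prompt(user_query):
    """Prepend the retrieved statistics to the original question so a
    long-context LLM can ground its final answer in them."""
    tables = retrieve_tables(user_query)
    context = "\n".join(f"{topic}: {table}" for topic, table in tables.items())
    return f"Context:\n{context}\n\nQuestion: {user_query}"


prompt = build_prompt("How did the unemployment rate change from 2021 to 2022?")
print(prompt)
```

Because whole tables are placed in the prompt rather than single corrected values, the answering model needs a long context window, which is the limitation the article notes below.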
Significant Improvements in Early Tests
In tests conducted on a set of 101 queries, DataGemma models fine-tuned with RIG demonstrated a substantial improvement in factuality compared to baseline models. The factuality of the baseline models ranged from 5% to 17%, while the DataGemma models achieved approximately 58% factuality.
DataGemma models fine-tuned with RAG were also able to answer a significant portion of queries with statistical responses drawn from Data Commons. The LLMs cited the retrieved numbers accurately 99% of the time, but drew incorrect inferences from those numbers in 6% to 20% of cases.
Both RIG and RAG have their strengths and weaknesses. RIG is faster and verifies each individual statistic in place, but provides less surrounding detail. RAG offers more comprehensive data, but is limited by what Data Commons covers and by the need for models that can handle very long contexts.
The Future of DataGemma and Further Research
Google intends to refine the methodologies of RIG and RAG as it continues its research. The company plans to scale up this work, subject it to rigorous testing, and integrate these enhanced functionalities into the Gemma and Gemini models. Initially, this will be done through a phased, limited-access approach.
The release of DataGemma with RIG and RAG to the public aims to inspire further research and the development of stronger, better-grounded models. By addressing the problem of factual hallucinations, these models have the potential to improve the accuracy of AI systems in handling statistical queries, benefiting research and decision-making processes.
In conclusion, Google’s DataGemma models represent a significant step forward in mitigating factual hallucinations in AI language models. By leveraging real-world data through both retrieval-interleaved and retrieval-augmented approaches, these models demonstrate improved accuracy in handling statistical queries. As further research and development are conducted, we can expect even more refined and reliable AI models in the future.