Automated Evaluation Platform Detects Errors in Large Language Models

San Francisco-based startup Patronus AI has secured $17 million in Series A funding to develop an automated evaluation platform that can identify mistakes in large language models (LLMs). The funding round, led by Glenn Solomon at Notable Capital, brings Patronus AI's total funding to $20 million. The platform, developed by former Meta machine learning (ML) experts Anand Kannappan and Rebecca Qian, uses proprietary AI to detect errors such as hallucinations, copyright infringement, and safety violations in LLM outputs. It also enables granular benchmarking and stress-tests models with adversarial examples.

The Dark Side of Generative AI: Hallucinations, Copyright Violations, and Safety Risks

Patronus AI's platform is designed to catch a range of mistakes in LLMs, including hallucinations and copyright- and safety-related risks. In an interview with VentureBeat, Kannappan explained that the platform can also address enterprise-specific requirements such as style and tone of voice. These concerns have become more prominent as companies rush to implement generative AI models like OpenAI's GPT-4o and Meta's Llama 3. High-profile failures, such as CNET publishing error-riddled AI-generated articles and drug discovery startups retracting research papers based on LLM-hallucinated molecules, have highlighted the need for more accurate and safer LLMs.

Patronus AI’s Groundbreaking Research Reveals LLM Deficiencies

Patronus AI has published research that exposes deficiencies in the accuracy and safety of leading LLMs. For example, their “FinanceBench” benchmark found that the best-performing model answered only 19% of financial questions correctly after analyzing an entire annual report. Another experiment using the “CopyrightCatcher” API revealed that open-source LLMs reproduced copyrighted text verbatim in 44% of outputs. Qian, the CTO of Patronus AI, emphasized the risks of copyright infringement and the importance of addressing these issues, especially for large publishers and media companies.
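The article does not describe how the "CopyrightCatcher" API works internally, but the general idea of flagging verbatim reproduction can be illustrated with a minimal sketch. The function names and the 50-character threshold below are assumptions for illustration, not Patronus AI's actual implementation: flag an output if it shares a sufficiently long contiguous substring with any document in a protected corpus.

```python
def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest contiguous substring shared by a and b
    (classic O(len(a) * len(b)) dynamic-programming approach)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def flags_verbatim_copy(output: str, corpus: list[str], min_chars: int = 50) -> bool:
    """Flag an LLM output that reproduces at least min_chars of any
    protected text verbatim (hypothetical threshold)."""
    return any(longest_common_substring(output, doc) >= min_chars for doc in corpus)
```

A production system would use far more scalable matching (e.g. hashing or suffix structures) over a large corpus, but the pass/fail shape of the check is the same: a binary "reproduced verbatim" signal per output.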

Differentiating Factors and Industry Adoption

While other startups are also building tools for LLM evaluation, Patronus AI believes its research-first approach and deep expertise set it apart. The company’s core technology is based on training evaluation models that can identify edge cases where a given LLM is likely to fail. Several Fortune 500 companies in industries like automotive, education, finance, and software have already adopted Patronus AI’s platform to deploy LLMs safely within their organizations. With the new funding, Patronus plans to scale up its research, engineering, and sales teams while developing additional industry benchmarks.

The Future of LLM Evaluation and Deployment

Patronus AI envisions a future where rigorous automated evaluation of LLMs becomes standard for enterprises deploying the technology. Just as security audits paved the way for widespread cloud adoption, testing models with Patronus AI could become as common as unit-testing code. The platform is domain-agnostic, meaning it can be extended to any industry, such as legal or healthcare. The goal is to enable enterprises across all industries to leverage the power of LLMs while ensuring the models are safe and aligned with specific use-case requirements. However, validating LLM performance remains a challenge due to the black-box nature of foundation models and the vast space of possible outputs. Patronus AI aims to advance the state of the art in AI evaluation and accelerate the path to accountable real-world deployment by catching mistakes in a reliable and scalable way that manual testing cannot achieve.
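The unit-testing analogy above can be made concrete with a toy example. This sketch is hypothetical and unrelated to Patronus AI's actual API: it asserts that every figure quoted in a model's answer appears verbatim in the source document, a crude guard against the kind of hallucinated financial numbers the "FinanceBench" results describe.

```python
import re

# Matches digit sequences, optionally with commas or decimal points
# (e.g. "17", "1,000", "3.5").
NUMBER = re.compile(r"\d[\d,.]*")


def numbers_grounded(answer: str, source: str) -> bool:
    """Naive groundedness check: every number quoted in the answer
    must also appear verbatim in the source text."""
    source_numbers = set(NUMBER.findall(source))
    return all(n in source_numbers for n in NUMBER.findall(answer))
```

In a test suite this runs like any other assertion, e.g. `assert numbers_grounded(model_answer, annual_report)`, which is what "as common as unit-testing code" would look like in practice; real evaluators would of course go well beyond exact-match numbers.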
