
Introducing LiveBench: A New Benchmark for Evaluating Language Models in AI

LiveBench: A New AI Benchmark to Address Limitations

A team of researchers from Abacus.ai, Nvidia, New York University, the University of Maryland, and the University of Southern California has developed a new benchmark called LiveBench that aims to address the limitations of existing industry benchmarks. LiveBench is a general-purpose benchmark for large language models (LLMs) that provides contamination-free test data: models are evaluated on fresh, diverse questions that do not appear in their training sets, minimizing the risk of artificially inflated performance.

The release of LiveBench is also notable for its contributors. Yann LeCun, Chief AI Scientist at Meta and a pioneer in the field of AI, is among them, alongside researchers from Abacus.ai, Nvidia, and the participating universities.

The Need for Better LLM Benchmarks

According to Micah Goldblum, one of the creators of LiveBench, the team recognized the need for better LLM benchmarks because existing ones did not align with their qualitative experience using LLMs. They wanted to build a benchmark that would generate fresh questions every time a model is evaluated, preventing test set contamination. With funding and support from Abacus.ai, the project evolved into a collaborative effort involving multiple institutions.

Challenges with Existing Benchmarks

The team behind LiveBench highlights the limitations of existing benchmarks for LLMs. Traditional benchmarks are often published on the internet and can be included in the training data of LLMs. As a result, the models’ performance on these benchmarks can be artificially inflated, rendering the benchmarks unreliable. Additionally, benchmarks that rely on human prompting and judging can introduce unconscious biases and errors.

The Features of LiveBench

LiveBench addresses these challenges by providing fresh questions sourced from recent datasets and competitions. The benchmark covers a wide range of tasks, including math, coding, reasoning, language comprehension, instruction following, and data analysis. Each question has an objective ground-truth answer, allowing accurate, automatic scoring without human evaluators. LiveBench currently offers 960 questions, with new and more challenging questions released monthly.
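To make the scoring approach concrete, here is a minimal sketch of how ground-truth-based, automatic evaluation can work. The field names (`question`, `ground_truth`), the `model.generate` interface, and the exact-match rule are illustrative assumptions, not LiveBench's actual schema or scoring code.

```python
# Minimal sketch of objective, automatic scoring against ground-truth answers.
# Field names ("question", "ground_truth"), the model.generate interface, and
# the exact-match rule are illustrative assumptions, not LiveBench's own code.

def score_response(model_answer: str, ground_truth: str) -> float:
    """Return 1.0 for a normalized exact match, 0.0 otherwise."""
    normalize = lambda s: s.strip().lower()
    return 1.0 if normalize(model_answer) == normalize(ground_truth) else 0.0

def evaluate(model, questions: list[dict]) -> float:
    """Average score of `model` over a list of question records."""
    scores = [
        score_response(model.generate(q["question"]), q["ground_truth"])
        for q in questions
    ]
    return sum(scores) / len(scores)
```

Because every question carries an objective answer, this kind of loop needs no human judge, which is the property the LiveBench team emphasizes.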

Categories and Difficulty Levels

LiveBench includes 18 tasks across six categories: math, coding, reasoning, language comprehension, instruction following, and data analysis. Tasks vary in difficulty from easy to very challenging, and the benchmark is calibrated so that top models achieve success rates of roughly 30% to 70%.
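As a rough sketch of how the per-category question sets could be fetched and counted, the snippet below uses the Hugging Face `datasets` library. The repository names (`livebench/math`, and so on) and the `test` split are assumptions inferred from the six published categories; check the project's release pages for the exact identifiers.

```python
# Sketch: fetch LiveBench questions per category via the Hugging Face Hub.
# The repo names below and the "test" split are assumptions inferred from the
# six categories; consult the LiveBench release for the exact identifiers.
from datasets import load_dataset

CATEGORIES = [
    "math",
    "coding",
    "reasoning",
    "language",
    "instruction_following",
    "data_analysis",
]

questions_by_category = {}
for category in CATEGORIES:
    ds = load_dataset(f"livebench/{category}", split="test")
    questions_by_category[category] = list(ds)
    print(f"{category}: {len(ds)} questions")
```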

Accuracy of Top Models on LiveBench

The creators of LiveBench have evaluated numerous closed-source and open-source models ranging in size from 500 million to 110 billion parameters. They report that top models achieve less than 60% accuracy on the benchmark. For example, OpenAI's GPT-4o currently tops the leaderboard with a global average score of 53.79.
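For illustration, a leaderboard-style "global average" can be thought of as an aggregate of per-category scores. The sketch below assumes an unweighted mean over the six categories; that aggregation rule is an assumption for illustration, not a description of the official computation, and the example scores are made up.

```python
# Sketch: aggregate per-category scores (on a 0-100 scale) into a single
# leaderboard-style "global average". Treating it as an unweighted mean of the
# six category scores is an assumption, not the documented aggregation rule.

def global_average(category_scores: dict[str, float]) -> float:
    """Unweighted mean of the per-category scores."""
    return sum(category_scores.values()) / len(category_scores)

# Made-up illustrative scores (not real leaderboard numbers):
example_scores = {
    "math": 52.0,
    "coding": 48.5,
    "reasoning": 55.0,
    "language": 58.5,
    "instruction_following": 70.0,
    "data_analysis": 54.5,
}
print(f"Global average: {global_average(example_scores):.2f}")
```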

Benefits for Enterprises

LiveBench gives business leaders a reliable standard for evaluating LLMs: they can compare models without worrying about contamination or biased judging, helping them make informed decisions when adopting AI and building AI-focused products.

Comparing LiveBench to Other Benchmarks

LiveBench has been compared to other prominent LLM benchmarks such as Chatbot Arena and Arena-Hard. While LiveBench shows broadly similar trends to these benchmarks, individual LLM scores differ in places. These differences could be due to known biases in how those benchmarks are judged, or to other, as-yet-unknown factors.
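One hedged way to quantify "similar trends but different individual scores" is a rank correlation between two leaderboards over the same set of models. The sketch below uses SciPy's Spearman correlation; the score lists are placeholders for illustration, not real leaderboard data.

```python
# Sketch: measure how similarly two benchmarks rank the same models using
# Spearman rank correlation. The score lists are placeholders, not real data.
from scipy.stats import spearmanr

livebench_scores = [53.8, 51.2, 49.7, 45.3, 42.0]   # hypothetical values
arena_hard_scores = [82.6, 79.1, 75.4, 70.2, 74.8]  # hypothetical values

rho, p_value = spearmanr(livebench_scores, arena_hard_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```

A high rank correlation would indicate the benchmarks largely agree on which models are stronger, even when the absolute scores diverge.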

An Open-Source Benchmark for All

LiveBench is an open-source benchmark that anyone can use and contribute to. The creators plan to continue maintaining it by releasing new questions every month. They also intend to expand the benchmark by adding more categories and tasks in the future. The team believes in open science and encourages collaboration and contributions from the AI community.

Conclusion

LiveBench is a promising development in the field of AI benchmarks, addressing the limitations of existing benchmarks and providing a reliable evaluation standard for LLMs. With its fresh and diverse questions and objective scoring system, LiveBench offers a more accurate representation of an LLM’s capabilities. By making the benchmark open-source, the team aims to foster collaboration and support the development of better models and evaluation methods in the future.