Home ai Improving Legibility of AI Models: OpenAI’s New Algorithm Explains Itself Better

Improving Legibility of AI Models: OpenAI’s New Algorithm Explains Itself Better

The OpenAI researchers have developed a new algorithm that aims to improve the legibility of large language models (LLMs) like OpenAI’s GPT-4. This algorithm addresses the “legibility” problem, which refers to the lack of transparency in how AI models arrive at their answers. The goal is to establish trustworthiness in AI systems, especially as they are integrated into fields where incorrectness can have serious consequences, such as healthcare, law, and defense applications.

The algorithm is based on the “Prover-Verifier Game” concept, initially proposed by machine learning researchers at the University of Toronto and Vector Institute for Artificial Intelligence. In this game, two AI models, a “prover” and a “verifier,” try to outwit each other. The prover’s objective is to convince the verifier of a certain answer, regardless of its correctness, while the verifier’s goal is always to select the correct answer. This game encourages AI models to show their work and provide verifiable explanations for their answers.

OpenAI implemented the Prover-Verifier Game by using two GPT-4 models and having them engage in multiple rounds of the game, answering math word problems with known answers. The prover model was set up to be either “helpful” or “sneaky,” and the verifier model had to evaluate the prover’s answers based on its training data. Both models were retrained between rounds to improve their performance and explanation capabilities.

Human evaluators were used to rate the legibility of the prover model’s answers. After several rounds, the researchers found that the verifier model became better at resisting the persuasion techniques of the sneaky prover model, while the prover model improved its ability to explain its choices to human users.

The resulting algorithm optimizes LLMs for both correctness and legibility. OpenAI hopes that this work will contribute to the development of AI systems that are not only accurate but also transparently verifiable, enhancing trust and safety in real-world applications. It also has the potential to align future models that surpass human intelligence. However, as models become more intelligent, it may become challenging for humans to reliably evaluate the correctness of their completions.

Exit mobile version