Meet OpenAI Evals. Along with the release of GPT-4, OpenAI also released an open-source software framework for testing the efficacy of its AI models.
The OpenAI team has announced a new set of tools, called Evals, that lets anyone report shortcomings in the company’s models and help guide improvements.
we are open-sourcing OpenAI Evals, our framework for automated evaluation of AI model performance, to allow anyone to help improve our models.
— Sam Altman (@sama) March 14, 2023
What is OpenAI Evals?
In a blog post, OpenAI describes Evals as a “crowdsourcing approach” to validating models.
“We use Evals to guide development of our models (both identifying shortcomings and preventing regressions), and our users can apply it for tracking performance across model versions and evolving product integrations,” OpenAI writes. “We are hoping Evals becomes a vehicle to share and crowdsource benchmarks, representing a maximally wide set of failure modes and difficult tasks.”
The goal of OpenAI’s Evals project is to build and run benchmarks that assess models like GPT-4 by examining their performance closely. With Evals, programmers can use datasets to generate questions, measure the accuracy of an OpenAI model’s responses, and compare performance across different datasets and models.
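To make that workflow concrete, here is a minimal Python sketch of the general idea: build a question from each dataset row, send it to a model, and score the answers by exact match. It does not use the Evals library itself. The dataset, the `build_prompt` and `exact_match_accuracy` helpers, and the `stub_model` stand-in are all hypothetical names chosen for illustration.

```python
from typing import Callable

# Hypothetical dataset: each sample pairs a question with its expected answer.
SAMPLES = [
    {"question": "What is 2 + 2?", "ideal": "4"},
    {"question": "What is the capital of France?", "ideal": "Paris"},
    {"question": "How many days are in a week?", "ideal": "7"},
]


def build_prompt(sample: dict) -> str:
    """Turn a dataset row into the prompt text sent to the model."""
    return f"Answer concisely.\nQ: {sample['question']}\nA:"


def exact_match_accuracy(model: Callable[[str], str], samples: list) -> float:
    """Return the fraction of samples whose completion equals the ideal answer."""
    if not samples:
        raise ValueError("samples must not be empty")
    correct = 0
    for sample in samples:
        completion = model(build_prompt(sample)).strip()
        if completion == sample["ideal"]:
            correct += 1
    return correct / len(samples)


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call; it only knows two of the three answers."""
    canned = {
        "What is 2 + 2?": "4",
        "What is the capital of France?": "Paris",
    }
    for question, answer in canned.items():
        if question in prompt:
            return answer
    return "unknown"


accuracy = exact_match_accuracy(stub_model, SAMPLES)
print(f"accuracy: {accuracy:.2f}")
```

In practice the stub would be replaced by a call to a real model, and the same scoring function could be run against several model versions to track regressions.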
Evals is compatible with several well-known AI benchmarks and also lets you write new classes that implement your own evaluation logic. As an example to follow, OpenAI designed a logic-puzzle evaluation containing 10 problems with which GPT-4 struggles.
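The idea of plugging in custom evaluation logic through a new class can be sketched roughly as follows. This is an illustration only, not the Evals library's actual interface: `BaseEval`, `ContainsAnswerEval`, the two puzzles, and both stub models are hypothetical, and the puzzles are not taken from OpenAI's logic-puzzle evaluation.

```python
from abc import ABC, abstractmethod
from typing import Callable


class BaseEval(ABC):
    """Hypothetical base class: subclasses supply the per-sample scoring rule."""

    def __init__(self, samples: list):
        if not samples:
            raise ValueError("samples must not be empty")
        self.samples = samples

    @abstractmethod
    def score(self, completion: str, ideal: str) -> bool:
        """Return True if the completion counts as correct."""

    def run(self, model: Callable[[str], str]) -> float:
        """Query the model on every sample and return the pass rate."""
        passed = sum(
            self.score(model(s["input"]), s["ideal"]) for s in self.samples
        )
        return passed / len(self.samples)


class ContainsAnswerEval(BaseEval):
    """Custom logic: pass if the ideal answer appears anywhere in the reply.

    Substring matching is deliberately loose (for example, "5" also matches
    "15"), so a real evaluation would likely need a stricter rule.
    """

    def score(self, completion: str, ideal: str) -> bool:
        return ideal.lower() in completion.lower()


# Made-up logic puzzles for illustration.
PUZZLES = [
    {
        "input": "If all bloops are razzies and all razzies are lazzies, "
                 "are all bloops lazzies?",
        "ideal": "yes",
    },
    {
        "input": "A bat and a ball cost $1.10 in total. The bat costs $1.00 "
                 "more than the ball. How many cents does the ball cost?",
        "ideal": "5",
    },
]


def model_a(prompt: str) -> str:
    """Stub model that answers both puzzles correctly."""
    return "Yes, they are." if "bloops" in prompt else "The ball costs 5 cents."


def model_b(prompt: str) -> str:
    """Stub model that falls for the bat-and-ball trap."""
    return "Yes." if "bloops" in prompt else "10 cents."


puzzle_eval = ContainsAnswerEval(PUZZLES)
results = {"model_a": puzzle_eval.run(model_a), "model_b": puzzle_eval.run(model_b)}
print(results)
```

Keeping the scoring rule in a subclass means the same `run` loop can be reused while the definition of a correct answer changes from one evaluation to the next.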
Contributions are unpaid, which is a real drawback. Still, to encourage use of Evals, OpenAI intends to grant GPT-4 access to those who contribute “high-quality” benchmarks.
“We believe that Evals will be an integral part of the process for using and building on top of our models, and we welcome direct contributions, questions, and feedback,” OpenAI writes.
OpenAI, which announced that it will stop using consumer data to train its models by default, joins the ranks of companies that have turned to crowdsourcing to strengthen their AI models, in this case through Evals.