This month, the Klaviyo Data Science Podcast welcomes Evan Miller to deliver a seminar on his recently published paper, Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations! This episode is a mix of a live seminar Evan gave to the team at Klaviyo and an interview we conducted with him afterward.
Suppose you’re trying to understand the performance of an AI model: maybe one you built or fine-tuned and are comparing to state-of-the-art models, maybe one you’re considering using for a project you’re about to start. If you look at the literature today, you can get a sense of the model’s average performance on an evaluation or set of tasks. But often, that’s unfortunately the extent of what it’s possible to learn; far less emphasis is placed on the variability or uncertainty inherent in those estimates. And as anyone who has worked with a statistical model can affirm, variability is a huge part of why you might choose to use or discard a model.
This seminar explores how to best compute, summarize, and display estimates of variability for AI models. Listen along to hear about topics like:
About Evan Miller
You may already know our guest Evan Miller from his fantastic blog, which includes his celebrated A/B testing posts, such as “How not to run an A/B test.” You may also have used his A/B testing tools, such as the sample size calculator. Evan currently works as a research scientist at Anthropic.
About Anthropic
Per Anthropic’s website:
Anthropic is an AI safety and research company based in San Francisco. Our interdisciplinary team has experience across ML, physics, policy, and product. Together, we generate research and create reliable, beneficial AI systems.
You can find more information about Anthropic, including links to their social media accounts, on the company website.
Special thanks to Chris Murphy at Klaviyo for organizing this seminar and making this episode possible!
For the full show notes, including who's who, see the Medium writeup.