This is the second in a series of articles about my learnings from building an AI product: defining evals, iterating through solutions, and finally demonstrating impact for customers and the business. If you have not read the first article, start here.
<aside>
đź’ˇ
- Don't let jargon like "evals" or "LLM-as-a-judge" intimidate you! Whether you are a product manager, designer, builder, or business person, you should know whether your AI product is working or not.
- Here is how I, as an AI PM, created an evals system for the AI product I was working on (explained without any jargon, I promise!).
</aside>
First, why evals?
AI is non-deterministic, but your product, and the customers who use it, need reliability to get their tasks done. Evals are a critical part of AI product development, and product managers should be the internal champions of rigorous evals.
Evals are a systematic way to measure and communicate AI product quality to your customers and teams.
How to do Evals

- Step 1: Curate your datasets (test cases with expected outputs)
- Step 2: Define your evaluation methodology and parameters:
    - Define how anyone will compare the ideal output vs the AI output
    - Define evaluation parameters
    - Decide which aggregated metrics you will report from the evaluation
- Step 3: Perform the evaluation
    - Manual (human judge) - start here!
    - Automated (LLM-as-a-judge) - more on this topic later!
- Step 4: Analyse error patterns: work out why test cases failed and tag them
    - Is the LLM not getting all the context?
    - Is the context not in the right format?
    - Are the customer inputs poor?
    - more…
- Step 5: Report evaluation metrics
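The steps above can be sketched as a tiny eval harness. Everything here is hypothetical (the toy dataset, the `run_ai_product` stand-in, the exact-match comparison, and the error tags are all illustrative assumptions, not the actual product's eval code), but it shows how the pieces fit together:

```python
# Minimal eval-harness sketch (all names and data are hypothetical).
from collections import Counter

# Step 1: curate a dataset of test cases with expected outputs.
dataset = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
    {"input": "3 * 3", "expected": "9"},
]

def run_ai_product(prompt: str) -> str:
    """Stand-in for calling the real AI product under test."""
    canned = {"2 + 2": "4", "capital of France": "paris", "3 * 3": "6"}
    return canned.get(prompt, "")

# Step 2: define how ideal output vs AI output is compared
# (here: case-insensitive exact match; real evals are usually richer).
def is_correct(expected: str, actual: str) -> bool:
    return expected.strip().lower() == actual.strip().lower()

# Step 4: tag failures with a (simplified) error pattern.
def tag_error(actual: str) -> str:
    return "empty_output" if not actual else "wrong_answer"

# Step 3: perform the evaluation over every test case.
results = []
for case in dataset:
    actual = run_ai_product(case["input"])
    passed = is_correct(case["expected"], actual)
    results.append({
        "input": case["input"],
        "passed": passed,
        "error_tag": None if passed else tag_error(actual),
    })

# Step 5: report aggregated metrics.
pass_rate = sum(r["passed"] for r in results) / len(results)
error_counts = Counter(r["error_tag"] for r in results if not r["passed"])
print(f"pass rate: {pass_rate:.0%}")   # → pass rate: 67%
print(dict(error_counts))              # → {'wrong_answer': 1}
```

A human judge replaces `is_correct` at first; swapping in an LLM call there is essentially what "LLM-as-a-judge" means.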
Example: How I did evals for Developer Copilot, a coding agent