This is the second in a series of articles about my learnings from building an AI product: defining evals, iterating through solutions, and finally demonstrating impact for customers and the business. If you have not read the first article, start here.
<aside>
đź’ˇ
- Don't let jargon like "evals" or "LLM-as-a-judge" intimidate you! Whether you are a product manager, designer, builder, or business person, you should know whether your AI product is working or not.
- Here is how I, as an AI PM, created an evals system for the AI product I was working on (explained without any jargon, I promise!).
</aside>
First, why evals?
AI is non-deterministic, but your product, and the customers who use it, need reliability to get their tasks done. Evals are a critical part of AI product development, and product managers should be the internal champions of rigorous evals.
Evals are a systematic way to measure and communicate AI product quality to your customers and teams.
How to do Evals

- Step 1: Curate your datasets (test cases with expected outputs)
- Step 2: Define your evaluation methodology and parameters:
    - Define how anyone will compare the ideal output vs the AI output
    - Define evaluation parameters
    - Decide which aggregated metrics you will report from the evaluation
- Step 3: Perform the evaluation
    - Manual (human judge) - start here!
    - Automated (LLM-as-a-judge) - more on this topic later!
- Step 4: Analyse error patterns: work out why test cases failed and tag them
    - Is the LLM not getting all the context?
    - Is the context not in the right format?
    - Are the customer inputs poor?
    - more…
- Step 5: Report evaluation metrics
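The steps above can be sketched as a tiny eval harness. Everything here is hypothetical (the toy dataset, the `run_ai_product` stand-in, the exact-match comparison, and the error tags are all illustrative assumptions, not the actual product's eval code), but it shows how the pieces fit together:

```python
# Minimal eval-harness sketch (all names and data are hypothetical).
from collections import Counter

# Step 1: curate a dataset of test cases with expected outputs.
dataset = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
    {"input": "3 * 3", "expected": "9"},
]

def run_ai_product(prompt: str) -> str:
    """Stand-in for calling the real AI product under test."""
    canned = {"2 + 2": "4", "capital of France": "paris", "3 * 3": "6"}
    return canned.get(prompt, "")

# Step 2: define how ideal output vs AI output is compared
# (here: case-insensitive exact match; real evals are usually richer).
def is_correct(expected: str, actual: str) -> bool:
    return expected.strip().lower() == actual.strip().lower()

# Step 4: tag failures with a (simplified) error pattern.
def tag_error(actual: str) -> str:
    return "empty_output" if not actual else "wrong_answer"

# Step 3: perform the evaluation over every test case.
results = []
for case in dataset:
    actual = run_ai_product(case["input"])
    passed = is_correct(case["expected"], actual)
    results.append({
        "input": case["input"],
        "passed": passed,
        "error_tag": None if passed else tag_error(actual),
    })

# Step 5: report aggregated metrics.
pass_rate = sum(r["passed"] for r in results) / len(results)
error_counts = Counter(r["error_tag"] for r in results if not r["passed"])
print(f"pass rate: {pass_rate:.0%}")   # → pass rate: 67%
print(dict(error_counts))              # → {'wrong_answer': 1}
```

A human judge replaces `is_correct` at first; swapping in an LLM call there is essentially what "LLM-as-a-judge" means.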
Example: How I did evals for Developer Copilot, a coding agent