<aside> 💡
Key Learnings:
With LLMs, building demos has become easier, but building systems that reliably solve user tasks has become even harder, considering LLMs are non-deterministic!
I learned this the hard way while building an AI product (a developer copilot) that uses AI/LLMs extensively:
https://www.youtube.com/watch?v=fCjdLPdzEnw
https://www.youtube.com/watch?v=MdF_5-DTjXo
End users don't care about our fancy AI architecture; they care about getting their job done consistently and reliably. With non-deterministic LLMs, that's harder than it sounds, which is why we need a solid AI evaluation framework.
As a Product Manager building AI products, I will be writing extensively about AI Evals. This post is the first in a series on AI evaluation frameworks, or "evals" as we call them on product teams.
Here is why Evals matter: