What Do We Mean When We Say Evals?
A practical framing of AI evals — what to evaluate across model, agent, and application layers, when to run them, and how to turn vibe coding into engineering.
All the articles I've posted.
Walking through spek, a small LLM-powered coding agent that turns a markdown spec into a working, tested Python package — and the six design questions I had to answer to build it.
An exploration of what ChatGPT's code execution environment can and can't do — filesystem access, process introspection, networking, and the curious 'prove it' prompting pattern.