by Dylan Huang on August 26, 2025
Traditional testing assumed deterministic, fast code. You could run thousands of unit tests in seconds because each test took microseconds to milliseconds and had a single correct answer.
LLM-powered apps flip this: each call is slow (network plus model latency), stochastic, and billed. Those calls are also your core business logic.
The strategy shifts from exhaustive unit tests to benchmarks:
Bottom line: Deterministic unit tests don't scale to LLMs. Benchmarks and periodic, statistical evaluation are the new way to ensure quality without freezing velocity.
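To make "statistical evaluation" concrete, here is a minimal sketch of a benchmark that samples each case several times and reports a pass rate with a confidence interval instead of a single pass/fail. Everything in it is an assumption for illustration: `run_benchmark`, `wilson_interval`, and `make_fake_model` are hypothetical names, the grader is a simple exact-match check, and the "model" is a seeded random stub standing in for a real LLM call.

```python
import math
import random
from typing import Callable

Grader = Callable[[str], bool]


def wilson_interval(passes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial pass rate (z=1.96 is roughly 95%)."""
    if trials == 0:
        return (0.0, 1.0)
    p = passes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def run_benchmark(
    model: Callable[[str], str],
    cases: list[tuple[str, Grader]],
    samples_per_case: int = 5,
) -> dict[str, float]:
    """Call the model several times per case and aggregate graded results."""
    passes = 0
    trials = 0
    for prompt, grade in cases:
        for _ in range(samples_per_case):
            passes += bool(grade(model(prompt)))
            trials += 1
    low, high = wilson_interval(passes, trials)
    return {
        "pass_rate": passes / trials,
        "ci_low": low,
        "ci_high": high,
        "trials": trials,
    }


def make_fake_model(seed: int, accuracy: float = 0.9) -> Callable[[str], str]:
    """Hypothetical stand-in for an LLM: right answer with probability `accuracy`."""
    rng = random.Random(seed)

    def fake_model(prompt: str) -> str:
        return "4" if rng.random() < accuracy else "5"

    return fake_model


# One toy case; a real benchmark would hold many prompts and richer graders.
cases = [("What is 2 + 2?", lambda out: out.strip() == "4")]
report = run_benchmark(make_fake_model(seed=0), cases, samples_per_case=50)
print(report)
```

One way to use a report like this is to gate a release on the lower bound of the interval (for example, requiring `ci_low` to stay above a chosen threshold) and to run it on a schedule rather than on every commit, since each sample is a slow, billed call.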
Note: Continuously improving from production data—instrumentation, human/auto grading, error triage, and iterative prompt/model updates—is its own discipline: the evals flywheel. That's food for a separate discussion.