The Impossible Testing Problem for AI Applications

by Dylan Huang on August 26, 2025

The shift: from unit tests to benchmarks in the LLM era

Traditional testing assumed deterministic, fast code. You could run thousands of unit tests in seconds because each test took microseconds to milliseconds and had a single correct answer.

  • Old world:
    • 10,000 unit tests x 1 ms ≈ 10 seconds → easy to run on every commit.
    • End-to-end tests were slower and ran infrequently.

LLM-powered apps flip this: calls are slow (network plus model inference), stochastic, and billed per call. They're also your core business logic.

  • New world:
    • 1,000 LLM tests x 5-10 s ≈ 1.5-3 hours per run.
    • 10,000 tests ≈ 14-28 hours, plus real API cost.
    • Running that on every change is impossible (see the back-of-envelope sketch below).
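
A back-of-envelope sketch of the arithmetic above, assuming strictly sequential runs; the per-test latencies and the ~$0.01-per-call price are illustrative assumptions, not measurements:

```python
# Back-of-envelope: why full LLM test suites can't run on every commit.
# Latency and cost figures below are illustrative assumptions, not benchmarks.

def suite_estimate(num_tests: int, seconds_per_test: float, dollars_per_test: float = 0.0):
    """Estimate wall-clock hours and API cost for a sequential run."""
    total_seconds = num_tests * seconds_per_test
    return total_seconds / 3600, num_tests * dollars_per_test

# Old world: 10,000 unit tests at ~1 ms each -> ~10 seconds total.
hours, _ = suite_estimate(10_000, 0.001)
print(f"10,000 unit tests: {hours * 3600:>6.0f} s")

# New world: LLM tests at 5-10 s each, assumed ~$0.01 per call.
for num_tests in (1_000, 10_000):
    for sec in (5, 10):
        hours, cost = suite_estimate(num_tests, sec, dollars_per_test=0.01)
        print(f"{num_tests:>6,} LLM tests @ {sec:>2} s: {hours:5.1f} h, ~${cost:.0f}")
```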

The strategy shifts from exhaustive unit tests to benchmarks (a minimal harness is sketched after this list):

  • Curate fixed eval sets, score with clear metrics, track a baseline.
  • Run tiny “smoke” evals (5-20 cases) on every change.
  • Run full benchmarks nightly/weekly; gate major releases on deltas.
  • Monitor in production and refresh the benchmark set regularly.
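
A minimal sketch of that workflow, assuming a hypothetical call_model() client, a fixed JSON eval set, exact-match scoring, and an arbitrary 10-point regression threshold; the file paths and names are placeholders:

```python
# Minimal smoke-eval harness: run a small fixed eval set on every change,
# score it, and gate on the delta against a stored baseline.
# call_model() and the file paths are assumptions; swap in your own client.
import json
import pathlib

def call_model(prompt: str) -> str:
    """Placeholder for your LLM call (API client, retries, etc.)."""
    raise NotImplementedError

def exact_match(expected: str, actual: str) -> bool:
    return expected.strip().lower() == actual.strip().lower()

def run_eval(cases: list[dict]) -> float:
    """Return the pass rate over a list of {'prompt', 'expected'} cases."""
    passed = sum(exact_match(c["expected"], call_model(c["prompt"])) for c in cases)
    return passed / len(cases)

if __name__ == "__main__":
    cases = json.loads(pathlib.Path("evals/smoke.json").read_text())       # 5-20 cases
    baseline = json.loads(pathlib.Path("evals/baseline.json").read_text())["pass_rate"]

    score = run_eval(cases)
    print(f"smoke eval: {score:.0%} (baseline {baseline:.0%})")

    # Gate the change: fail CI if the smoke score drops more than 10 points.
    assert score >= baseline - 0.10, "smoke eval regressed past threshold"
```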

Bottom line: Deterministic unit tests don't scale to LLMs. Benchmarks and periodic, statistical evaluation are the new way to ensure quality without freezing velocity.
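
One way to make "statistical" concrete: treat two benchmark runs as pass/fail samples and check whether the delta exceeds sampling noise before blocking a release. This is a sketch using a two-proportion z-test, not a method prescribed by the post; the counts are made up:

```python
# Sketch: decide whether a benchmark delta is a real regression or noise,
# using a two-sided two-proportion z-test. Counts below are illustrative.
from math import sqrt, erf

def delta_is_significant(pass_a: int, n_a: int, pass_b: int, n_b: int,
                         alpha: float = 0.05) -> bool:
    """True if the pass rates of two benchmark runs differ beyond noise."""
    p_a, p_b = pass_a / n_a, pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return False
    z = (p_a - p_b) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # normal CDF via erf
    return p_value < alpha

# Example: baseline 870/1000 vs. candidate 845/1000.
# Prints False: a 2.5-point drop over 1,000 cases is within noise at alpha=0.05.
print(delta_is_significant(870, 1000, 845, 1000))
```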

Note: Continuously improving from production data (instrumentation, human/auto grading, error triage, and iterative prompt/model updates) is its own discipline: the evals flywheel. That's a topic for a separate discussion.