The Untold System Design Problem in LLM Inference

by Dylan Huang on November 7, 2025

From fixed-shape compute to variable-shape, memory-coupled inference

For years, production ML looked like images or fixed-duration signals—tight, predictable tensors. You sized for FLOPs, picked a batch size, and rode the throughput curve.

LLMs changed the physics: inputs and outputs vary; attention cost grows with sequence length; decoding accumulates a KV cache that ties compute to memory capacity and bandwidth. Prefill and decode are different workloads: prefill pushes the whole prompt through in parallel and is compute-bound, while decode emits one token at a time and leans on memory bandwidth. Add MoE routing and quantization, and the hot path keeps moving.
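
To make the memory coupling concrete, here is a back-of-the-envelope sketch of KV cache footprint. The model shape (roughly Llama-3-8B-like: 32 layers, 8 KV heads, head dim 128, fp16) is an illustrative assumption, not a measurement.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, dtype_bytes: int = 2) -> int:
    """Rough KV cache footprint: a K and a V tensor per layer,
    each of shape [batch, num_kv_heads, seq_len, head_dim]."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Illustrative shape, assumed for the example (roughly an 8B model with GQA):
per_request = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                             seq_len=8192, batch=1)
print(f"~{per_request / 2**30:.1f} GiB of KV cache for one 8k-token request")  # ~1.0 GiB
```

At that rate, a few dozen concurrent long-context requests occupy tens of gigabytes before a single weight is counted, which is exactly the compute-to-memory coupling in play here.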

  • Old world:

    • Fixed shapes, stable FLOPs, single dominant bottleneck (compute)
    • Larger batches → higher throughput (until the occasional, obvious memory cap)
    • One-time kernel tuning; simple queueing (arrival rate, service time)
  • New world:

    • Variable prompt and response lengths; prefill vs. decode have distinct profiles
    • Compute- and memory-bound at once; KV cache growth shifts the limit over time (see the roofline sketch after this list)
    • Dynamic batching/padding/truncation trade latency for throughput on the fly
    • State management: cache residency/eviction, cross-request reuse
    • MoE: per-token expert routing skews load; hotspots drift; cold experts thrash
    • Quantization: 2-, 3-, 4-, and 8-bit formats trade off speed, quality, and kernel availability
    • Many first-order variables: length distributions, concurrency, cache hit rate, expert sparsity, scheduler policy
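
The compute-vs.-memory point above can be sanity-checked with a crude roofline estimate. The sketch below assumes a dense transformer at roughly 2 FLOPs per parameter per token with fp16 weights, and ignores KV reads and attention FLOPs; the numbers are illustrative only.

```python
def arithmetic_intensity(tokens_per_weight_pass: int, bytes_per_param: int = 2) -> float:
    """Rough FLOPs-per-byte for a dense transformer forward pass:
    ~2 FLOPs per parameter per token, amortized over one read of the weights."""
    return 2 * tokens_per_weight_pass / bytes_per_param

# Prefill streams the whole prompt through the weights at once; small-batch
# decode moves the weights for only a handful of single-token steps.
print("prefill, 2048-token prompt:", arithmetic_intensity(2048), "FLOPs/byte")  # 2048.0
print("decode, batch of 8:        ", arithmetic_intensity(8), "FLOPs/byte")     # 8.0

# Current accelerators deliver a few hundred FLOPs per byte of bandwidth, so
# prefill lands compute-bound while small-batch decode is bandwidth-bound, and
# growing KV reads push decode further toward the bandwidth wall over time.
```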

The knobs move with the workload, not just the hardware. Kernel block sizes, KV cache configs, sequence/tensor parallelism, and schedulers (batching windows, priority, admission control) must adapt to what’s actually arriving.

Optimization is also use-case dependent: interactive chatbots target low time-to-first-token (TTFT) and p50 latency; large async/batch jobs target maximum throughput and cost efficiency. That means different batching windows, different prioritization of prefill vs. decode, and distinct quantization/parallelism choices per tier.
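
One way to make those per-tier choices explicit is a small config object per traffic class. This is a hypothetical sketch: the field names and values are not tied to any particular serving framework, and a real deployment would derive them from measured SLOs.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TierConfig:
    # Illustrative knobs, not a real framework's API.
    max_batch_wait_ms: int          # how long the batcher may hold a request
    prefer_prefill: bool            # favor TTFT (prefill) or throughput (decode)
    quantization: str               # must match kernels actually available
    tensor_parallel: int
    p95_ttft_slo_ms: Optional[int]  # interactive SLO; None for offline tiers

INTERACTIVE = TierConfig(max_batch_wait_ms=5, prefer_prefill=True,
                         quantization="fp8", tensor_parallel=2, p95_ttft_slo_ms=500)
OFFLINE_BATCH = TierConfig(max_batch_wait_ms=250, prefer_prefill=False,
                           quantization="int4", tensor_parallel=1, p95_ttft_slo_ms=None)
```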

What to do:

  • Instrument length and cache-hit distributions; set SLOs by percentiles
  • Separate prefill from decode, or at minimum schedule them phase-aware
  • Use KV caches with explicit memory budgets and eviction policies
  • Dynamic batching with latency guardrails and backpressure/admission control (sketched after this list)
  • Route MoE and multi-model traffic by tier; co-locate hot experts; manage cold starts
  • Choose quantization per tier; align with available kernels and target SLOs
  • Re-tune continuously; treat inference as a control system, not a fixed config
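
As a sketch of the dynamic-batching item above: a minimal batching window with a latency guardrail and queue-depth admission control. All thresholds and the queue interface are assumptions chosen for illustration.

```python
import time
from collections import deque

MAX_BATCH = 32         # illustrative cap on requests per batch
MAX_WAIT_S = 0.008     # latency guardrail: never hold the oldest request longer
MAX_QUEUE_DEPTH = 512  # admission control: shed or reroute load beyond this

queue: deque = deque()  # items are (enqueue_time, request)

def admit(request) -> bool:
    """Backpressure: reject (or divert to an offline tier) when saturated."""
    if len(queue) >= MAX_QUEUE_DEPTH:
        return False
    queue.append((time.monotonic(), request))
    return True

def next_batch() -> list:
    """Close the batching window when it is full or the guardrail is about to trip."""
    if not queue:
        return []
    oldest_wait = time.monotonic() - queue[0][0]
    if len(queue) < MAX_BATCH and oldest_wait < MAX_WAIT_S:
        return []  # window still open: keep accumulating requests
    return [queue.popleft()[1] for _ in range(min(MAX_BATCH, len(queue)))]
```

In a real scheduler, rejected requests would flow to an offline tier rather than being dropped, and the window and caps would themselves be re-tuned as length and concurrency distributions drift, per the control-system point above.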

Bottom line: LLM inference breaks simple queueing. Optimal performance is workload-dependent and multi-dimensional—compute, memory, state, routing, and bits. Measure, adapt, and schedule accordingly.