The Difference Between Latency and Throughput and Why Both Matter
by Eric Hanson, Backend Developer at Clean Systems Consulting
The Performance Conversation That Goes Wrong
"We need to improve performance." This starts a conversation that usually ends in one of two places: either optimizing response time for individual requests, or trying to handle more requests per second. These are different problems. Sometimes you have one. Sometimes you have both. Very often, the team treats them as the same problem and optimizes for the wrong one.
Latency is how long a single operation takes. Throughput is how many operations the system can complete per unit of time. A system can have low latency and low throughput, high latency and high throughput, or any other combination. Understanding which one is your actual constraint is the prerequisite for any useful performance work.
What Latency Is and How to Measure It
Latency is the time from when a request is initiated to when the response is received. It is measured per request and typically reported as a distribution, not a single number.
P50 (the median) tells you what a typical request looks like. P95 and P99 tell you what the slowest 5% and 1% of requests experience. P99.9 tells you about the worst 1-in-1,000 request. High-percentile latencies are what users complain about — they correspond to the requests that felt "hung" or "broken."
A common mistake is measuring and reporting only average latency. Averages hide the distribution. An API with a p50 of 50ms and a p99 of 4,000ms has a healthy-looking average but a terrible tail. If 1% of requests take 4 seconds, that's a significant user experience problem — and it's invisible in the average.
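To make that concrete, here is a minimal, self-contained sketch (synthetic latencies and nearest-rank percentiles; all numbers are illustrative): two slow requests out of a hundred barely move the mean but dominate the p99.

import java.util.Arrays;

public class TailLatencyDemo {
    // Nearest-rank percentile over a sorted sample.
    static double percentile(double[] sortedMs, double p) {
        int idx = (int) Math.ceil(p * sortedMs.length) - 1;
        return sortedMs[Math.max(idx, 0)];
    }

    public static void main(String[] args) {
        double[] latenciesMs = new double[100];
        Arrays.fill(latenciesMs, 0, 98, 50.0); // 98 requests at 50ms
        latenciesMs[98] = 4_000.0;             // 2 requests at 4 seconds
        latenciesMs[99] = 4_000.0;
        Arrays.sort(latenciesMs);

        double mean = Arrays.stream(latenciesMs).average().orElse(0);
        System.out.printf("mean=%.0fms p50=%.0fms p99=%.0fms%n",
                mean, percentile(latenciesMs, 0.50), percentile(latenciesMs, 0.99));
        // prints: mean=129ms p50=50ms p99=4000ms
    }
}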
// Useful latency measurement with Micrometer (Java), exported to Prometheus:
import io.micrometer.core.instrument.Timer;

Timer requestTimer = Timer.builder("http.request.duration")
        .description("HTTP request latency")
        .tags("endpoint", endpoint, "method", method)  // endpoint/method supplied by the handler
        .publishPercentiles(0.5, 0.95, 0.99, 0.999)    // client-side p50/p95/p99/p99.9
        .publishPercentileHistogram()                  // histogram buckets for server-side quantiles
        .register(meterRegistry);
What Throughput Is and How It's Limited
Throughput is the rate at which the system completes work — requests per second, transactions per second, events processed per second. It is limited by the capacity of the system's bottleneck resource.
Little's Law formalizes the relationship: Throughput = Concurrency / Latency. If your average latency is 100ms and you have 10 concurrent request slots (threads), your maximum throughput is 10 / 0.1 = 100 requests/second. To double throughput, you either halve latency or double concurrency.
This has a direct implication: reducing latency increases throughput without adding resources. Adding resources (more threads, more instances) increases throughput without reducing latency.
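As a worked example of that arithmetic (the class and method names here are illustrative, not a standard API):

public class LittlesLaw {
    // Throughput = Concurrency / Latency
    static double maxThroughputPerSecond(int concurrentSlots, double avgLatencySeconds) {
        return concurrentSlots / avgLatencySeconds;
    }

    public static void main(String[] args) {
        System.out.println(maxThroughputPerSecond(10, 0.100)); // 100.0 req/s baseline
        System.out.println(maxThroughputPerSecond(10, 0.050)); // 200.0 req/s: latency halved
        System.out.println(maxThroughputPerSecond(20, 0.100)); // 200.0 req/s: concurrency doubled
    }
}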
The Tension Between Them
Optimizations that improve throughput often worsen latency, and vice versa:
Batching dramatically increases throughput — processing 100 items in a batch costs far less than 100 individual operations. But each item in the batch waits for the batch to fill before being processed. Latency per item increases. For background jobs and analytics workloads, this tradeoff is excellent. For interactive user requests, introducing a 500ms wait to batch 50 requests is usually unacceptable.
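A common mitigation is to bound the wait: flush the batch when it is full or when a deadline passes, whichever comes first. A minimal sketch, with the Batcher class and its parameters invented for illustration:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

class Batcher<T> {
    private final BlockingQueue<T> queue = new LinkedBlockingQueue<>();

    void submit(T item) { queue.add(item); }

    // Drain up to batchSize items, but never wait longer than maxWaitMs for
    // the batch to fill -- the maxWait bound is what caps per-item latency.
    List<T> nextBatch(int batchSize, long maxWaitMs) throws InterruptedException {
        List<T> batch = new ArrayList<>(batchSize);
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(maxWaitMs);
        while (batch.size() < batchSize) {
            long remaining = deadline - System.nanoTime();
            if (remaining <= 0) break;              // time is up: ship a partial batch
            T item = queue.poll(remaining, TimeUnit.NANOSECONDS);
            if (item == null) break;                // timed out waiting for more items
            batch.add(item);
        }
        return batch;
    }
}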
Connection pooling allows many requests to share a limited number of database connections, increasing throughput. But when the pool is at capacity, new requests queue for a connection. For users whose request arrives when the pool is exhausted, perceived latency increases significantly.
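For example, with HikariCP the pool size caps concurrency at the database, and the connection timeout bounds how long a request queues when the pool is exhausted (the URL and numbers below are illustrative):

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

HikariConfig config = new HikariConfig();
config.setJdbcUrl("jdbc:postgresql://localhost:5432/app"); // illustrative target
config.setMaximumPoolSize(10);       // concurrency ceiling at the database
config.setConnectionTimeout(2_000);  // ms a borrower waits for a connection before failing
HikariDataSource dataSource = new HikariDataSource(config);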
Asynchronous processing maximizes throughput by decoupling producers from consumers. But the individual operation no longer has a synchronous response time — the latency from "user submitted request" to "result visible" may now be seconds or minutes rather than milliseconds.
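A minimal sketch of the decoupling, with Job and process() as illustrative stand-ins for real work:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class AsyncSubmitHandler {
    private final ExecutorService workers = Executors.newFixedThreadPool(4);

    // The caller gets an immediate acknowledgement; end-to-end latency to
    // "result visible" now includes time the job spends queued.
    String handleSubmit(Job job) {
        workers.submit(() -> process(job)); // enqueue; do not wait for completion
        return "accepted";                  // 202-style response to the caller
    }

    private void process(Job job) { /* the actual work */ }

    record Job(String payload) {}           // illustrative payload type
}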
When Each Metric Is the Right Target
Optimize latency when: Your system serves interactive user requests where response time affects user experience or conversion. E-commerce checkout flows, real-time dashboards, API endpoints where human users are waiting. Here, p99 latency is the metric that matters.
Optimize throughput when: Your system processes jobs, events, or data in the background where the individual operation latency is less important than total work done per unit of time. ETL pipelines, batch processing, event consumers. Here, events processed per second or jobs completed per hour is the metric.
Many systems need both — interactive endpoints with low latency and background jobs with high throughput. These should be measured and optimized separately. Running a high-throughput batch job on the same thread pool as your interactive API will sacrifice latency for throughput on user-facing requests.
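A simple way to keep them separate is a bulkhead: a dedicated pool per workload, each sized for a different goal (sizes below are illustrative):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Batch work cannot queue behind, or starve, interactive requests.
ExecutorService interactivePool = Executors.newFixedThreadPool(32); // sized for p99 latency
ExecutorService batchPool = Executors.newFixedThreadPool(4);        // sized for throughput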
The Practical Measurement Setup
Before optimizing anything:
- Measure p50, p95, p99 latency for your key endpoints under realistic load
- Measure current throughput (requests/second) and identify the resource ceiling
- Determine which metric is your actual problem — users complaining about slowness (latency), or system capacity limits preventing growth (throughput)
- Choose optimization techniques appropriate to the problem
Load testing with k6 or Gatling will reveal both simultaneously: run a ramp-up test that increases virtual users over time. Latency at low concurrency tells you your baseline single-request performance. Throughput at the point where latency starts degrading tells you your practical capacity ceiling.
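A sketch of such a ramp-up test using Gatling's Java DSL (the target URL, endpoint, and numbers are illustrative):

import static io.gatling.javaapi.core.CoreDsl.*;
import static io.gatling.javaapi.http.HttpDsl.*;

import io.gatling.javaapi.core.*;
import io.gatling.javaapi.http.*;
import java.time.Duration;

public class RampUpSimulation extends Simulation {
    HttpProtocolBuilder httpProtocol = http.baseUrl("http://localhost:8080");

    // Ramp virtual users over ten minutes; watch where latency starts degrading.
    ScenarioBuilder scn = scenario("ramp")
            .exec(http("checkout").get("/api/checkout"));

    {
        setUp(scn.injectOpen(rampUsers(500).during(Duration.ofMinutes(10))))
                .protocols(httpProtocol);
    }
}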
The Practical Takeaway
In your next performance discussion, ask the team to state whether the complaint is about latency (individual requests feel slow) or throughput (the system can't handle the load). These are different problems. If latency is the issue, trace the slowest requests and find the bottleneck in the critical path. If throughput is the issue, find the resource that saturates first and either expand it or reduce per-request consumption of it. Don't optimize throughput when latency is the problem, or vice versa.