The Difference Between Latency and Throughput and Why Both Matter

by Eric Hanson, Backend Developer at Clean Systems Consulting

The Performance Conversation That Goes Wrong

"We need to improve performance." This starts a conversation that usually ends in one of two places: either optimizing response time for individual requests, or trying to handle more requests per second. These are different problems. Sometimes you have one. Sometimes you have both. Very often, the team treats them as the same problem and optimizes for the wrong one.

Latency is how long a single operation takes. Throughput is how many operations the system can complete per unit of time. A system can have low latency and low throughput, high latency and high throughput, or any other combination. Understanding which one is your actual constraint is the prerequisite for any useful performance work.

What Latency Is and How to Measure It

Latency is the time from when a request is initiated to when the response is received. It is measured per request and typically reported as a distribution, not a single number.

P50 (median) tells you what a typical request looks like. P95 and P99 tell you what the slowest 5% and 1% of requests experience, and P99.9 tells you about the worst 1-in-1,000 request. High-percentile latencies are what users complain about: they correspond to the requests that felt "hung" or "broken."

A common mistake is measuring and reporting only average latency. Averages hide the distribution. An API with a p50 of 50ms and a p99 of 4,000ms has a healthy average but a terrible tail latency. If 1% of your users wait 4 seconds on every request, that's a significant user experience problem — and it's invisible in the average.
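A minimal illustration of this with synthetic numbers (the class name and the 2% slow-request share are invented for the example; nearest-rank percentiles are used):

```java
import java.util.Arrays;

public class TailLatency {
    // Nearest-rank percentile over a sorted array of latency samples.
    static long percentile(long[] sortedMillis, double p) {
        int idx = (int) Math.ceil(p * sortedMillis.length) - 1;
        return sortedMillis[Math.max(idx, 0)];
    }

    public static void main(String[] args) {
        // Synthetic sample: 980 requests at 50ms, 20 slow requests at 4000ms.
        long[] latencies = new long[1000];
        Arrays.fill(latencies, 50);
        Arrays.fill(latencies, 980, 1000, 4000);
        Arrays.sort(latencies);

        double avg = Arrays.stream(latencies).average().orElse(0);
        System.out.printf("avg=%.0fms p50=%dms p99=%dms%n",
                avg, percentile(latencies, 0.50), percentile(latencies, 0.99));
        // avg=129ms p50=50ms p99=4000ms
    }
}
```

The ~129ms average looks perfectly healthy; only the p99 reveals that a slice of users is waiting 4 seconds.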

// Useful latency measurement with Micrometer (Java), exported to Prometheus:
Timer.builder("http.request.duration")
    .description("HTTP request latency")
    .tags("endpoint", endpoint, "method", method)
    .publishPercentiles(0.5, 0.95, 0.99, 0.999)  // client-side percentile computation
    .publishPercentileHistogram()                // histogram buckets for server-side aggregation
    .register(meterRegistry);

What Throughput Is and How It's Limited

Throughput is the rate at which the system completes work — requests per second, transactions per second, events processed per second. It is limited by the capacity of the system's bottleneck resource.

Little's Law formalizes the relationship: Throughput = Concurrency / Latency. If your average latency is 100ms and you have 10 concurrent request slots (threads), your maximum throughput is 10 / 0.1 = 100 requests/second. To double throughput, you either halve latency or double concurrency.

This has a direct implication: reducing latency increases throughput without adding resources. Adding resources (more threads, more instances) increases throughput without reducing latency.
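The arithmetic from the text, written out as code (class and method names are illustrative):

```java
// Little's Law rearranged: throughput = concurrency / latency.
public class LittlesLaw {
    static double maxThroughputPerSecond(int concurrency, double avgLatencySeconds) {
        return concurrency / avgLatencySeconds;
    }

    public static void main(String[] args) {
        // 10 threads at 100ms average latency -> 100 req/s ceiling
        System.out.println(maxThroughputPerSecond(10, 0.1));
        // Doubling concurrency doubles the ceiling -> 200 req/s
        System.out.println(maxThroughputPerSecond(20, 0.1));
        // Halving latency does the same without adding resources -> 200 req/s
        System.out.println(maxThroughputPerSecond(10, 0.05));
    }
}
```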

The Tension Between Them

Optimizations that improve throughput often worsen latency, and vice versa:

Batching dramatically increases throughput — processing 100 items in a batch costs far less than 100 individual operations. But each item in the batch waits for the batch to fill before being processed. Latency per item increases. For background jobs and analytics workloads, this tradeoff is excellent. For interactive user requests, introducing a 500ms wait to batch 50 requests is usually unacceptable.
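A back-of-the-envelope sketch of the batching cost, using the numbers from the paragraph above (helper names are invented; it assumes items arrive at a steady rate):

```java
public class BatchingTradeoff {
    // Time for a batch to fill, given a steady arrival rate.
    static double batchFillMillis(int batchSize, double arrivalsPerSecond) {
        return batchSize / arrivalsPerSecond * 1000;
    }

    // With uniform arrivals, the average item waits half the fill time.
    static double avgAddedWaitMillis(int batchSize, double arrivalsPerSecond) {
        return batchFillMillis(batchSize, arrivalsPerSecond) / 2;
    }

    public static void main(String[] args) {
        // 50-item batches at 100 requests/second:
        System.out.println(batchFillMillis(50, 100));    // 500.0 ms for the batch to fill
        System.out.println(avgAddedWaitMillis(50, 100)); // 250.0 ms of added latency on average
    }
}
```

Every millisecond of fill time is pure added latency for the items already sitting in the batch, which is why this tradeoff only pays off when nobody is waiting on an individual item.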

Connection pooling allows many requests to share a limited number of database connections, increasing throughput. But when the pool is at capacity, new requests queue for a connection. For users whose request arrives when the pool is exhausted, perceived latency increases significantly.
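This queuing effect can be simulated with a Semaphore standing in for the connection pool (all names and numbers are illustrative): with 2 "connections" and 4 concurrent requests, two requests queue and their perceived latency roughly doubles even though the query itself is no slower.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;

public class PoolQueueing {
    // Returns the sorted per-request latencies in milliseconds.
    static long[] run(int poolSize, int clients, long queryMillis) throws Exception {
        Semaphore pool = new Semaphore(poolSize);
        ExecutorService requests = Executors.newFixedThreadPool(clients);
        List<Future<Long>> results = new ArrayList<>();
        for (int i = 0; i < clients; i++) {
            results.add(requests.submit(() -> {
                long start = System.nanoTime();
                pool.acquire();                  // queues here when the pool is exhausted
                Thread.sleep(queryMillis);       // simulated database query
                pool.release();
                return (System.nanoTime() - start) / 1_000_000;
            }));
        }
        long[] latencies = new long[clients];
        for (int i = 0; i < clients; i++) latencies[i] = results.get(i).get();
        requests.shutdown();
        Arrays.sort(latencies);
        return latencies;
    }

    public static void main(String[] args) throws Exception {
        // The fastest requests finish near 100ms; the queued ones near 200ms.
        System.out.println(Arrays.toString(run(2, 4, 100)));
    }
}
```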

Asynchronous processing maximizes throughput by decoupling producers from consumers. But the individual operation no longer has a synchronous response time — the latency from "user submitted request" to "result visible" may now be seconds or minutes rather than milliseconds.
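A sketch of that latency split (class, method, and delay values are all illustrative): the producer "responds" as soon as the work is handed off, but the result only becomes visible once a busy consumer gets around to it.

```java
import java.util.concurrent.CompletableFuture;

public class AsyncDecoupling {
    // Returns {response latency, end-to-end latency} in milliseconds.
    static long[] submitAndWait(long consumerBusyMillis) throws Exception {
        CompletableFuture<String> result = new CompletableFuture<>();
        long submitted = System.nanoTime();

        Thread consumer = new Thread(() -> {
            try {
                Thread.sleep(consumerBusyMillis);   // consumer busy with earlier work
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            result.complete("done");
        });
        consumer.start();

        // The "accepted" response goes out as soon as the work is handed off.
        long responseMillis = (System.nanoTime() - submitted) / 1_000_000;

        result.get();                                // result visible only now
        long endToEndMillis = (System.nanoTime() - submitted) / 1_000_000;
        return new long[]{responseMillis, endToEndMillis};
    }

    public static void main(String[] args) throws Exception {
        long[] t = submitAndWait(200);
        System.out.println("response ~" + t[0] + "ms, end-to-end ~" + t[1] + "ms");
    }
}
```

Response latency stays near zero regardless of how backed up the consumer is; end-to-end latency is entirely at the consumer's mercy. Both numbers matter, but they are now different numbers.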

When Each Metric Is the Right Target

Optimize latency when: Your system serves interactive user requests where response time affects user experience or conversion. E-commerce checkout flows, real-time dashboards, API endpoints where human users are waiting. Here, p99 latency is the metric that matters.

Optimize throughput when: Your system processes jobs, events, or data in the background where the individual operation latency is less important than total work done per unit of time. ETL pipelines, batch processing, event consumers. Here, events processed per second or jobs completed per hour is the metric.

Many systems need both — interactive endpoints with low latency and background jobs with high throughput. These should be measured and optimized separately. Running a high-throughput batch job on the same thread pool as your interactive API will sacrifice latency for throughput on user-facing requests.
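A minimal sketch of that separation (pool names and sizes are illustrative, not recommendations): each workload gets its own executor, so a long batch run cannot starve user-facing requests of threads.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class WorkloadIsolation {
    // Sized for many short, latency-sensitive requests.
    static final ExecutorService interactivePool = Executors.newFixedThreadPool(16);
    // Deliberately small so batch throughput work cannot saturate the host.
    static final ExecutorService batchPool = Executors.newFixedThreadPool(2);

    public static void main(String[] args) throws Exception {
        batchPool.submit(() -> { /* long-running ETL step */ });
        Future<String> userRequest = interactivePool.submit(() -> "fast response");
        System.out.println(userRequest.get()); // unaffected by the batch queue depth
        interactivePool.shutdown();
        batchPool.shutdown();
    }
}
```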

The Practical Measurement Setup

Before optimizing anything:

  1. Measure p50, p95, p99 latency for your key endpoints under realistic load
  2. Measure current throughput (requests/second) and identify the resource ceiling
  3. Determine which metric is your actual problem — users complaining about slowness (latency), or system capacity limits preventing growth (throughput)
  4. Choose optimization techniques appropriate to the problem

Load testing with k6 or Gatling will reveal both simultaneously: run a ramp-up test that increases virtual users over time. Latency at low concurrency tells you your baseline single-request performance. Throughput at the point where latency starts degrading tells you your practical capacity ceiling.

The Practical Takeaway

In your next performance discussion, ask the team to state whether the complaint is about latency (individual requests feel slow) or throughput (the system can't handle the load). These are different problems. If latency is the issue, trace the slowest requests and find the bottleneck in the critical path. If throughput is the issue, find the resource that saturates first and either expand it or reduce per-request consumption of it. Don't optimize throughput when latency is the problem, or vice versa.

Scale Your Backend - Need an Experienced Backend Developer?

We provide backend engineers who join your team as contractors to help build, improve, and scale your backend systems.

We focus on clean backend design, clear documentation, and systems that remain reliable as products grow. Our goal is to strengthen your team and deliver backend systems that are easy to operate and maintain.

We work from our own development environments and support teams across US, EU, and APAC timezones. Our workflow emphasizes documentation and asynchronous collaboration to keep development efficient and focused.

  • Production Backend Experience. Experience building and maintaining backend systems, APIs, and databases used in production.
  • Scalable Architecture. Design backend systems that stay reliable as your product and traffic grow.
  • Contractor Friendly. Flexible engagement for short projects, long-term support, or extra help during releases.
  • Focus on Backend Reliability. Improve API performance, database stability, and overall backend reliability.
  • Documentation-Driven Development. Development guided by clear documentation so teams stay aligned and work efficiently.
  • Domain-Driven Design. Design backend systems around real business processes and product needs.

