Latency vs Throughput

September 15, 2025

What they actually mean

  • Latency: end-to-end time to serve a single request. It includes network, queuing, locks, disk, CPU, everything.
  • Throughput: how many requests/tasks you finish per unit time (RPS, jobs/s, MB/s).

They are correlated but not the same. You can raise throughput by batching/queuing and still worsen tail latency. Conversely, you can slash latency by doing less work per request, which may reduce overall throughput.

Mental model: queuing theory in two bullets

  • As utilization approaches 100%, queueing delay explodes. Keep steady-state utilization around 60–75% if you care about predictable latency.
  • Batching, parallelism and pipelining increase throughput but add wait time unless carefully bounded.

Useful formulas

  • Little’s Law: L = λ × W. Average items in system (L) equals arrival rate (λ) times average time in system (W). If you know any two, you can infer the third.
  • Kingman’s formula (intuition): variability in arrivals/service times amplifies wait times non‑linearly. Reducing variance (e.g., via smoothing/batching windows) often helps more than adding raw capacity.
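
To make both concrete, here is a tiny worked example with made-up numbers (the arrival rate, service time, and utilization below are illustrative, not measurements from a real system):

// language-javascript
// Little's Law: at 400 req/s arriving and 50 ms average time in system,
// about L = 400 * 0.05 = 20 requests are in flight on average.
const lambda = 400   // arrival rate, req/s (assumed)
const W = 0.05       // average time in system, seconds (assumed)
const L = lambda * W // ≈ 20 requests in the system

// Kingman's approximation for mean wait in queue at a single server:
//   Wq ≈ (rho / (1 - rho)) * ((ca2 + cs2) / 2) * serviceTime
// Wait scales with variability (ca2 + cs2) and blows up as rho approaches 1.
function kingmanWait(rho, ca2, cs2, serviceTime) {
  return (rho / (1 - rho)) * ((ca2 + cs2) / 2) * serviceTime
}

console.log(L)                              // 20
console.log(kingmanWait(0.75, 1, 1, 0.02))  // 0.06s wait at 75% utilization
console.log(kingmanWait(0.9, 1, 1, 0.02))   // 0.18s at 90% — same capacity, 3x the wait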

Optimizing for latency (user experience first)

  • Cut round trips: coalesce requests, multiplex, keep-alives, HTTP/3.
  • Put data closer: CDN, edge caches, application caches; precompute hot views.
  • Shorten the critical path: move non-essential work off the sync path (message queues, outboxes).
  • Make I/O cheap: right indexes, avoid N+1, narrow selects, compress smartly.
  • Control contention: reduce lock scope, shard hotspots, apply load-shedding and per-endpoint budgets.

Concrete example (API read path)

  1. Add an application cache with a 60s TTL and request coalescing (single flight); a sketch follows this list.
  2. Replace the N+1 ORM pattern with a single SELECT ... WHERE id IN (...).
  3. Enable HTTP/3 and use preload or 103 Early Hints for critical assets (major browsers have dropped server push).
  4. Set per‑endpoint concurrency budgets (e.g., 64 in‑flight) with graceful shed when exceeded.
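
A minimal sketch of step 1, assuming a fetchFromDb(key) loader (the name is illustrative) that returns a promise:

// language-javascript
// 60s TTL cache with request coalescing ("single flight"):
// concurrent misses for the same key share one load instead of stampeding the DB.
const TTL_MS = 60_000
const cache = new Map()     // key -> { value, expiresAt }
const inFlight = new Map()  // key -> Promise for a load already in progress

export async function getCached(key, fetchFromDb) {
  const hit = cache.get(key)
  if (hit && hit.expiresAt > Date.now()) return hit.value

  if (inFlight.has(key)) return inFlight.get(key) // coalesce concurrent misses

  const load = fetchFromDb(key)
    .then((value) => {
      cache.set(key, { value, expiresAt: Date.now() + TTL_MS })
      return value
    })
    .finally(() => inFlight.delete(key))

  inFlight.set(key, load)
  return load
}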

Optimizing for throughput (capacity and cost first)

  • Parallelize safely: worker pools, multi-process, vectorized ops, sharding.
  • Increase capacity: horizontal scale with load balancing; prefer stateless services.
  • Batch effectively: micro-batches with latency budgets; amortize expensive work (crypto, I/O).
  • Smooth bursts: message queues with backpressure; use DLQs and rate limits to protect dependencies.

Concrete example (background jobs)

  • Move image processing to a worker pool sized to CPU cores with a queue limit; batch small images into a single vectorized operation.
  • Use gzip/zstd with a dictionary to amortize CPU, and upload in 8–16MB chunks.
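
A rough sketch of that pool shape, assuming a hypothetical processImage(job); for truly CPU-bound work the processing itself belongs in worker_threads or child processes — this only bounds in-flight jobs and queue depth:

// language-javascript
import os from 'node:os'

const MAX_WORKERS = os.cpus().length // size the pool to CPU cores
const MAX_QUEUE = 1000               // bound the queue so bursts are shed, not buffered forever
const pending = []
let active = 0

export function submit(job) {
  if (pending.length >= MAX_QUEUE) {
    return Promise.reject(new Error('queue full, shedding load'))
  }
  return new Promise((resolve, reject) => {
    pending.push({ job, resolve, reject })
    pump()
  })
}

function pump() {
  while (active < MAX_WORKERS && pending.length > 0) {
    const { job, resolve, reject } = pending.shift()
    active++
    processImage(job) // assumed: resize/encode one image or a small batch of them
      .then(resolve, reject)
      .finally(() => { active--; pump() })
  }
}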

Measurement and SLOs

  • Track latency percentiles (p50/p95/p99). Averages hide pain; roughly 1 in 100 requests lands at or above p99, so busy users hit it constantly (a percentile sketch follows this list).
  • Track throughput (RPS/jobs/s) and queue depth, wait time, utilization.
  • Write dual SLOs: e.g. p95 < 200 ms and > 2k RPS sustained, plus an overload policy (shed, degrade, or queue with cap).
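
Percentiles are cheap to compute over a window of samples; a minimal nearest-rank sketch:

// language-javascript
// Nearest-rank percentile over a window of latency samples (milliseconds).
function percentile(samples, p) {
  if (samples.length === 0) return NaN
  const sorted = [...samples].sort((a, b) => a - b)
  const rank = Math.ceil((p / 100) * sorted.length) - 1
  return sorted[Math.max(0, rank)]
}

const latenciesMs = [12, 15, 14, 200, 13, 16, 18, 500, 14, 15] // example samples
console.log(percentile(latenciesMs, 50)) // p50 = 15
console.log(percentile(latenciesMs, 95)) // p95 = 500 — the outlier the average hides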

Observability checklist

  • RED metrics (Rate, Errors, Duration) per endpoint + per dependency.
  • Separate client‑observed latency from server‑processing time.
  • Trace spans for cache, DB, downstreams; tag with warm/cold cache and retry counts.
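
One way to split server-processing time from client-observed latency is a Server-Timing response header, which browsers surface in devtools and the Performance API; a minimal node:http sketch:

// language-javascript
import http from 'node:http'

http.createServer((req, res) => {
  const start = process.hrtime.bigint()
  const body = JSON.stringify({ ok: true }) // stand-in for real request handling
  const serverMs = Number(process.hrtime.bigint() - start) / 1e6

  // Client-observed latency minus this value ≈ network + queueing outside the server.
  res.setHeader('Server-Timing', `app;dur=${serverMs.toFixed(1)}`)
  res.setHeader('Content-Type', 'application/json')
  res.end(body)
}).listen(3000)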

Code: micro‑batching vs per‑request (Node.js)

// language-javascript
// Micro-batching: flush work every 20ms or when the batch reaches 50 items.
// doWork(batch) is the batched operation (e.g., one bulk DB write or one external call).
const queue = []
const BATCH_SIZE = 50
const FLUSH_MS = 20
let timer = null

function flush() {
  // Cancel any pending timer so the same batch is not flushed twice.
  if (timer) {
    clearTimeout(timer)
    timer = null
  }
  const batch = queue.splice(0, BATCH_SIZE)
  if (batch.length === 0) return
  // Do expensive work once for the batch (e.g., DB write, external call)
  return doWork(batch).catch(() => {
    // handle retry/backoff and DLQ if needed
  })
}

export function enqueue(item) {
  queue.push(item)
  if (queue.length >= BATCH_SIZE) {
    flush() // size trigger: flush immediately
  } else if (!timer) {
    timer = setTimeout(flush, FLUSH_MS) // time trigger: bounds added latency to FLUSH_MS
  }
}

Code: Nginx rate limiting with budgets

# language-nginx
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=20r/s;

server {
  location /api/ {
    limit_req zone=api_limit burst=40 nodelay; # absorb a burst of 40 without delay; beyond that, reject
    proxy_connect_timeout 2s;
    proxy_send_timeout 2s;
    proxy_read_timeout 2s;
    proxy_pass http://backend;
  }
}

Common traps

  • Confusing CPU time with latency. Wall time is what users feel.
  • Removing all queues. A thin queue with backpressure is often the safest way to protect downstreams.
  • Unlimited concurrency. Without budgets you create head-of-line blocking and timeout storms.
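
A per-endpoint budget can be as small as a counter around the handler: reject (or queue with a cap) once the in-flight limit is hit. A sketch with an assumed limit of 64:

// language-javascript
// Tiny in-flight budget: at most 64 concurrent requests per endpoint, shed the rest.
const MAX_IN_FLIGHT = 64
let inFlight = 0

export async function withBudget(handler, req, res) {
  if (inFlight >= MAX_IN_FLIGHT) {
    res.statusCode = 503 // graceful shed; clients should back off and retry
    res.setHeader('Retry-After', '1')
    return res.end('overloaded')
  }
  inFlight++
  try {
    return await handler(req, res)
  } finally {
    inFlight--
  }
}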

Case study: checkout page

  • Before: 7 network round trips, cold DB cache, unbounded retries to payment provider. p99 = 3.2s under peak.
  • After: coalesced product/price lookups, payment tokenization off path, bounded retries with jitter, edge cache for static bundles. p99 = 600ms at 2× throughput.
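
"Bounded retries with jitter" means a fixed retry budget plus randomized exponential backoff, so clients don't retry in lockstep; a sketch with illustrative attempt counts and delays:

// language-javascript
// Bounded retries with full jitter: at most 3 attempts, exponential cap on the delay,
// each sleep drawn uniformly from [0, cap] so retries spread out instead of storming.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

export async function retryWithJitter(fn, { attempts = 3, baseMs = 100, maxMs = 2000 } = {}) {
  let lastErr
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn()
    } catch (err) {
      lastErr = err
      if (i === attempts - 1) break // retry budget exhausted
      const cap = Math.min(maxMs, baseMs * 2 ** i)
      await sleep(Math.random() * cap) // full jitter
    }
  }
  throw lastErr // surface the failure instead of retrying forever
}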

Pragmatic guidance

For interactive paths, budget every step and guard tail latency. For batch/analytics paths, maximize throughput subject to a reasonable deadline. Design explicitly for the trade-off rather than hoping one metric “improves” the other.

Pre‑deployment checklist

  • Set per‑endpoint latency budgets and enforce with tests.
  • Define overload behavior (shed, queue with cap, degrade output).
  • Validate capacity with a ramped load test; watch p95/p99, queue depth, and error budgets.

Analogy

Latency is the time it takes one car to finish a lap. Throughput is how many cars finish per minute. You can send more cars onto the track (higher throughput) but risk traffic jams (higher latency), especially if the pit lane (dependencies) is slow.

FAQ

  • Why does my p99 go wild under load? Queues + variability. Cap concurrency and smooth bursts.
  • Should I always batch? Batch with latency budgets; small micro‑batches beat giant ones for interactive paths.

Try it (wrk)

# language-bash
wrk -t4 -c64 -d60s --latency https://your-service/