Why fault tolerance?
Systems fail in partial, weird ways: slow dependencies, packet loss, node crashes, power events, deploy bugs. Fault tolerance keeps the blast radius small and the UX acceptable under stress.
Principles
- Remove single points of failure (SPOF): N+1 redundancy, multi‑AZ/region.
- Isolate: bulkheads, separate thread/connection pools, concurrency limits.
- Timeouts everywhere, with sensible defaults per dependency.
- Retries with jitter and budgets (never infinite, never synchronized).
- Circuit breakers to fail fast and allow recovery.
- Graceful degradation and feature flags to turn off non‑essentials.
Timeouts and budgets
# language-nginx
# Keep upstream waits short; a slow dependency should fail fast, not pin worker connections.
proxy_connect_timeout 1s;
proxy_send_timeout 1500ms;
proxy_read_timeout 1500ms;
// language-typescript
// Cancel the request once the 1.2 s budget is exceeded; always clear the timer afterwards.
const controller = new AbortController()
const t = setTimeout(() => controller.abort(), 1200)
await fetch(url, { signal: controller.signal }).finally(() => clearTimeout(t))
Retries done right
// language-typescript
// Exponential backoff with jitter; gives up after `attempts` tries.
async function retry<T>(fn: () => Promise<T>, { attempts = 3, base = 100 } = {}): Promise<T> {
  for (let i = 0; i < attempts; i++) {
    try { return await fn() } catch (e) {
      if (i === attempts - 1) throw e // out of attempts: surface the error
      const delay = base * 2 ** i + Math.random() * 50 // backoff plus jitter to de-synchronize callers
      await new Promise(r => setTimeout(r, delay))
    }
  }
  throw new Error('unreachable') // for the type checker; the loop always returns or throws
}
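For illustration, a hedged usage sketch: fetchProfile below is a hypothetical, idempotent read, which is what makes retrying it safe.
// language-typescript
declare function fetchProfile(id: string): Promise<unknown> // hypothetical idempotent read
// Up to 3 attempts with a ~100 ms base backoff; a non-idempotent write would need an idempotency key first.
const profile = await retry(() => fetchProfile('user-123'), { attempts: 3, base: 100 })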
Circuit breaker (sketch)
// language-typescript
type BreakerState = 'CLOSED' | 'OPEN' | 'HALF'
class Breaker {
  state: BreakerState = 'CLOSED'; failures = 0; openedAt = 0
  constructor(public threshold = 5, public cooldownMs = 3000) {}
  async call<T>(fn: () => Promise<T>): Promise<T> {
    // Open and still cooling down: fail fast without touching the dependency.
    if (this.state === 'OPEN' && Date.now() - this.openedAt < this.cooldownMs) throw new Error('unavailable')
    // Cooldown elapsed: let one trial request through (half-open).
    if (this.state === 'OPEN') this.state = 'HALF'
    try {
      const res = await fn()
      this.failures = 0; this.state = 'CLOSED' // any success closes the breaker
      return res
    } catch (e) {
      this.failures++
      if (this.failures >= this.threshold) { this.state = 'OPEN'; this.openedAt = Date.now() } // trip open
      throw e
    }
  }
}
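A minimal usage sketch, assuming a hypothetical fetchRates dependency; create one breaker per dependency and reuse it across requests so failures accumulate in one place.
// language-typescript
declare function fetchRates(): Promise<number[]> // hypothetical downstream call
const ratesBreaker = new Breaker(5, 3000) // one breaker per dependency, shared by all requests
const rates = await ratesBreaker.call(() => fetchRates())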
Bulkheads and isolation
- Use separate pools for slow/outbound dependencies so they cannot starve core request handling.
- Apply per‑endpoint concurrency caps (e.g., 64) and shed load when exceeded.
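A minimal sketch of such a cap, not tied to any framework: an in-process bulkhead that tracks in-flight calls to one dependency and sheds load once the limit is reached (names and numbers are illustrative).
// language-typescript
// Bulkhead: at most `limit` calls in flight; extra calls fail fast instead of queueing.
class Bulkhead {
  private inFlight = 0
  constructor(private limit = 64) {}
  async run<T>(fn: () => Promise<T>): Promise<T> {
    if (this.inFlight >= this.limit) throw new Error('overloaded') // shed load
    this.inFlight++
    try { return await fn() } finally { this.inFlight-- }
  }
}
const reportsBulkhead = new Bulkhead(64) // separate cap for a slow reporting dependency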
Graceful degradation
- Serve cached or approximate data when live data is unavailable (see the sketch after this list).
- Hide non‑critical widgets; reduce quality (e.g., lower image bitrate) under stress.
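A sketch of the cached-fallback idea, assuming a hypothetical fetchLivePrices dependency and a simple in-memory copy of the last good response; callers are told when they are getting stale data.
// language-typescript
type Prices = Record<string, number> // illustrative shape
declare function fetchLivePrices(): Promise<Prices> // hypothetical live dependency
let lastGood: Prices | undefined
async function getPrices(): Promise<{ data: Prices; degraded: boolean }> {
  try {
    lastGood = await fetchLivePrices() // refresh the fallback copy on every success
    return { data: lastGood, degraded: false }
  } catch (e) {
    if (lastGood) return { data: lastGood, degraded: true } // serve stale data rather than an error
    throw e // nothing cached yet: surface the failure
  }
}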
Data redundancy
- Replicate across AZs/regions; test failover.
- Backups are separate from replication—test restores regularly.
Observability
- RED metrics (Rate, Errors, Duration) per dependency (see the sketch after this list).
- Health checks with fast ejection, slow reintroduction.
- Traces tagged with retry_count, cache_hit, degraded_mode.
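As a sketch of the RED idea, a wrapper that records rate, errors, and duration per dependency along with the tags above; the console sink and tag names are placeholders, not a specific metrics library.
// language-typescript
// Record one observation per dependency call: outcome and duration, plus free-form tags.
async function observed<T>(dep: string, fn: () => Promise<T>, tags: Record<string, unknown> = {}): Promise<T> {
  const start = Date.now()
  try {
    const res = await fn()
    console.log(JSON.stringify({ dep, ok: true, ms: Date.now() - start, ...tags }))
    return res
  } catch (e) {
    console.log(JSON.stringify({ dep, ok: false, ms: Date.now() - start, ...tags }))
    throw e
  }
}
// usage (callBilling is hypothetical): observed('billing', () => callBilling(), { retry_count: 1, degraded_mode: false })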
Game days and chaos
Practice failure: inject latency, drop packets, kill nodes; verify alarms, auto‑scaling, and runbooks.
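One low-effort way to start in application code, a sketch that assumes a made-up CHAOS_LATENCY_MS environment variable: wrap outbound calls and add artificial latency to a small fraction of them.
// language-typescript
// Inject extra latency into ~10% of calls when the (hypothetical) CHAOS_LATENCY_MS env var is set.
async function withChaos<T>(fn: () => Promise<T>): Promise<T> {
  const extraMs = Number(process.env.CHAOS_LATENCY_MS ?? 0)
  if (extraMs > 0 && Math.random() < 0.1) {
    await new Promise(r => setTimeout(r, extraMs))
  }
  return fn()
}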
Runbook checklist
- Identify critical dependencies and timeouts.
- Define retry/circuit policies per dependency.
- Define degrade modes and the toggles to activate them.
- Drill failover twice a year; record MTTR.
Analogy
Think of bulkheads in a ship: if one compartment floods, the ship stays afloat. Similarly, isolate components so one failure doesn’t sink the product.
FAQ
- Should I always retry? No. Retry only idempotent operations, and only within a budget; retrying timeouts blindly can amplify an overload.
- Are long timeouts safer? No, they hide problems and consume resources. Prefer short timeouts and fast failover.