Fault Tolerance

September 15, 2025

Why fault tolerance?

Systems fail in partial, weird ways: slow dependencies, packet loss, node crashes, power events, deploy bugs. Fault tolerance keeps the blast radius small and the UX acceptable under stress.

Principles

  • Remove single points of failure (SPOF): N+1 redundancy, multi‑AZ/region.
  • Isolate: bulkheads, separate thread/connection pools, concurrency limits.
  • Timeouts everywhere, with sensible defaults per dependency.
  • Retries with jitter and budgets (never infinite, never synchronized).
  • Circuit breakers to fail fast and allow recovery.
  • Graceful degradation and feature flags to turn off non‑essentials.

Timeouts and budgets

# language-nginx
# Upstream proxy timeouts: fail fast instead of queueing behind a slow backend.
proxy_connect_timeout 1s;
proxy_send_timeout 1.5s;
proxy_read_timeout 1.5s;

// language-typescript
// Client-side deadline: abort the request if it takes longer than 1.2s.
const controller = new AbortController()
const t = setTimeout(() => controller.abort(), 1200)
await fetch(url, { signal: controller.signal }).finally(() => clearTimeout(t))
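
Timeouts compose into a per‑request budget: each downstream call should spend at most what remains of the overall deadline. A minimal sketch, assuming Node 18+ (for AbortSignal.timeout); the Deadline helper and the service URLs are illustrative, not a standard API:

// language-typescript
// Hypothetical helper: one request-level budget shared by all downstream calls.
class Deadline {
  private readonly expiresAt: number
  constructor(budgetMs: number) { this.expiresAt = Date.now() + budgetMs }
  remainingMs(): number { return Math.max(0, this.expiresAt - Date.now()) }
  // Cap each dependency's timeout at whatever budget is left.
  signal(maxMs: number): AbortSignal {
    return AbortSignal.timeout(Math.min(maxMs, this.remainingMs()))
  }
}

// A 2s budget for the whole request; each call gets its own cap or what's left.
const deadline = new Deadline(2000)
await fetch(userServiceUrl, { signal: deadline.signal(800) })
await fetch(pricingServiceUrl, { signal: deadline.signal(1200) })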

Retries done right

// language-typescript
// Exponential backoff with jitter; attempts are capped, never infinite.
async function retry<T>(
  fn: () => Promise<T>,
  { attempts = 3, base = 100 } = {}
): Promise<T> {
  for (let i = 0; i < attempts; i++) {
    try { return await fn() } catch (e) {
      if (i === attempts - 1) throw e // budget exhausted: surface the error
      // Back off exponentially, plus random jitter so clients don't retry in lockstep.
      const delay = base * Math.pow(2, i) + Math.random() * 50
      await new Promise(r => setTimeout(r, delay))
    }
  }
  throw new Error('unreachable') // loop always returns or throws; appeases the type checker
}

Circuit breaker (sketch)

// language-typescript
type BreakerState = 'CLOSED' | 'OPEN' | 'HALF_OPEN'

class Breaker {
  private state: BreakerState = 'CLOSED'
  private failures = 0
  private openedAt = 0
  constructor(public threshold = 5, public cooldownMs = 3000) {}
  async call<T>(fn: () => Promise<T>): Promise<T> {
    // While OPEN, fail fast during the cooldown; afterwards let one probe through.
    if (this.state === 'OPEN' && Date.now() - this.openedAt < this.cooldownMs) throw new Error('unavailable')
    if (this.state === 'OPEN') this.state = 'HALF_OPEN'
    try { const res = await fn(); this.failures = 0; this.state = 'CLOSED'; return res }
    catch (e) {
      this.failures++
      // Trip after repeated failures, or immediately if the half-open probe fails.
      if (this.state === 'HALF_OPEN' || this.failures >= this.threshold) { this.state = 'OPEN'; this.openedAt = Date.now() }
      throw e
    }
  }
}

Bulkheads and isolation

  • Use separate pools for slow/outbound dependencies so they cannot starve core request handling.
  • Apply per‑endpoint concurrency caps (e.g., 64) and shed load when exceeded; see the bulkhead sketch below.
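
A minimal bulkhead sketch: a counting limiter that rejects work once the cap is hit rather than queueing it. The Bulkhead class, the cap of 64, and recommendationsUrl are illustrative:

// language-typescript
// Cap in-flight calls per dependency; shed the excess instead of queueing.
class Bulkhead {
  private active = 0
  constructor(private limit: number) {}
  async run<T>(fn: () => Promise<T>): Promise<T> {
    if (this.active >= this.limit) throw new Error('bulkhead full') // shed load
    this.active++
    try { return await fn() } finally { this.active-- }
  }
}

// One bulkhead per downstream dependency so a slow one cannot starve the rest.
const recsBulkhead = new Bulkhead(64)
await recsBulkhead.run(() => fetch(recommendationsUrl))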

Graceful degradation

  • Serve cached or approximate data when live data is unavailable (see the fallback sketch after this list).
  • Hide non‑critical widgets; reduce quality (e.g., lower image bitrate) under stress.
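
A minimal fallback sketch, assuming an in‑memory cache of the last good value; fetchLivePrices, priceCache, and the Prices type are illustrative:

// language-typescript
type Prices = Record<string, number>
// Illustrative stand-ins: a live fetch that may fail, and an in-memory cache.
declare function fetchLivePrices(): Promise<Prices>
const priceCache = new Map<string, Prices>()

// Try live data first; on failure serve the last known value and flag it
// so callers (and traces) know the response is degraded.
async function getPrices(): Promise<{ data: Prices; degraded: boolean }> {
  try {
    const data = await fetchLivePrices()
    priceCache.set('prices', data)
    return { data, degraded: false }
  } catch {
    const cached = priceCache.get('prices')
    if (cached) return { data: cached, degraded: true }
    throw new Error('no live data and no cached fallback')
  }
}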

Data redundancy

  • Replicate across AZs/regions; test failover.
  • Backups are not a byproduct of replication: corruption and deletes replicate too, so keep independent backups and test restores regularly.

Observability

  • RED metrics (Rate, Errors, Duration) per dependency; see the sketch after this list.
  • Health checks with fast ejection, slow reintroduction.
  • Traces tagged with retry_count, cache_hit, degraded_mode.
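
A minimal sketch of per‑dependency RED bookkeeping; in practice these numbers would be exported through a metrics library, and the Red class and paymentsUrl are illustrative:

// language-typescript
// Per-dependency RED counters: Rate (requests), Errors, Duration.
class Red {
  requests = 0
  errors = 0
  durationsMs: number[] = []
  async observe<T>(fn: () => Promise<T>): Promise<T> {
    const start = Date.now()
    this.requests++
    try { return await fn() }
    catch (e) { this.errors++; throw e }
    finally { this.durationsMs.push(Date.now() - start) } // runs on success and failure
  }
}

const paymentsRed = new Red() // one instance per dependency
await paymentsRed.observe(() => fetch(paymentsUrl))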

Game days and chaos

Practice failure: inject latency, drop packets, kill nodes; verify alarms, auto‑scaling, and runbooks.
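
A minimal fault‑injection sketch for test environments; the chaos wrapper and its probabilities are illustrative:

// language-typescript
// Wrap a dependency call with probabilistic latency and error injection.
async function chaos<T>(
  fn: () => Promise<T>,
  { latencyP = 0.1, errorP = 0.05, delayMs = 2000 } = {}
): Promise<T> {
  if (Math.random() < latencyP) await new Promise(r => setTimeout(r, delayMs)) // inject latency
  if (Math.random() < errorP) throw new Error('chaos: injected failure')       // inject a fault
  return fn()
}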

Runbook checklist

  • Identify critical dependencies and timeouts.
  • Define retry/circuit policies per dependency.
  • Define degrade modes and the toggles to activate them.
  • Drill failover twice a year; record MTTR.

Analogy

Think of bulkheads in a ship: if one compartment floods, the ship stays afloat. Similarly, isolate components so one failure doesn’t sink the product.

FAQ

  • Should I always retry? No: retry only idempotent operations, and always within a budget. Retrying timeouts blindly can amplify an overload (see the idempotency sketch below).
  • Are long timeouts safer? No, they hide problems and consume resources. Prefer short timeouts and fast failover.
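
On making writes safely retryable: one common pattern is an idempotency key the server can deduplicate on. The Idempotency-Key header is a widespread convention rather than a standard, paymentUrl and order are illustrative, and this reuses the retry helper from above:

// language-typescript
// Generate one key per logical operation (not per attempt) so the server
// can recognize and deduplicate retried writes.
const idempotencyKey = crypto.randomUUID()
await retry(() => fetch(paymentUrl, {
  method: 'POST',
  headers: { 'Idempotency-Key': idempotencyKey, 'Content-Type': 'application/json' },
  body: JSON.stringify(order),
}))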