Why fault tolerance?
Systems fail in partial, weird ways: slow dependencies, packet loss, node crashes, power events, deploy bugs. Fault tolerance keeps the blast radius small and the UX acceptable under stress.
Principles
- Remove single points of failure (SPOF): N+1 redundancy, multi‑AZ/region.
- Isolate: bulkheads, separate thread/connection pools, concurrency limits.
- Timeouts everywhere, with sensible defaults per dependency.
- Retries with jitter and budgets (never infinite, never synchronized).
- Circuit breakers to fail fast and allow recovery.
- Graceful degradation and feature flags to turn off non‑essentials.
Timeouts and budgets
# language-nginx
# Keep upstream waits short; a slow dependency should fail fast, not pin worker connections.
proxy_connect_timeout 1s;
proxy_send_timeout 1500ms;
proxy_read_timeout 1500ms;
// language-typescript
// Cancel the request once the 1.2 s budget is exceeded; always clear the timer afterwards.
const controller = new AbortController()
const t = setTimeout(() => controller.abort(), 1200)
await fetch(url, { signal: controller.signal }).finally(() => clearTimeout(t))
Retries done right
// language-typescript
// Exponential backoff with jitter; gives up after `attempts` tries.
async function retry<T>(fn: () => Promise<T>, { attempts = 3, base = 100 } = {}): Promise<T> {
  for (let i = 0; i < attempts; i++) {
    try { return await fn() } catch (e) {
      if (i === attempts - 1) throw e // out of attempts: surface the error
      const delay = base * 2 ** i + Math.random() * 50 // backoff plus jitter to de-synchronize callers
      await new Promise(r => setTimeout(r, delay))
    }
  }
  throw new Error('unreachable') // for the type checker; the loop always returns or throws
}
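For illustration, a hedged usage sketch: fetchProfile below is a hypothetical, idempotent read, which is what makes retrying it safe.
// language-typescript
declare function fetchProfile(id: string): Promise<unknown> // hypothetical idempotent read
// Up to 3 attempts with a ~100 ms base backoff; a non-idempotent write would need an idempotency key first.
const profile = await retry(() => fetchProfile('user-123'), { attempts: 3, base: 100 })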
Circuit breaker (sketch)
// language-typescript
type BreakerState = 'CLOSED' | 'OPEN' | 'HALF'
class Breaker {
  state: BreakerState = 'CLOSED'; failures = 0; openedAt = 0
  constructor(public threshold = 5, public cooldownMs = 3000) {}
  async call<T>(fn: () => Promise<T>): Promise<T> {
    // Open and still cooling down: fail fast without touching the dependency.
    if (this.state === 'OPEN' && Date.now() - this.openedAt < this.cooldownMs) throw new Error('unavailable')
    // Cooldown elapsed: let one trial request through (half-open).
    if (this.state === 'OPEN') this.state = 'HALF'
    try {
      const res = await fn()
      this.failures = 0; this.state = 'CLOSED' // any success closes the breaker
      return res
    } catch (e) {
      this.failures++
      if (this.failures >= this.threshold) { this.state = 'OPEN'; this.openedAt = Date.now() } // trip open
      throw e
    }
  }
}
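A minimal usage sketch, assuming a hypothetical fetchRates dependency; create one breaker per dependency and reuse it across requests so failures accumulate in one place.
// language-typescript
declare function fetchRates(): Promise<number[]> // hypothetical downstream call
const ratesBreaker = new Breaker(5, 3000) // one breaker per dependency, shared by all requests
const rates = await ratesBreaker.call(() => fetchRates())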
Bulkheads and isolation
- Use separate pools for slow/outbound dependencies so they cannot starve core request handling.
- Apply per‑endpoint concurrency caps (e.g., 64) and shed load when exceeded.
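A minimal sketch of such a cap, not tied to any framework: an in-process bulkhead that tracks in-flight calls to one dependency and sheds load once the limit is reached (names and numbers are illustrative).
// language-typescript
// Bulkhead: at most `limit` calls in flight; extra calls fail fast instead of queueing.
class Bulkhead {
  private inFlight = 0
  constructor(private limit = 64) {}
  async run<T>(fn: () => Promise<T>): Promise<T> {
    if (this.inFlight >= this.limit) throw new Error('overloaded') // shed load
    this.inFlight++
    try { return await fn() } finally { this.inFlight-- }
  }
}
const reportsBulkhead = new Bulkhead(64) // separate cap for a slow reporting dependency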
Graceful degradation
- Serve cached or approximate data when live data is unavailable (see the sketch after this list).
- Hide non‑critical widgets; reduce quality (e.g., lower image bitrate) under stress.
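A sketch of the cached-fallback idea, assuming a hypothetical fetchLivePrices dependency and a simple in-memory copy of the last good response; callers are told when they are getting stale data.
// language-typescript
type Prices = Record<string, number> // illustrative shape
declare function fetchLivePrices(): Promise<Prices> // hypothetical live dependency
let lastGood: Prices | undefined
async function getPrices(): Promise<{ data: Prices; degraded: boolean }> {
  try {
    lastGood = await fetchLivePrices() // refresh the fallback copy on every success
    return { data: lastGood, degraded: false }
  } catch (e) {
    if (lastGood) return { data: lastGood, degraded: true } // serve stale data rather than an error
    throw e // nothing cached yet: surface the failure
  }
}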
Data redundancy
- Replicate across AZs/regions; test failover.
- Backups are separate from replication—test restores regularly.
Observability
- RED metrics (Rate, Errors, Duration) per dependency (see the sketch after this list).
- Health checks with fast ejection, slow reintroduction.
- Traces tagged with retry_count, cache_hit, degraded_mode.
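As a sketch of the RED idea, a wrapper that records rate, errors, and duration per dependency along with the tags above; the console sink and tag names are placeholders, not a specific metrics library.
// language-typescript
// Record one observation per dependency call: outcome and duration, plus free-form tags.
async function observed<T>(dep: string, fn: () => Promise<T>, tags: Record<string, unknown> = {}): Promise<T> {
  const start = Date.now()
  try {
    const res = await fn()
    console.log(JSON.stringify({ dep, ok: true, ms: Date.now() - start, ...tags }))
    return res
  } catch (e) {
    console.log(JSON.stringify({ dep, ok: false, ms: Date.now() - start, ...tags }))
    throw e
  }
}
// usage (callBilling is hypothetical): observed('billing', () => callBilling(), { retry_count: 1, degraded_mode: false })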
Game days and chaos
Practice failure: inject latency, drop packets, kill nodes; verify alarms, auto‑scaling, and runbooks.
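One low-effort way to start in application code, a sketch that assumes a made-up CHAOS_LATENCY_MS environment variable: wrap outbound calls and add artificial latency to a small fraction of them.
// language-typescript
// Inject extra latency into ~10% of calls when the (hypothetical) CHAOS_LATENCY_MS env var is set.
async function withChaos<T>(fn: () => Promise<T>): Promise<T> {
  const extraMs = Number(process.env.CHAOS_LATENCY_MS ?? 0)
  if (extraMs > 0 && Math.random() < 0.1) {
    await new Promise(r => setTimeout(r, extraMs))
  }
  return fn()
}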
Runbook checklist
- Identify critical dependencies and timeouts.
- Define retry/circuit policies per dependency.
- Define degrade modes and the toggles to activate them.
- Drill failover twice a year; record MTTR.
Analogy
Think of bulkheads in a ship: if one compartment floods, the ship stays afloat. Similarly, isolate components so one failure doesn’t sink the product.
FAQ
- Should I always retry? No. Retry only idempotent operations, and only within a budget; retrying timeouts blindly can amplify an overload.
- Are long timeouts safer? No, they hide problems and consume resources. Prefer short timeouts and fast failover.