Resilience patterns are the area of platform engineering with the worst signal-to-noise ratio. Every blog post recommends the same four patterns. Every postmortem cites a different missing one. Most teams implement none of them correctly.
This is the short version of what we have learned shipping platforms that survive their first incident.
The Patterns Worth Implementing
1. Timeouts Everywhere, Configured Per-Dependency
The single most common cause of cascading failure we see is a default timeout. The HTTP client defaults to 30s. The database driver defaults to 60s. Nothing in your system actually wants to wait that long, but nobody wrote a smaller number, so the request thread sits there until a load balancer kills the connection.
Every outbound call in the system should have a timeout. That timeout should be derived from the SLO of the calling endpoint, not chosen by feel. If the endpoint promises p99 of 500ms, the synchronous dependency budget cannot be 1s.
// Derived from the calling endpoint's latency budget, not from the driver defaults.
const dbClient = createClient({
  connectionTimeoutMs: 200,
  queryTimeoutMs: 400,
});
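The same rule applies to the HTTP client. A sketch using fetch with an explicit deadline; the URL is illustrative, and 250ms stands in for whatever is left of the endpoint's budget:

// Deadline derived from the endpoint's remaining budget, never the client default.
async function fetchInvoices(): Promise<unknown> {
  const response = await fetch("https://billing.internal/invoices", {
    signal: AbortSignal.timeout(250), // abort instead of waiting out a 30s default
  });
  return response.json();
}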
If the value looks aggressive, it probably is. Aggressive is correct.
2. Circuit Breakers, But Only On Synchronous Hot Paths
Circuit breakers are powerful and frequently misapplied. They belong in front of synchronous dependencies that block user-visible requests. They do not belong in front of background workers, retries, or async fan-outs — those have their own backpressure mechanisms.
A circuit breaker on a webhook delivery worker is actively harmful: when the breaker opens, deliveries stop, the backlog keeps growing, and when it closes again the downstream takes the accumulated burst all at once. It amplifies the very backlog you are trying to drain.
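What we mean on the synchronous hot path, as a minimal sketch. The thresholds, cool-off, and fallback are illustrative; the point is that an open breaker hands back a degraded response immediately instead of queueing more work behind a failing dependency.

// Opens after a run of consecutive failures, then lets one probe through after the cool-off.
type State = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: State = "closed";
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly failureThreshold = 5,
    private readonly coolOffMs = 10_000,
  ) {}

  async call<T>(fn: () => Promise<T>, fallback: () => T): Promise<T> {
    if (this.state === "open") {
      if (Date.now() - this.openedAt < this.coolOffMs) return fallback();
      this.state = "half-open"; // allow a single probe request
    }
    try {
      const result = await fn();
      this.state = "closed";
      this.failures = 0;
      return result;
    } catch {
      this.failures += 1;
      if (this.state === "half-open" || this.failures >= this.failureThreshold) {
        this.state = "open";
        this.openedAt = Date.now();
      }
      return fallback();
    }
  }
}

Wrap only the call that blocks the user. A worker draining a queue already has the queue as its buffer.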
3. Bulkheads Via Connection Pools, Not Microservices
You do not need to extract a service to isolate a failure mode. A separate connection pool with a separate semaphore is a bulkhead. It costs ten lines of code and zero new deployment artifacts.
The classic example: one slow query on the reporting endpoint exhausts the shared pool and kills login. Solution: a dedicated pool sized to the reporting workload, not a new microservice.
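A sketch of those ten lines, assuming the reporting queries can be routed through one wrapper. The hand-rolled Semaphore is for illustration; most pool libraries expose an equivalent concurrency limit.

// Bulkhead: cap concurrent reporting queries so they cannot drain the shared pool.
class Semaphore {
  private waiters: Array<() => void> = [];
  private available: number;

  constructor(limit: number) {
    this.available = limit;
  }

  async acquire(): Promise<void> {
    if (this.available > 0) {
      this.available -= 1;
      return;
    }
    await new Promise<void>((resolve) => this.waiters.push(() => resolve()));
  }

  release(): void {
    const next = this.waiters.shift();
    if (next) next(); // hand the slot straight to the next waiter
    else this.available += 1;
  }
}

const reportingSlots = new Semaphore(4); // sized to the reporting workload

async function runReportingQuery<T>(query: () => Promise<T>): Promise<T> {
  await reportingSlots.acquire();
  try {
    return await query();
  } finally {
    reportingSlots.release();
  }
}

Login never competes for those four slots; a slow report queues behind other reports instead of behind everything else.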
The Patterns That Sound Smart And Aren't
Retries With Exponential Backoff Without A Budget
Retries without a budget are how you turn a brief blip into a thundering herd. The pattern that works:
Retry budget per request (max 2 attempts, max 200ms total)
Retry budget per service (no more than 10% of total RPS can be retries)
Jittered backoff, never pure exponential
If you cannot enforce the second budget, you do not have retries — you have a denial of service amplifier.
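A minimal sketch of the per-request half of that budget, matching the numbers above. The per-service cap is assumed to be enforced elsewhere, for example with a token bucket shared across the service; the names here are illustrative.

// At most 2 attempts and 200ms of total retry delay, with full jitter.
const MAX_ATTEMPTS = 2;
const MAX_RETRY_DELAY_MS = 200;

function jitteredDelay(attempt: number): number {
  const base = Math.min(50 * 2 ** attempt, MAX_RETRY_DELAY_MS);
  return Math.random() * base; // full jitter, never a synchronized wave of retries
}

async function withRetryBudget<T>(fn: () => Promise<T>): Promise<T> {
  let spentMs = 0;
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const delay = jitteredDelay(attempt);
      if (attempt + 1 >= MAX_ATTEMPTS || spentMs + delay > MAX_RETRY_DELAY_MS) {
        throw err; // budget exhausted: fail fast instead of piling on
      }
      spentMs += delay;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}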
Multi-Region Active-Active Before You Have Active-One Working
We have walked into more than one engagement where the team was midway through a multi-region migration and could not reliably keep a single region healthy. Multi-region adds a coordination tax that compounds existing reliability problems.
Get one region to four nines first. Then talk about two.
Graceful Degradation Is The Pattern That Wins
The single highest-leverage resilience investment is explicit degradation paths for the dependencies that fail most often.
When the recommendation service is down, render the page without recommendations. When the avatar CDN is slow, fall back to initials. When the analytics queue is backed up, drop the lowest-priority events at the edge.
Every one of those is a few lines of code. None of them are sexy. All of them are the difference between a status page incident and a Twitter incident.
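To make "a few lines of code" concrete, here is a sketch of the recommendations case; the type, the client, and the 150ms budget are stand-ins for whatever the page actually calls.

// Stand-in type and client for illustration.
type Recommendation = { id: string; title: string };

const recommendationClient = {
  async fetch(userId: string): Promise<Recommendation[]> {
    return [{ id: "r1", title: "Example" }]; // pretend this is a network call
  },
};

// Race the call against a small deadline; timer cleanup omitted for brevity.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    p,
    new Promise<T>((_, reject) =>
      setTimeout(() => reject(new Error("recommendations timed out")), ms),
    ),
  ]);
}

async function getRecommendations(userId: string): Promise<Recommendation[]> {
  try {
    return await withTimeout(recommendationClient.fetch(userId), 150);
  } catch {
    return []; // render the page without the rail; log it and move on
  }
}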
The best resilience pattern is the one that lets your users not notice.
If your platform is one outage away from a hard conversation with a customer, we should talk.