Redr · Study Guide

Release It!

Design and Deploy Production-Ready Software

Michael T. Nygard

Unofficial AI-assisted study guide. Not affiliated with or endorsed by the author or publisher. For educational use — supplements, not replaces, the original work.

Contents

Part 01
Create Stability
  • 01 Living in Production
  • 02 Case Study: The Exception That Grounded an Airline
  • 03 Stabilize Your System
  • 04 Stability Antipatterns
  • 05 Stability Patterns
Part 02
Design for Production
  • 06 Case Study: Phenomenal Cosmic Powers, Itty-Bitty Living Space
  • 07 Foundations
  • 08 Processes on Machines
  • 09 Interconnect
  • 10 Control Plane
  • 11 Security
Part 03
Deliver Your System
  • 12 Case Study: Waiting for Godot
  • 13 Design for Deployment
  • 14 Handling Versions
Part 04
Solve Systemic Problems
  • 15 Case Study: Trampled by Your Own Customers
  • 16 Adaptation
  • 17 Chaos Engineering

Part 01

Create Stability

Ch. 1–5

Ch. 01

Living in Production

Code only delivers value once it serves real users, and production exposes scale, adversaries, and quirks no test environment reproduces. Designing for operations, deployment, and failure must be a first-class concern, not an afterthought tacked on at the end of a project.

Ch. 01

Software Only Earns Value in Production

A feature in a repo or staging environment is worth nothing. The moment of deployment is also the moment the business starts earning revenue and incurring liability. Every design choice that ignores this (feature-only thinking, deployment as an afterthought) defers cost to production, the most expensive place to fix it.

Ch. 01

QA Cannot Reproduce Production

Production has scale, network partitions, hardware quirks, and adversaries that test environments simply cannot mimic. A clean test pass means the code does what was specified — not that it survives the real world. Passing QA is not the same as production-ready.

Ch. 01

Cost of Downtime is Non-Linear

Outages do not merely lose transactions for the duration. They erode customer trust, trigger regulatory scrutiny, and consume engineering time for weeks after as teams scramble through stabilization patches under public pressure. The total cost of a one-hour outage often dwarfs the cost of preventing it.

Ch. 01

Operations is a Stakeholder

Operators are first-class users of the system with their own usability needs: logs, metrics, knobs, deploy hooks, kill switches. A system that is hostile to operate is not finished, regardless of how the features test. Design for the person paged at 3 a.m.

Ch. 01

Pragmatic Architecture Over Ivory Tower

Top-down architectural decrees that mandate specific middleware or frameworks rarely survive contact with production constraints. The goal is shipping value reliably, not architectural purity. Beware grand standards disconnected from operational reality.

Ch. 01 · Vocab
Production
The environment where real users interact with the system and the business earns revenue or incurs liability.
Production-ready
Software designed so it can be deployed, operated, monitored, recovered, and changed safely under real-world conditions.
Uptime / Availability
The fraction of time a system is responsive to users; often expressed as nines (99.9%, 99.99%, 99.999%).
Five nines
99.999% availability — roughly five minutes of downtime per year.
Ivory tower architecture
Top-down architectural decrees disconnected from operational reality.
Total Cost of Ownership (TCO)
The full lifetime cost of the software — build plus run plus change plus retire — not just development.
Operational concerns
Cross-cutting needs like logging, monitoring, deployment, configuration, secrets, and on-call response.
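
The "nines" in the vocabulary above translate directly into a yearly downtime budget. A minimal sketch of the arithmetic, assuming a 365-day year (the function name is illustrative, not from the book):

```python
# Downtime budget implied by an availability target over a 365-day year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return how many minutes per year a system may be down at this availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)

for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_minutes_per_year(target):.1f} min/year")
```

Under that assumption, three nines allow roughly 526 minutes a year, four nines roughly 53, and five nines roughly 5.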
Ch. 01 · Quiz 1 / 4

Multiple choice

A team ships a feature that passes every functional and integration test in QA, but is paged the night after release because real users see intermittent errors under load. Which idea best explains the gap?

Ch. 01 · Quiz 2 / 4

True / False

A one-hour outage costs the business roughly one hour's worth of revenue.

Ch. 01 · Quiz 3 / 4

Multiple choice

An architecture review board mandates a specific enterprise message bus for all new services, regardless of whether the service is a low-traffic batch job or a high-throughput user-facing API. What antipattern does this most resemble?

Ch. 01 · Quiz 4 / 4

Spot the issue

A service ships with no log levels, no metrics endpoint, no health check, and no way to flip a feature off without a redeploy. The functional tests all pass and product is happy. What is the core problem from a Release It! perspective?

Ch. 02

Case Study: The Exception That Grounded an Airline

An airline check-in system collapsed nationwide at the morning peak, taking down kiosks and gate agents for three hours and delaying flights for nearly nine. The root cause was a single uncaught SQLException, wrapped in two more exceptions, that slowly drained the JDBC connection pool until every application server was blocked.

Ch. 02

A Tiny Bug Can Ground a Fleet

Hundreds of thousands of dollars of impact traced to a few lines of unhandled exception code. Stability problems are rarely glamorous — they are usually small defects in places nobody thought to look. Magnitude of consequence is uncorrelated with magnitude of cause.

Ch. 02

Failures Hide in Low-Frequency Code Paths

The bad path only executed during a database failover, so it never surfaced in test or normal operations. Code that runs once a year still runs in production, and when it does, it had better work. Test the failure modes you hope never to use.

Ch. 02

Resource Leaks Compound Silently

Each thrown exception leaked one JDBC connection. The pool slowly drained until total deadlock — no errors, no crashes, just silence. A healthy log with an unresponsive server points to blocked threads, not crashes.
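
One way to picture the leak and its remedy is a pool whose checkout always returns the connection, even when the work fails. This is a sketch only: `ConnectionPool`, `do_work`, and the string "connections" are hypothetical stand-ins, not the airline system's code.

```python
import queue
from contextlib import contextmanager

class ConnectionPool:
    """Hypothetical bounded pool; the 'connections' are plain strings."""

    def __init__(self, size: int) -> None:
        self._free: queue.Queue = queue.Queue()
        for i in range(size):
            self._free.put(f"conn-{i}")

    def available(self) -> int:
        return self._free.qsize()

    @contextmanager
    def connection(self, timeout: float = 1.0):
        # Bounded wait: raises queue.Empty instead of blocking forever.
        conn = self._free.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Runs on success and on exception, so nothing is stranded.
            self._free.put(conn)

def do_work(conn: str) -> None:
    raise RuntimeError("simulated database error")

pool = ConnectionPool(size=2)
for _ in range(5):
    try:
        with pool.connection() as conn:
            do_work(conn)
    except RuntimeError:
        pass  # the connection was already returned by the finally clause

print(pool.available())
```

Five failures against a pool of two would have drained it if any failure kept its connection; here the pool is still full afterwards.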

Ch. 02

Symptoms Appear Far From Causes

Operators saw frozen check-in kiosks. The actual fault was a `SQLException` wrapped in an `InvocationTargetException` wrapped in a `RemoteException` deep inside an EJB call. Naive `catch` blocks never saw the real exception. Unwrap exceptions all the way down.
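
"Unwrap exceptions all the way down" can be sketched in Python by following the exception chain to its innermost link. The three exception classes below are hypothetical stand-ins for the Java types named above, and `root_cause` is an illustrative helper.

```python
class DatabaseError(Exception):
    """Stand-in for SQLException."""

class InvocationError(Exception):
    """Stand-in for InvocationTargetException."""

class RemoteError(Exception):
    """Stand-in for RemoteException."""

def root_cause(exc: BaseException) -> BaseException:
    """Follow __cause__ / __context__ links to the innermost exception."""
    seen = set()
    while id(exc) not in seen:  # guard against a cyclic chain
        seen.add(id(exc))
        inner = exc.__cause__ or exc.__context__
        if inner is None:
            break
        exc = inner
    return exc

def call_remote() -> None:
    try:
        try:
            raise DatabaseError("connection reset during failover")
        except DatabaseError as e:
            raise InvocationError("reflective call failed") from e
    except InvocationError as e:
        raise RemoteError("remote failure") from e

try:
    call_remote()
except RemoteError as outer:
    cause = root_cause(outer)
    print(type(cause).__name__, "-", cause)
```

A handler that logs only the outer `RemoteError` reports "remote failure"; logging `root_cause(outer)` surfaces the database error that actually matters.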

Ch. 02

Restart Is Not a Fix

Bouncing the cluster restored service but cost three hours and obscured the real defect. Root-cause analysis is essential: a restart that "fixes" something you do not understand only schedules the next incident. The same bug bites twice if you do not name it.

Ch. 02

Blast Radius Matters

A check-in outage forced manual workflows that propagated delays through gates, crews, and connecting flights. One small system, badly contained, took out an entire operational network. Containment is a design choice you make before the failure.

Ch. 02 · Vocab
Failover
Automated handoff from a failed primary database (or component) to a standby; a rare code path that exposes latent bugs.
Connection pool
A bounded set of pre-opened database connections shared by request threads; exhausting it blocks every new request.
Blocked thread
A thread waiting indefinitely on a resource (lock, socket, pooled connection) and therefore unable to serve traffic.
SQLException
JDBC's checked exception for database errors; in the case study, the buried real fault.
InvocationTargetException / RemoteException
Java reflection and RMI wrappers that nested the SQLException several layers deep.
Stranded resource
A resource still allocated but no longer reachable for cleanup or reuse.
Root-cause analysis (RCA)
Forensic review of an incident to identify the underlying defect, not just the visible symptom.
Ch. 02 · Quiz 1 / 4

Spot the issue

What's wrong with this snippet, given the airline case study?

try {
  Connection c = pool.getConnection();
  doWork(c);
  c.close();
} catch (Exception e) {
  log.error("query failed", e);
}
Ch. 02 · Quiz 2 / 4

Multiple choice

The grounding was traced to a code path that only ran during a database failover. What is the broader lesson?

Ch. 02 · Quiz 3 / 4

True / False

The fact that a server is still logging means the application is healthy.

Ch. 02 · Quiz 4 / 4

Multiple choice

Catch blocks in the airline system caught `RemoteException` and logged a generic "remote failure" message, never noticing the wrapped `SQLException` underneath. What principle does this violate?

Ch. 03

Stabilize Your System

Stability is not the absence of faults but their containment. Nygard defines a hierarchy — fault becomes error becomes failure — and distinguishes the short, sharp shocks of impulses from the slow grind of stresses. The job is to keep faults from becoming user-visible outages and to limit the damage when they do.

Ch. 03

Fault, Error, Failure

A fault is a condition that creates an incorrect internal state (a latent bug, an unchecked bad input). An error is visibly incorrect behavior produced by that fault. A failure is when the system stops doing useful work for users. Stability work tries to keep faults from propagating up this chain.

Ch. 03

Cracks Propagate Along Integration Points

Most cracks travel between systems via the calls that connect them — RPC, HTTP, database, message broker. Defense lives at those boundaries. Every line where your code talks to something it does not own is a place to harden.

Ch. 03

Impulses vs. Stresses

Some forces hit hard and fast — a traffic spike, a power outage, a dependency vanishing. Others wear the system down over time — memory leaks, slowly slowing dependencies, accumulating data. You design for both, but they require different defenses.

Ch. 03

Containment Over Prevention

You cannot prevent every fault, but you can keep the blast radius small and predictable. The goal is not "no failures" — it is "no failure ever takes more of the system with it than necessary."

Ch. 03

Memory Leaks and Data Growth Are the Chief Enemies of Longevity

Systems that run "forever" must actively shed accumulated state — rotate logs, expire caches, purge old data, recycle resources. Every system that requires fiddling to keep alive will eventually page someone at 3 a.m. Design for the long tail of uptime.

Ch. 03

Production Lifespan Dwarfs Development

Code lives in production years longer than it took to write. Operators turn over, business rules drift, dependencies are deprecated. Decisions made for the next sprint must survive the next decade.

Ch. 03 · Vocab
Fault
A condition that creates an incorrect internal state — a latent bug or unchecked boundary condition.
Error
Visibly incorrect behavior caused by a fault propagating outward.
Failure
An unresponsive system; when the system stops doing useful work for users.
Crack
Nygard's metaphor for the path a fault takes as it spreads from component to component.
Impulse
A short, sharp force applied to the system (DoS, flash crowd, sudden dependency outage).
Stress
A long-term force that grinds the system down (slow downstream, leak, gradually rising load).
Longevity
The system's ability to run continuously over long periods without degradation.
MTBF
Mean Time Between Failures — average runtime between incidents.
Ch. 03 · Quiz 1 / 4

Multiple choice

A bad input causes an internal data structure to enter an inconsistent state. The next user request reads that structure and returns garbage. Eventually the service stops responding to any requests at all. Map these events to Nygard's hierarchy.

Ch. 03 · Quiz 2 / 4

Multiple choice

A flash sale drives 20x normal traffic for ten minutes and the system melts. A slowly leaking cache eats heap over six weeks and the system melts. Which pair of terms best describes these forces?

Ch. 03 · Quiz 3 / 4

Spot the issue

A team proudly claims their new platform "will never fail" because every component has been hardened with retries, timeouts, and redundant instances. A senior engineer pushes back. What is the senior engineer's most likely objection?

Ch. 03 · Quiz 4 / 4

True / False

Once a system reaches production, the original development decisions no longer matter much because operators take over.

Ch. 04

Stability Antipatterns

Chapter 4 catalogs concrete failure modes Nygard has personally watched bring down production systems: how they start, why they spread, and why they are usually invisible in test. Each antipattern is a recurring shape of failure, almost always emerging along integration points and amplified by tight coupling, resource sharing, or naive retry logic.

Ch. 04

Integration Points

Every connection to another system — DB, API, message broker — is a stability risk. Every socket, RPC, or HTTP call can hang, time out, return garbage, or close unexpectedly. Every integration point will eventually fail in some way, and you have to be prepared for that failure.

Ch. 04

Chain Reactions

When one server in a pool dies, survivors absorb its share of load. If the original failure was load- or leak-induced, the others fail faster and the pool collapses horizontally. Identical instances share identical weaknesses.

Ch. 04

Cascading Failures

A failure in one layer travels upstream into its callers, who handle it poorly — retry storms, blocked threads, pile-ups — spreading the crack vertically through the stack. A failure handled badly is worse than the failure itself.

Ch. 04

Blocked Threads

Threads stuck waiting on locks, pools, or unresponsive remote calls. Almost every application-level outage in the book traces back to threads that cannot make progress. The pool drains; the lights stay on; nothing actually moves.

Ch. 04

Slow Responses

Worse than outright errors. Slow responses tie up caller resources, trigger user retries, and propagate latency throughout the system until everything is stuck waiting. Fast failure beats slow success.

Ch. 04

Unbounded Result Sets

Queries or APIs that return "however much is in the table" work fine in test and explode in production when the table grows. They exhaust memory, sockets, and patience. Always bound what comes back from across a boundary.
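
A bounded query might look like the sketch below, which caps every read and tells the caller whether more rows exist. The table, column names, and `MAX_ROWS` value are assumptions made up for the example; it uses an in-memory SQLite database so it runs anywhere.

```python
import sqlite3

MAX_ROWS = 100  # hypothetical server-side cap; callers cannot exceed it

def fetch_transactions(db, account_id, limit=MAX_ROWS):
    """Return at most `limit` rows plus a flag saying whether more exist."""
    limit = min(limit, MAX_ROWS)
    rows = db.execute(
        "SELECT id, amount FROM transactions "
        "WHERE account_id = ? ORDER BY id LIMIT ?",
        (account_id, limit + 1),  # one extra row detects a further page
    ).fetchall()
    has_more = len(rows) > limit
    return rows[:limit], has_more

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE transactions "
    "(id INTEGER PRIMARY KEY, account_id INTEGER, amount REAL)"
)
db.executemany(
    "INSERT INTO transactions (account_id, amount) VALUES (?, ?)",
    [(1, 9.99)] * 250,
)

page, has_more = fetch_transactions(db, account_id=1)
print(len(page), has_more)
```

The memory used per request now depends on the cap, not on how large the table has grown.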

Ch. 04

Self-Denial Attacks

The system attacks itself: a marketing email blast, a cache flush, an internal job that suddenly drives load the system was never sized for. Your own actions can be the DDoS.

Ch. 04

Dogpile and Thundering Herd

Many clients hammer a service simultaneously after a restart, cache expiry, or scheduled job. The synchronized stampede crushes whatever they are calling. Jitter your retries, stagger your timers.
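
"Jitter your retries" is often implemented as exponential backoff with a random component. The sketch below uses the "full jitter" variant, which is one common choice and not necessarily the one the book has in mind; the function name and default values are illustrative.

```python
import random

def backoff_delays(attempts, base=0.1, cap=10.0, rng=random):
    """Full-jitter backoff: delay n is uniform in [0, min(cap, base * 2**n)]."""
    return [rng.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]

# Two clients that fail at the same instant end up retrying at different times.
client_a = backoff_delays(5, rng=random.Random(1))
client_b = backoff_delays(5, rng=random.Random(2))
print([round(d, 3) for d in client_a])
print([round(d, 3) for d in client_b])
```

Without the random component every client would wake at 0.1 s, 0.2 s, 0.4 s and so on, recreating the stampede on each retry.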

Ch. 04 · Vocab
Antipattern
A common, repeatable design choice that looks reasonable but reliably produces bad outcomes.
Integration point
Any boundary where your system depends on another system to respond.
Retry storm
A flood of automatic retries from clients that turns a transient blip into sustained denial-of-service.
Resource pool
A bounded shared collection (threads, DB connections, sockets) where blocked-thread problems concentrate.
Tight coupling
A dependency relationship where failure or slowdown in one component directly stalls another.
Latent bug
A defect in code that does not manifest until a rare condition triggers it.
Force multiplier
Any mechanism (automation, orchestration, scripted ops) that amplifies a small action — or a small mistake — into a huge effect.
Thundering herd
A flood of synchronized clients converging on a resource at the same instant.
Ch. 04 · Quiz 1 / 4

Spot the issue

A fleet of eight identical app servers shares one database. One server crashes from a memory leak. The load balancer redistributes its traffic across the remaining seven, which now leak proportionally faster. Within 20 minutes the whole tier is down. What antipattern is this?

Ch. 04 · Quiz 2 / 4

Multiple choice

A downstream service starts taking 30 seconds to respond instead of returning errors. The upstream caller's thread pool fills with threads parked on the slow call. Soon the upstream caller is also unresponsive to everyone. What does Nygard say about this scenario?

Ch. 04 · Quiz 3 / 4

Multiple choice

A team adds an endpoint that returns "all transactions for this account." In test, accounts have a few rows; in production a year later, a long-lived account has 4 million rows and the endpoint OOMs the service. Which antipattern is this?

Ch. 04 · Quiz 4 / 4

Multiple choice

Marketing sends a promotional email to 5 million customers at 9 a.m. with a link that deep-links into the catalog service, knocking it over within minutes. Which pattern best names what happened?

Ch. 05

Stability Patterns

Chapter 5 is the counterweight to Chapter 4: a toolbox of patterns that limit blast radius, decouple components, and force the system to acknowledge its own limits. They share a philosophy — fail visibly and fast, isolate damage, never assume a remote call returns — and applied at integration points they convert outages into degraded-but-running operation.

Ch. 05

Timeouts

Every remote call, every resource acquisition, gets a bounded wait. Without timeouts, blocked threads accumulate until the pool drains. A call without a timeout is a call that can hang forever.
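
A minimal sketch of a bounded wait on a remote read, using only the standard library. The local socket that accepts connections but never replies is a stand-in for a hung dependency, and `read_with_timeout` is an illustrative helper.

```python
import socket

# A local server that completes the TCP handshake but never sends a byte.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
server.listen(1)
host, port = server.getsockname()

def read_with_timeout(host, port, timeout_s):
    """Return the reply bytes, or None if nothing arrives within timeout_s."""
    with socket.create_connection((host, port), timeout=timeout_s) as conn:
        try:
            return conn.recv(1024)
        except socket.timeout:
            return None  # the caller decides: fallback, error, or later retry

result = read_with_timeout(host, port, timeout_s=0.2)
server.close()
print(result)
```

With no timeout the `recv` call would park the calling thread for as long as the peer stays silent; with one, the thread is released after 0.2 seconds.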

Ch. 05

Circuit Breaker

Wraps a remote call with a state machine — closed, open, half-open. After repeated failures the breaker opens and fails fast for a cooldown, then tentatively probes recovery via a half-open trial call. Stop hitting something that is already broken.
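
The state machine can be sketched in a few lines. This is a single-threaded illustration with hypothetical names and thresholds, not a production breaker: it has no locking, and it counts consecutive failures rather than a failure rate.

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""

class CircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.cooldown_s:
                raise CircuitOpenError("failing fast")
            self.state = "half-open"  # cooldown elapsed: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the breaker again
        self.state = "closed"
        return result

# Usage with a fake clock so the cooldown can be stepped by hand.
now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, cooldown_s=10.0, clock=lambda: now[0])

def flaky():
    raise ConnectionError("downstream unavailable")

for _ in range(2):
    try:
        breaker.call(flaky)
    except ConnectionError:
        pass

print(breaker.state)
```

Injecting the clock keeps the example deterministic; a real deployment would use the default `time.monotonic`.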

Ch. 05

Bulkheads

Borrowed from ship design: partition resources — thread pools, connection pools, server groups — so a flood in one compartment cannot sink the vessel. Isolate your failure domains.
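
At the level of a single process, one way to build a bulkhead is a separate concurrency limit per dependency. The sketch below uses semaphores; the `Bulkhead` class and the "payments" and "search" compartments are hypothetical.

```python
import threading

class Bulkhead:
    """Caps concurrent calls to one dependency so it cannot take every worker."""

    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, fn, *args, **kwargs):
        # Reject immediately instead of queueing behind a stuck dependency.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError(f"bulkhead '{self.name}' is full")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()

payments = Bulkhead("payments", max_concurrent=1)
search = Bulkhead("search", max_concurrent=1)

def nested():
    # While the only payments slot is busy, a second payments call is
    # rejected, but the search compartment still has room.
    try:
        payments.run(lambda: "second payment")
    except RuntimeError as e:
        rejected = str(e)
    return rejected, search.run(lambda: "search ok")

outcome = payments.run(nested)
print(outcome)
```

A saturated payments compartment turns into fast rejections for payments only; search traffic keeps flowing.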

Ch. 05

Steady State

Design the system to run indefinitely without human intervention: rotate logs, purge old data, expire caches, recycle resources. Every system that needs routine fiddling is a future page. If it needs a cron job to stay alive, it is already dying.

Ch. 05

Fail Fast

Check preconditions — capacity, dependencies, input validity — up front and reject doomed requests immediately rather than burning resources only to fail later. A quick no is a kindness.
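
As a sketch, failing fast means running the cheap checks before any expensive work begins. The `place_order` function, its parameters, and the queue-depth limit are invented for illustration.

```python
class Rejected(Exception):
    """Raised before any expensive work has been started."""

def place_order(order, inventory_up, queue_depth, max_queue_depth=100):
    # Input validity: cheapest check first.
    if not order.get("items"):
        raise Rejected("empty order")
    # Dependency health: do not start work that needs a dependency known to be down.
    if not inventory_up:
        raise Rejected("inventory service unavailable")
    # Capacity: do not join a queue that cannot drain in time.
    if queue_depth >= max_queue_depth:
        raise Rejected("over capacity")
    return "accepted"

print(place_order({"items": ["sku-1"]}, inventory_up=True, queue_depth=5))
```

Each rejection costs a few comparisons, whereas discovering the same problem halfway through would have tied up a thread, a connection, and possibly a transaction.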

Ch. 05

Let It Crash

From the Erlang world: when a process gets into a bad state, the safest recovery is often to kill it and let a supervisor restart a clean instance. Pairs with strong isolation and immutable state. Don't reason about corrupt state — replace it.

Ch. 05

Handshaking and Back Pressure

Caller and callee cooperate on flow control. The server signals "I am full" or "slow down," and the client respects it. Producers slow down before queues overflow. Silence as a signal of saturation is a bug.

Ch. 05

Shed Load

When overloaded, deliberately drop or reject low-priority work to protect the system's ability to serve anything at all. Half the requests served beats all the requests stuck.
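
A bounded queue is the simplest form of load shedding: once it is full, new work is refused rather than parked. This sketch ignores request priority, and the status codes and queue size are illustrative choices.

```python
import queue

work = queue.Queue(maxsize=3)  # hypothetical capacity limit

def submit(job):
    """Accept the job if there is room, otherwise reject it immediately."""
    try:
        work.put_nowait(job)
        return 202  # accepted for processing
    except queue.Full:
        return 503  # shed: tell the caller to back off

statuses = [submit(f"job-{i}") for i in range(5)]
print(statuses)  # [202, 202, 202, 503, 503]
```

The rejection also acts as a crude back-pressure signal: callers learn at once that the service is saturated instead of inferring it from silence.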

Ch. 05 · Vocab
Pattern
A reusable design solution to a recurring problem, paired with the context it applies in.
Circuit breaker states
Closed (calls pass), Open (calls fail fast), Half-open (a trial call probes recovery).
Bulkhead
An isolation boundary that prevents one failing partition from consuming resources owned by another.
Timeout budget
The total time allowed for a request across all nested calls; child timeouts must be tighter than parent timeouts.
Back pressure
A feedback mechanism by which a downstream component tells upstream producers to slow down.
Load shedding
Deliberately dropping requests when capacity is exceeded to keep the rest of the system healthy.
Supervisor
A process that monitors children and restarts them on crash — foundational to Let It Crash.
Governor
A throttle that caps how quickly automation can act, so a force multiplier cannot run away.
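
The timeout budget entry above can be sketched as a deadline object passed down the call chain, so each nested call gets no more time than the request has left. The `Deadline` class and its method names are hypothetical.

```python
import time

class Deadline:
    """Tracks how much of a request's total time budget is left."""

    def __init__(self, budget_s, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_s

    def remaining(self):
        return max(0.0, self._expires_at - self._clock())

    def timeout_for(self, step_cap_s):
        """Timeout for a nested call: never more than what is left of the budget."""
        left = self.remaining()
        if left <= 0:
            raise TimeoutError("request budget exhausted")
        return min(step_cap_s, left)

# Usage with a fake clock so elapsed time can be stepped by hand.
now = [0.0]
deadline = Deadline(budget_s=1.0, clock=lambda: now[0])

first = deadline.timeout_for(0.8)   # plenty of budget left: the step cap applies
now[0] = 0.7                        # pretend the first call took 0.7 s
second = deadline.timeout_for(0.8)  # only about 0.3 s of budget remains
print(first, round(second, 3))
```

The second call is given about 0.3 seconds rather than its usual 0.8, so the child can never outlive the parent request.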
Ch. 05 · Quiz 1 / 5

Spot the issue

Assume `http.get` has no default timeout. What is the most important stability problem with this code, given Release It!'s patterns?

async function fetchUser(id) {
  const res = await http.get(`/users/${id}`);
  return res.body;
}
Ch. 05 · Quiz 2 / 5

Multiple choice

A downstream service has been returning errors on most requests for the last 30 seconds. Each caller still issues every new request to it and waits for the failure, burning threads. What pattern is missing?

Ch. 05 · Quiz 3 / 5

Multiple choice

A monolith uses one shared thread pool of 200 threads for every endpoint. One endpoint that calls a flaky third-party API starts hanging; within a minute every thread is parked on that one endpoint and the entire service is unresponsive. What pattern would best contain this?

Ch. 05 · Quiz 4 / 5

True / False

A system that requires a nightly cron job to restart services, clear caches, and trim log files is fine as long as the cron job works.

Ch. 05 · Quiz 5 / 5

Spot the issue

A retail site is overwhelmed by Black Friday traffic. Engineering's current plan is to keep every request queued and serve them all eventually rather than reject any. What stability pattern is being ignored?

Part 02

Design for Production

Ch. 6–11

Ch. 06

Case Study: Phenomenal Cosmic Powers, Itty-Bitty Living Space

A retail e-commerce system has the application-layer features ("phenomenal cosmic powers") but is crammed into infrastructure too small to support real-world demand ("itty-bitty living space"). Thanksgiving and Black Friday expose the gap between QA-validated functionality and production-required capacity, revealing how observability, headroom, and architectural assumptions break under peak load.

Ch. 06

Capacity vs. Demand Mismatch

The system worked in QA but had no headroom for real traffic peaks. Theoretical capacity from a sized-down test environment is not the same as production capacity under real traffic shape. Sizing is a measurement, not a guess.

Ch. 06

Holiday Traffic Is the Real Test

Black Friday and Cyber Monday expose weaknesses that uniform synthetic load never reveals. The calendar does not negotiate — you cannot postpone production hardening. Plan capacity for the worst hour of the worst day of the year.

Ch. 06

Diagnose Before You Treat

"Take the pulse, read the vital signs" — without metrics and observability you are guessing. Symptoms can have many causes; treating the wrong one wastes the only thing you have less of than CPU: time. You cannot fix what you cannot see.

Ch. 06

Compare Treatment Options

There is rarely one fix. Cost, risk, and time-to-recover differ between scaling up, scaling out, caching, and re-architecting. Pick the cheapest reversible move that buys time to think.

Ch. 06

Treatment May Not Work

Production fixes can fail or make things worse. You need a way to verify the change actually helped and to roll back if it did not. Every prod change ships with a rollback plan or it does not ship.

Ch. 06

Saturation Is Non-Linear

Queueing theory: latency grows nonlinearly as utilization approaches 100%. A system at 90% utilization has no headroom for spikes — the line bends sharply upward. Aim for utilization that leaves slack, not utilization that "looks efficient."
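
The shape of the curve can be illustrated with the simplest textbook queue. The sketch assumes an M/M/1 model (one server, random arrivals and service times), where mean response time is service time divided by (1 - utilization); real systems differ in detail, but the bend is the same. The 10 ms service time is an arbitrary example value.

```python
def mm1_response_time(service_time_s, utilization):
    """Mean time in system for an M/M/1 queue: S / (1 - rho)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_s / (1 - utilization)

for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
    ms = mm1_response_time(0.010, rho) * 1000
    print(f"{rho:.0%} busy -> {ms:.0f} ms")
```

Under this model a request that takes 10 ms on an idle server averages about 100 ms at 90% utilization and about 1 second at 99%.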

Ch. 06 · Vocab
Baseline
Pre-incident vital signs you measure normal behavior against to detect anomalies.
Heap dump / thread dump
Snapshot diagnostics for memory and concurrency bottlenecks.
Connection pool exhaustion
When all pooled connections are checked out and new requests block.
Throughput vs. latency
Throughput is requests completed per second; latency is time per request. As a resource nears saturation, throughput flattens while latency climbs.
Headroom
Spare capacity above expected peak load.
Saturation
A resource at 100% utilization; latency grows nonlinearly as you approach this point.
Rollback plan
A pre-defined way to undo a change so failed fixes do not become permanent damage.
Ch. 06 · Quiz 1 / 4

Multiple choice

QA has signed off on a release after running a uniform synthetic load test at 70% of last year's peak traffic. The team plans to ship the Tuesday before Black Friday. What is the strongest objection from this chapter?

Ch. 06 · Quiz 2 / 4

Multiple choice

Your service is sustaining 92% CPU utilization on average and the team is celebrating "great efficiency." Latency p99 has tripled this week. What does queueing theory predict?

Ch. 06 · Quiz 3 / 4

Spot the issue

Production is degraded. An engineer says, "I bet it's the database — let me bump the connection pool size from 50 to 200 and we'll see if it clears up." No metrics have been checked. What's wrong?

Ch. 06 · Quiz 4 / 4

True / False

A production fix that resolves the visible symptom is sufficient evidence the change worked; no rollback plan is needed once the dashboards look green.

Ch. 07

Foundations

Chapter 7 establishes the physical and virtual substrate on which production systems run: data-center and cloud networking, plus the spectrum of compute options from physical hosts through VMs to containers. The quirks of that substrate (multi-homing, NAT, ephemeral IPs, oversubscription) leak into application behavior in production, so developers must understand the layers beneath the application.

Ch. 07

Data Center vs. Cloud Networking

Data centers give you predictable topology and physical control. Cloud trades that for elasticity but introduces opaque shared infrastructure, virtual networking, and noisy neighbors. You can choose either, but you cannot pretend the choice does not matter.

Ch. 07

Multi-Homed Hosts

A machine with multiple network interfaces — front-end and back-end VLANs — must bind to the right interface. "Listen on 0.0.0.0" is a common security mistake that exposes internal services to the public side. Know which interface your service is listening on, and why.
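
Binding to one named address instead of every interface is a one-argument change at the socket level. In this sketch the loopback address stands in for a back-end interface address so the example runs on any machine; `listen_on` is an illustrative helper.

```python
import socket

def listen_on(address, port=0):
    """Listen only on the interface that owns `address`."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((address, port))  # "0.0.0.0" here would mean every interface
    sock.listen()
    return sock

# 127.0.0.1 is a stand-in for the back-end interface's address.
internal = listen_on("127.0.0.1")
bound_address = internal.getsockname()[0]
internal.close()
print(bound_address)
```

Connections arriving on any other interface never reach this listener, because the kernel only matches traffic addressed to the bound address.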

Ch. 07

Physical, Virtual, Container

Physical hosts give the best performance and predictability but longest provisioning. VMs add full OS isolation on a hypervisor but boot slowly and suffer from steal time. Containers share the host kernel for fast, dense, immutable deployment but offer weaker isolation. Pick the smallest unit that meets your isolation needs.

Ch. 07

Cattle, Not Pets

Treat machines as interchangeable, disposable units rather than uniquely named, hand-tended servers. Replace, do not repair. A pet has a name and a story; cattle have a tag and an autoscaler.

Ch. 07

The Network Is Not Transparent

Bandwidth, latency, MTU, name resolution, and packet loss all affect distributed systems. Pretending the network is a transparent wire eventually fails — usually loudly. The fallacies of distributed computing apply to you.

Ch. 07

Oversubscription and Steal Time

Hypervisors and switch fabrics allocate more virtual resources than physical ones exist, betting on non-simultaneous use. When the bet fails, your VM loses CPU time that its own usage figures do not account for; on Linux guests it may be reported separately as steal (`st` in `top`), and the application simply sees it as latency. Your performance is not yours alone.

Ch. 07 · Vocab
Hypervisor
Software (Type 1 bare-metal, Type 2 hosted) that runs VMs by virtualizing hardware.
NIC
Network Interface Card — the hardware/virtual adapter connecting a host to a network.
VLAN
A logical Layer 2 segment over a shared physical switch fabric, used to isolate traffic.
NAT
Network Address Translation — rewrites source/destination IPs at a boundary.
Container image
A read-only filesystem-layer bundle (e.g., OCI/Docker) used to instantiate containers.
Oversubscription
Allocating more virtual resources than physical resources exist, on the assumption of non-simultaneous use.
Steal time
CPU time the hypervisor took from your VM to give to another tenant; it does not count toward the guest's own CPU usage, though Linux guests can report it separately as `st`.
East-west vs. north-south
East-west is service-to-service inside the data center; north-south is in/out from the internet.
Ch. 07 · Quiz 1 / 4

Spot the issue

A backend service is deployed to a host with two NICs: `eth0` (public VLAN, 203.0.113.5) and `eth1` (back-end VLAN, 10.0.1.5). Given the startup config below, what's wrong?

server:
  bind: 0.0.0.0
  port: 8080
Ch. 07 · Quiz 2 / 4

Multiple choice

A VM running a JVM service shows p99 latency spikes despite the guest OS reporting only 35% CPU usage. `top` inside the VM looks healthy. What is the most likely explanation from this chapter?

Ch. 07 · Quiz 3 / 4

Multiple choice

A team insists on giving every production host a unique hostname (`zeus`, `apollo`, `athena`), manually patches each one when CVEs land, and keeps a runbook describing the quirks of each machine. Which philosophy from this chapter are they violating?

Ch. 07 · Quiz 4 / 4

True / False

Containers provide stronger isolation than VMs because they boot faster and use less memory.

Ch. 08

Processes on Machines

Once you have hosts, you need to run code on them. Every running process must get three things right: the executable code, its configuration, and its transparency — what it reveals about its own state. Nygard pushes for immutable, self-describing processes that externalize config and expose rich operational data.

Ch. 08

Build Once, Deploy Many

Immutable artifacts — containers, AMIs, hash-tagged JARs — ship unchanged from dev through prod. "Works on my machine" goes away because the machine is the artifact. Same bits in every environment, or every environment is a different system.

Ch. 08

Externalize Configuration

Configuration lives outside the artifact: environment variables, mounted files, config service. The same binary runs in dev, stage, and prod with different settings. The artifact is universal; the environment is specific.
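
A sketch of reading settings from the environment at startup. The `APP_*` variable names, the `Settings` fields, and the example URL are invented for illustration; the point is that the code is identical in every environment and only the injected values differ.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    database_url: str
    request_timeout_s: float
    feature_new_checkout: bool

def load_settings(env=os.environ) -> Settings:
    """Build settings from environment variables (names are hypothetical)."""
    return Settings(
        # Required: a missing value raises KeyError at startup, not mid-request.
        database_url=env["APP_DATABASE_URL"],
        request_timeout_s=float(env.get("APP_REQUEST_TIMEOUT_S", "2.0")),
        feature_new_checkout=env.get("APP_FEATURE_NEW_CHECKOUT", "false").lower() == "true",
    )

# A plain dict stands in for the process environment in this example.
settings = load_settings({"APP_DATABASE_URL": "postgres://db.internal/app"})
print(settings)
```

Passing the environment as a parameter keeps the loader easy to test; in a running service it would simply be called as `load_settings()`.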

Ch. 08

Build-Time vs. Runtime Configuration

Build-time settings are baked into the artifact; runtime settings come from the environment. Mixing them defeats immutability — you cannot rebuild every time a feature flag flips. Bake what does not change; inject what does.

Ch. 08

Transparency Built In

Processes must expose health, metrics, version, in-flight request counts, queue depths, and recent errors via standard endpoints — not require remote debuggers. A black box in production is a future incident.

Ch. 08

Liveness vs. Readiness

Liveness asks "should I be killed?": the process is running but may be wedged beyond recovery. Readiness asks "should I receive traffic?": the process is running and prepared to serve. Confusing the two breaks orchestrators: a liveness probe that fails during warm-up restarts pods that only needed more time. Two different signals, two different probes.
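
The two signals can be kept apart with two separate checks that answer different questions. The sketch below models them as methods returning HTTP status codes; the `Service` class and the cache warm-up flag are hypothetical, and wiring the methods to real probe URLs is left out.

```python
class Service:
    def __init__(self):
        self.cache_warm = False

    def warm_up(self):
        # Stand-in for a slow start-up step such as loading a cache.
        self.cache_warm = True

    def liveness(self):
        # "Should I be killed?" 200 as long as the process can answer at all.
        return 200

    def readiness(self):
        # "Should I receive traffic?" 503 until warm-up has finished.
        return 200 if self.cache_warm else 503

svc = Service()
during_warmup = (svc.liveness(), svc.readiness())
svc.warm_up()
after_warmup = (svc.liveness(), svc.readiness())
print(during_warmup, after_warmup)
```

During warm-up the process reports itself alive but not ready, so an orchestrator withholds traffic without restarting it.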

Ch. 08

Structured Logging with Correlation IDs

Logs should be machine-parseable — JSON or key=value — with a correlation ID propagated across services. Otherwise distributed logs are an unstitched pile that no aggregator can join. Eyeballs do not scale; grep does, if your logs let it.
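
A minimal sketch of JSON logs carrying a correlation ID, using the standard `logging` module. The field names, the "checkout" service, and the locally generated ID are assumptions; in a real system the ID would usually arrive in a request header and be forwarded on every outgoing call.

```python
import io
import json
import logging
import uuid

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "service": getattr(record, "service", None),
            "correlation_id": getattr(record, "correlation_id", None),
        })

stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.propagate = False

correlation_id = str(uuid.uuid4())  # normally taken from the incoming request
log.info(
    "order received",
    extra={"service": "checkout", "correlation_id": correlation_id},
)

entry = json.loads(stream.getvalue())
print(entry)
```

Because every service writes the same `correlation_id` field, an aggregator can join the lines for one request with a single filter.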

Ch. 08

Don't Ship Debug to Production

Strip development-only behaviors before promotion: debug servlets, verbose stack traces, default credentials, sample apps. What helps in dev hurts in prod — usually as an attack surface. Hardening starts with removing what you do not need.

Ch. 08 · Vocab
Twelve-Factor App
Set of 12 conventions for cloud-native services (codebase, dependencies, config, backing services, etc.).
Immutable infrastructure
Servers and containers are never modified after deployment; changes require a new artifact.
Configuration drift
Divergence between hosts that should be identical because of manual changes.
Environment variable
Process-level key/value pair set by the launcher; canonical 12-factor config mechanism.
Health check endpoint
A URL (often `/healthz`) returning process readiness or liveness for load balancers and orchestrators.
Liveness vs. readiness
Liveness asks "should I be killed?"; readiness asks "should I receive traffic?"
Graceful shutdown
Process drains in-flight work, deregisters from load balancers, then exits cleanly.
Correlation ID
A unique token propagated across services in a single request to stitch together distributed logs.
Ch. 08 · Quiz 1 / 4

Spot the issue

A team's Kubernetes deployment uses the same probe URL for both liveness and readiness, as configured below. The `/health` endpoint returns 200 only after a 90-second cache warm-up completes. Pods are getting killed and restarted in a loop. What's wrong?

livenessProbe:
  httpGet: { path: /health, port: 8080 }
readinessProbe:
  httpGet: { path: /health, port: 8080 }
Ch. 08 · Quiz 2 / 4

Multiple choice

A team bakes the production database password into their Docker image at build time so "developers can just `docker run` it." Which principle does this violate most directly?

Ch. 08 · Quiz 3 / 4

Multiple choice

After an outage spanning three microservices, the on-call engineer tries to reconstruct the request path. Each service logged its own activity, but the logs are plain text with no shared identifier. What practice from this chapter would have made this tractable?

Ch. 08 · Quiz 4 / 4

Multiple choice

"We rebuild the artifact for every environment so dev, stage, and prod each get a tuned image." Why is this an antipattern?

Ch. 09

Interconnect

How services find and talk to each other at scale. Nygard walks through solutions sized for different contexts — from DNS plus a load balancer for a small shop, up to dynamic service discovery for fleets of containers — plus demand control (do not accept work you cannot do) and migratory virtual IPs for HA.

Ch. 09

Pick Your Scale

DNS plus a hardware load balancer works for small, stable infrastructure. Dynamic environments with short-lived containers need automated service discovery (Consul, etcd, ZooKeeper, Kubernetes Services). Match interconnect complexity to fleet volatility.

Ch. 09

DNS Caveats

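DNS TTLs are advisory: client libraries and resolvers keep their own caches and may hold an answer well past its TTL, sometimes forever (the JVM does this by default when a security manager is installed). DNS changes therefore do not propagate instantly. DNS is a hint, not a contract.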
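One way to avoid the cache-forever trap is to bound the client's own cache by a TTL. The sketch below is illustrative only: `TtlResolver` is a hypothetical class, and the `lookup` function and `clock` are injected so the behavior can be shown without touching a real resolver.

```python
import time
from typing import Callable, Dict, List, Tuple


class TtlResolver:
    """Cache lookups, but never for longer than ttl_seconds."""

    def __init__(self, lookup: Callable[[str], List[str]], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._lookup = lookup
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[float, List[str]]] = {}

    def resolve(self, host: str) -> List[str]:
        now = self._clock()
        cached = self._cache.get(host)
        if cached is not None and now < cached[0]:
            return cached[1]  # still fresh
        addresses = self._lookup(host)  # expired or missing: ask again
        self._cache[host] = (now + self._ttl, addresses)
        return addresses


# Hypothetical records and a fake clock, for demonstration only.
records = {"db.internal": ["10.0.0.1"]}
fake_now = [0.0]
resolver = TtlResolver(lambda h: list(records[h]), ttl_seconds=60,
                       clock=lambda: fake_now[0])
first = resolver.resolve("db.internal")
```

With a 60-second bound, a changed record is picked up on the first lookup after the cached entry expires, rather than only after a restart.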

Ch. 09

L4 vs. L7 Load Balancing

L4 balancers route on TCP/UDP info — fast and dumb. L7 balancers inspect application protocols (HTTP headers, paths) and can do header-based routing, sticky sessions, retries, and circuit-breaking. L7 buys you smarts at the cost of latency.

Ch. 09

Demand Control

Do not politely accept requests you cannot serve. Shed load, queue with limits, signal back pressure. Silently absorbing more work than you can do guarantees an outage. Saying no is part of the API.
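As a rough illustration of "queue with limits", the sketch below refuses work once a bounded queue is full instead of buffering without limit. `BoundedIntake` and the status codes it returns are assumptions made for this example, not an API from the book.

```python
import queue


class BoundedIntake:
    """Accept work only while there is room; otherwise refuse immediately."""

    def __init__(self, max_pending: int) -> None:
        self._pending: "queue.Queue[str]" = queue.Queue(maxsize=max_pending)

    def submit(self, request_id: str) -> int:
        """Return an HTTP-style status: 202 if accepted, 503 if shed."""
        try:
            self._pending.put_nowait(request_id)
            return 202
        except queue.Full:
            # Fail fast so the caller can back off or go elsewhere.
            return 503

    def pending(self) -> int:
        return self._pending.qsize()


intake = BoundedIntake(max_pending=2)
statuses = [intake.submit(f"req-{i}") for i in range(3)]
```

The third request is rejected at once, which is the back-pressure signal: memory use stays bounded and the caller learns the service is saturated.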

Ch. 09

Service Discovery Models

Client-side: caller queries the registry and picks an instance itself. Server-side: caller hits a stable endpoint that routes for them. Each has tradeoffs in latency, complexity, and dependency on the registry. Decide where the discovery logic lives.
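A toy version of the client-side model, to make the division of labor concrete. `Registry` and `ClientSideBalancer` are hypothetical in-memory stand-ins; a real registry would be a networked service such as those named in this chapter.

```python
import itertools
from typing import Dict, List


class Registry:
    """In-memory stand-in for a service registry."""

    def __init__(self) -> None:
        self._instances: Dict[str, List[str]] = {}

    def register(self, service: str, address: str) -> None:
        self._instances.setdefault(service, []).append(address)

    def lookup(self, service: str) -> List[str]:
        return list(self._instances.get(service, []))


class ClientSideBalancer:
    """The caller queries the registry and picks an instance itself."""

    def __init__(self, registry: Registry, service: str) -> None:
        self._registry = registry
        self._service = service
        self._counter = itertools.count()

    def pick(self) -> str:
        instances = self._registry.lookup(self._service)
        if not instances:
            raise LookupError(f"no instances registered for {self._service}")
        # Simple round-robin over whatever the registry currently lists.
        return instances[next(self._counter) % len(instances)]


registry = Registry()
registry.register("search", "10.0.0.1:8080")
registry.register("search", "10.0.0.2:8080")
balancer = ClientSideBalancer(registry, "search")
picks = [balancer.pick() for _ in range(3)]
```

In the server-side model the `pick` logic would live behind a stable endpoint instead, and the caller would know only that endpoint's address.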

Ch. 09

Migratory Virtual IPs

A floating IP can move between hosts on failover. Combined with health checks this gives basic active/passive HA without changing what clients see. A stable address over an unstable host.

Ch. 09 · Vocab
DNS TTL
How long a resolver may cache an answer; short TTLs enable fast failover but cost lookup volume.
Round-robin DNS
Returning multiple A records and rotating; cheap "load balancing" with no health awareness.
Sticky session
Load balancer pins a client to one backend (by cookie or source IP) so session state stays local.
Backpressure
A downstream signal telling upstream producers to slow down.
Ch. 09 · Vocab
Virtual IP (VIP)
An IP not bound to a single physical host; can be moved or shared.
Service registry
A directory (Consul, etcd, Eureka, ZooKeeper) where services register and clients look them up.
Sidecar proxy
A local proxy (Envoy, linkerd) handling discovery, retries, mTLS, and metrics for an adjacent app.
Anycast
One IP advertised from multiple locations via BGP; clients reach the topologically nearest one.
Ch. 09 · Quiz 1 / 4

Spot the issue

A Java service caches DNS lookups for the lifetime of the JVM. Ops promotes a new database primary by updating the DNS A record (TTL 60s). Clients should follow within a minute, but the Java service keeps hammering the dead host until restart. What's wrong?

Ch. 09 · Quiz 2 / 4

Multiple choice

A team needs to route HTTP requests to different backend pools based on URL path (`/api/v1` vs `/api/v2`) and inject retry behavior on idempotent calls. Which load balancer choice fits this requirement?

Ch. 09 · Quiz 3 / 4

Multiple choice

An order service can sustain 800 req/s but is being driven at 1,500 req/s. The maintainers have configured the service to silently buffer every incoming request in memory because "dropping is rude." What does this chapter prescribe instead?

Ch. 09 · Quiz 4 / 4

True / False

A two-person team running a handful of stable, long-lived VMs should still adopt dynamic service discovery (Consul/etcd) because it is the modern best practice.

Ch. 10

Control Plane

The control plane is the meta-system that manages your services: what is deployed where, what config they have, what they are doing, and how to change all of it safely. Every production system needs a control plane — but overbuilding it is as dangerous as not having one.

Ch. 10

Mechanical Advantage

The control plane gives operators leverage: one change pushes to thousands of nodes. Without it, ops effort scales linearly with fleet size and becomes the bottleneck before anything else. The control plane is how a small team runs a big system.

Ch. 10

How Much Is Right for You?

A two-person team running ten servers does not need Kubernetes. A 200-person org running thousands does. Control-plane complexity should match operational scale, not aspiration. Buying a 747 for a Cessna route is its own kind of outage.

Ch. 10

Development Is Production

The control plane itself is production software. It must be monitored, version-controlled, tested, and HA. An outage in your deploy pipeline is an outage that prevents fixing the outage. A broken control plane is a broken hospital.

Ch. 10

System-Wide Transparency

Aggregated metrics, logs, and traces across the fleet. You cannot manage what you cannot see. Three pillars: time-series metrics, centralized logs, distributed traces — each answering a different kind of question. Observability is non-negotiable infrastructure.

Ch. 10

Data Plane vs. Control Plane

The data plane handles user requests; the control plane decides where workloads run and how they are configured. The data plane must keep working when the control plane is down. Survive the meta-system going dark.

Ch. 10

Command and Control

APIs to issue operational commands across the fleet — restart, drain, scale, rollback. Must be authenticated, audited, and idempotent so retries do not double-apply. Every powerful button needs an audit log.
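One common way to make a command safe to retry is to attach a caller-chosen command ID and replay the stored outcome on repeats. The sketch below assumes that approach; `FleetController`, `command_id`, and the in-memory stores are hypothetical, and the counter stands in for the real restart.

```python
from typing import Dict, List


class FleetController:
    """Idempotent, audited restart command (in-memory sketch)."""

    def __init__(self) -> None:
        self._results: Dict[str, str] = {}  # command_id -> outcome
        self.audit_log: List[str] = []
        self.restarts = 0

    def restart(self, command_id: str, operator: str) -> str:
        # Every attempt is audited, including retries.
        self.audit_log.append(f"{operator} requested restart {command_id}")
        if command_id in self._results:
            # Retry of a command already applied: replay, do not re-run.
            return self._results[command_id]
        self.restarts += 1  # stand-in for actually restarting the fleet
        self._results[command_id] = "restarted"
        return "restarted"


controller = FleetController()
first = controller.restart("cmd-001", operator="alice")
retry = controller.restart("cmd-001", operator="alice")
```

The retry returns the same result but the fleet is restarted only once, while the audit log still records both requests.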

Ch. 10 · Vocab
Control plane
The management layer that decides where workloads run and how, separate from the data plane.
Data plane
The runtime path that handles user requests; should keep working even if the control plane is down.
Time-series metrics
Numeric measurements stamped with time (Graphite, Prometheus, InfluxDB).
Centralized log aggregation
Shipping logs from every process to a central indexed store (ELK, Splunk, Loki).
Ch. 10 · Vocab
Distributed tracing
Following one logical request across many services (Zipkin, Jaeger, OpenTelemetry).
Scheduler
Component deciding which node a workload runs on (Kubernetes scheduler, Mesos, Nomad).
Provisioning
Creating and configuring infrastructure itself — VMs, networks, security groups.
Idempotent operation
Can be applied multiple times with the same end result; essential for retryable control-plane commands.
Ch. 10 · Quiz 1 / 4

Multiple choice

A two-person startup running ten EC2 instances is debating whether to stand up a full Kubernetes cluster, a service mesh, GitOps tooling, and a distributed tracing backend "to be ready for scale." What does this chapter advise?

Ch. 10 · Quiz 2 / 4

Spot the issue

A team deploys a new release tool: a single shell script run from one engineer's laptop, with no monitoring, no version control, no tests, and no HA. It pushes to the entire fleet. What's wrong?

Ch. 10 · Quiz 3 / 4

Multiple choice

During a regional outage, the team's Kubernetes API server is unreachable. User-facing requests are still being served by pods that were already running. What property does this illustrate?

Ch. 10 · Quiz 4 / 4

Spot the issue

A fleet-wide `restart` command is implemented as a fire-and-forget POST. Network blip causes the operator to retry — and every node gets restarted twice, dropping all in-flight traffic. What's wrong?

Ch. 11

Security

A pragmatic security chapter aimed at developers, not pen-testers. Security is an ongoing process, not a feature: cover the OWASP Top 10 because it is derived from real attack data, apply the Principle of Least Privilege relentlessly, treat configured passwords as real credentials, and keep at it because the threat surface keeps moving.

Ch. 11

OWASP Top 10 Is Data, Not Opinion

OWASP aggregates real attack data from member organizations and revises the Top 10 every few years. It tells you what actually compromises systems, not what is theoretically scariest. Design against it because that is what attackers actually use. Defend against attacks that happen, not attacks that might.

Ch. 11

Injection Is Still Number One

SQL, command, and LDAP injection topped the OWASP Top 10 for years and remain a common compromise vector. The mitigation is conceptually simple (never concatenate user input into a query), yet it is still skipped. Parameterize, always. There is no exception that is worth the bug.
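A small demonstration of parameterization using Python's built-in `sqlite3` module. The `orders` table and the `orders_for` helper are made up for this example; the `?` placeholder is the part that matters.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id TEXT)")
conn.executemany("INSERT INTO orders (customer_id) VALUES (?)",
                 [("alice",), ("bob",)])


def orders_for(customer_id: str) -> list:
    # The ? placeholder sends the value separately from the SQL text,
    # so the input is treated as data and never parsed as SQL.
    cur = conn.execute(
        "SELECT id, customer_id FROM orders WHERE customer_id = ?",
        (customer_id,),
    )
    return cur.fetchall()


legit = orders_for("alice")
attack = orders_for("x' OR '1'='1")
```

The classic `' OR '1'='1` payload matches no rows here because it is compared as a literal string rather than spliced into the query.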

Ch. 11

Principle of Least Privilege

A process gets the minimum privilege needed to do its job. Never run app servers as root, never grant a service DB account `GRANT ALL`, never give a deploy key broader access than the repo it deploys. The blast radius of a compromise is the privilege of the compromised account.

Ch. 11

Configured Passwords Are Real Passwords

Database passwords, API keys, and TLS keys in config files are credentials. Treat them as credentials: do not commit them, do not bake them into container images, rotate them, audit access, prefer short-lived dynamic credentials issued by a vault. A secret in a Git history is a published secret.
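At its simplest, keeping a secret out of the repo and the image means reading it from the process environment at startup. The sketch below assumes a variable named `DB_PASSWORD`, which is a hypothetical name; a vault client would replace this lookup in a fuller setup.

```python
import os
from typing import Mapping


def database_password(env: Mapping[str, str] = os.environ) -> str:
    """Read the secret from the environment; fail loudly if it is absent."""
    try:
        return env["DB_PASSWORD"]
    except KeyError:
        # Refuse to start rather than fall back to a baked-in default.
        raise RuntimeError("DB_PASSWORD is not set; refusing to start") from None
```

Passing the environment as a parameter keeps the function testable, and the lack of a default value means a missing secret is caught at startup instead of at the first query.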

Ch. 11

Defense in Depth

Multiple overlapping controls — network segmentation, WAF, app input validation, parameterized queries, output encoding, monitoring — so that one layer failing does not equal compromise. Assume any single control will fail.

Ch. 11

Security as an Ongoing Process

Patch dependencies (CVEs land weekly), rotate keys, review least-privilege grants, run security tests in CI, monitor for anomalies. A "secure" snapshot decays the moment it is taken. You are not secure; you are securing.

Ch. 11 · Vocab
OWASP
Open Web Application Security Project — vendor-neutral nonprofit producing the Top 10 and security tooling.
Injection
Attacker-controlled data is interpreted as code/query (SQLi, command injection, LDAP injection).
XSS
Cross-Site Scripting — attacker injects JavaScript that runs in another user's browser.
CSRF
Cross-Site Request Forgery — a logged-in user's browser is tricked into submitting an unintended request.
Ch. 11 · Vocab
Insecure deserialization
Reading attacker-controlled serialized objects, letting attackers execute code on deserialization.
Principle of Least Privilege
Granting an account only the minimum privileges necessary for its job.
Secrets vault
A system storing and dispensing credentials with auth, audit, leasing, and rotation (Vault, AWS Secrets Manager).
CVE
Common Vulnerabilities and Exposures — standardized ID for a known vulnerability.
Ch. 11 · Quiz 1 / 4

Spot the issue

A new feature builds a SQL query as shown below. The team argues that since `id` comes from an authenticated session, it is safe. What's wrong?

String q = "SELECT * FROM orders WHERE customer_id = '" + req.getParameter("id") + "'";
stmt.execute(q);
Ch. 11 · Quiz 2 / 4

Multiple choice

An app server runs as the OS root user, and the database account it connects with has `GRANT ALL ON *.*`. The team argues that least-privilege grants are "a hassle to manage." What is the real cost from this chapter's perspective?

Ch. 11 · Quiz 3 / 4

True / False

Once a system has been hardened, completed a security audit, and shipped to production, it can be considered "secure" until the next major release.

Ch. 11 · Quiz 4 / 4

Spot the issue

A team commits `application.yml` with the production database password to their git repo. After being caught, they rotate the password and force-push to remove the file. What's still wrong?

Part 03

Deliver Your System

Ch. 12–14

Ch. 12

Case Study: Waiting for Godot

An organization paralyzed by its own deployment process: releases require armies of people, hours of choreographed downtime, manual checklists, change-control boards, and weekend war rooms — yet still fail and roll back. The chapter dramatizes how treating deployment as a rare, dangerous event creates the very risk teams try to manage away.

Ch. 12

Deployment as a Feared Event

Rare manual releases accumulate huge batches of changes, which guarantees risk. Organizations respond with more process, which makes releases rarer, which makes them riskier — a vicious cycle. The fear of deploying is the cause of the danger of deploying.

Ch. 12

The Deployment Dance

Multi-team handoffs — dev to QA to release engineering to ops to DBAs — where everyone waits on someone else. The system is not broken; it is just that nobody can ship without choreographing twelve people. Friction is a deployment defect.

Ch. 12

Planned Downtime as Theater

Maintenance windows give the illusion of control but actually mask that the software was never built to be deployed safely while running. Downtime is not a plan — it is a confession.

Ch. 12

Big-Bang Releases Make Debugging Impossible

Months of accumulated changes shipped at once make it impossible to isolate which change caused a regression. Post-deploy debugging takes longer than the deploy itself. Big batches multiply blame; they do not multiply throughput.

Ch. 12

Deployability Is a Design Property

Like security or performance, "easy to deploy" must be engineered into the system from the start. It cannot be bolted on by ops afterward. You ship as well as you designed for shipping.

Ch. 12

Cost of Delay

Every day a feature sits undeployed has business cost, opportunity cost, and growing technical-debt cost as the gap between trunk and production widens. Undeployed code is inventory, and inventory rots.

Ch. 12 · Vocab
Deployment
Moving a new version of code from development onto production infrastructure.
Release
The business event of making functionality available to users; distinct from deployment.
Maintenance window
A pre-announced period of planned downtime during which a system is unavailable for upgrades.
Change Advisory Board (CAB)
A committee that reviews and approves proposed production changes.
Ch. 12 · Vocab
Big-bang release
A release containing many changes accumulated over a long period, deployed all at once.
Rollback
Reverting to a prior version when a release fails.
War room
A physical or virtual gathering during a high-risk release; a symptom of fragile deployment processes.
Lead time
The elapsed time from a code change being committed to running in production.
Ch. 12 · Quiz 1 / 4

Multiple choice

An e-commerce company releases once a quarter. Each release requires a 50-page checklist, a Saturday-night war room, three teams on standby, and usually one rollback. Leadership responds by adding a new pre-release sign-off step from the CAB. What dynamic best describes what is happening?

Ch. 12 · Quiz 2 / 4

True / False

A four-hour planned maintenance window is a sign that the team has carefully designed their deployment for control and safety.

Ch. 12 · Quiz 3 / 4

Multiple choice

After a quarterly big-bang release, a checkout regression appears. The team spends four days bisecting because the release contained 470 commits across 14 services. Which idea most directly diagnoses the pain?

Ch. 12 · Quiz 4 / 4

Spot the issue

A team finishes a feature on Monday but cannot ship it until the next monthly CAB meeting three weeks away. Management treats this as normal lead time. From a Release It! perspective, what is the core problem?

Ch. 13

Design for Deployment

Deployability is a first-class design concern. Software must be built so machines (not humans) can deploy it, repeatedly, without downtime. The chapter walks through automated build pipelines, the fallacy of planned downtime, the phases of a real deployment, and the canonical zero-downtime strategies — rolling, blue/green, and canary — including the schema-migration patterns that make them work.

Ch. 13

So Many Machines

Horizontal scaling and microservices put hundreds or thousands of nodes behind one service. At that scale any manual per-node step is too slow and too error-prone to sustain. Automate or surrender to the fleet.

Ch. 13

Build Pipeline

An automated chain takes a commit and produces deployable, versioned, immutable artifacts: compile, unit test, package, integration test, security scan, publish. The pipeline is the single path to production; nothing reaches prod that did not traverse it. One door in, monitored both sides.

Ch. 13

Immutable Artifacts

Do not patch running servers. Build a new image (container, AMI, hash-tagged JAR) with the new code baked in, deploy it, and discard the old one. This eliminates configuration drift and "works on my machine." The artifact is the unit of change.

Ch. 13

Phases of Deployment

A deploy is not one event. Relocatability: get the new artifact onto target machines. Installation: start new processes. Transition: shift traffic, run schema changes. Cleanup: retire old versions. Each phase has its own failure modes. Deploy is a workflow, not a moment.

Ch. 13

Rolling, Blue/Green, Canary

Rolling: replace instances in waves behind a load balancer; old and new run simultaneously. Blue/Green: maintain two parallel environments; flip traffic; instant rollback. Canary: send 1% then 5% then 25% to the new version while monitoring; promote or abort. Each gives you a different rollback story.
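The canary loop can be sketched as a walk through traffic percentages with a health gate at each step. Everything here is an assumption for illustration: `run_canary`, the error-rate threshold, and the `error_rate_at` callback, which stands in for a real metrics query.

```python
from typing import Callable, List, Sequence


def run_canary(steps: Sequence[int],
               error_rate_at: Callable[[int], float],
               max_error_rate: float) -> List[str]:
    """Shift traffic step by step; abort at the first unhealthy reading."""
    log: List[str] = []
    for percent in steps:
        observed = error_rate_at(percent)  # stand-in for a metrics query
        if observed > max_error_rate:
            log.append(f"abort at {percent}%")
            return log
        log.append(f"ok at {percent}%")
    log.append("promote")
    return log


healthy = run_canary([1, 5, 25], lambda pct: 0.001, max_error_rate=0.01)
regressed = run_canary([1, 5, 25],
                       lambda pct: 0.05 if pct >= 5 else 0.001,
                       max_error_rate=0.01)
```

In the regressed run only the 1% and 5% slices ever see the bad version, which is the rollback story canary buys you.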

Ch. 13

Expand/Contract for Schemas

The cornerstone of zero-downtime schema changes. Expand: add the new column/endpoint alongside the old; deploy code that reads either. Migrate: backfill data, switch readers and writers. Contract: remove the old shape once nothing depends on it. Every intermediate state must be deployable.
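The three phases can be walked through against an in-memory SQLite database. The `users` table, the `email_v2` column, and the lowercase normalization are hypothetical choices for this sketch; the contract step is left as a comment because it belongs in a later deploy.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
db.execute("INSERT INTO users (email) VALUES ('Ada@Example.com')")

# Expand: add the new column next to the old one; old code keeps working.
db.execute("ALTER TABLE users ADD COLUMN email_v2 TEXT")


def read_email(user_id: int) -> str:
    """Transitional reader: prefer the new column, fall back to the old."""
    new, old = db.execute(
        "SELECT email_v2, email FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    return new if new is not None else old


before_backfill = read_email(1)

# Migrate: backfill the new column from the old data.
db.execute("UPDATE users SET email_v2 = lower(email) WHERE email_v2 IS NULL")
after_backfill = read_email(1)

# Contract: in a later deploy, once nothing reads `email` any more:
# db.execute("ALTER TABLE users DROP COLUMN email")
```

The reader works both before and after the backfill, so old and new application versions can run side by side at every step.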

Ch. 13

Decouple Deploy from Release

Use feature flags so code can be deployed dark (live but inactive) and turned on later. The technical act of shipping becomes independent of the business act of launching. Two switches, two timelines.
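In its simplest form a feature flag is a runtime lookup guarding the new code path. The sketch below keeps flags in memory; `FeatureFlags`, the `holiday-banner` flag, and `checkout_banner` are hypothetical names, and a real system would read flags from a config service.

```python
from typing import Iterable, Optional, Set


class FeatureFlags:
    """In-memory flag store; real systems back this with a config service."""

    def __init__(self, enabled: Optional[Iterable[str]] = None) -> None:
        self._enabled: Set[str] = set(enabled or [])

    def is_on(self, name: str) -> bool:
        return name in self._enabled

    def turn_on(self, name: str) -> None:
        self._enabled.add(name)


def checkout_banner(flags: FeatureFlags) -> str:
    # The new code is deployed either way; the flag decides if it runs.
    if flags.is_on("holiday-banner"):
        return "Holiday sale!"
    return "Welcome"


flags = FeatureFlags()
dark = checkout_banner(flags)       # deployed, not yet released
flags.turn_on("holiday-banner")
released = checkout_banner(flags)   # released without a new deploy
```

The deploy happens when the code ships; the release happens when `turn_on` is called, possibly days later.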

Ch. 13 · Vocab
Continuous Deployment
Every commit that passes the pipeline goes to production automatically.
Continuous Delivery
Every commit produces a release candidate that could go to production; final push may be a human decision.
Build pipeline
Automated, stage-gated path from commit to deployable artifact.
Immutable artifact
A built package never modified after creation; redeploying produces a new artifact.
Ch. 13 · Vocab
Rolling deploy
Replace instances in waves while keeping capacity above minimum; requires mixed-version compatibility.
Blue/Green deploy
Two parallel production stacks; flip a router to cut over; instant rollback by flipping back.
Canary deploy
Gradual traffic shift to the new version with automatic rollback on metric regression.
Feature flag
A runtime switch decoupling code deployment from feature release.
Ch. 13 · Vocab
Expand/contract
A three-phase migration pattern keeping every intermediate state deployable.
Drain
Stop sending new requests to an instance and let in-flight requests complete before shutdown.
Ch. 13 · Quiz 1 / 4

Multiple choice

A platform team manages 1,800 service instances across three regions and is debating whether to keep using a wiki-based deploy runbook that ops runs by hand on each box. From a Release It! perspective, what is the strongest argument against the runbook?

Ch. 13 · Quiz 2 / 4

Spot the issue

The team runs the migration below during a rolling deploy. Old app instances are still serving traffic when the migration starts. What's the fundamental problem?

-- Migration v42 (deployed in lockstep with app v42)
ALTER TABLE users DROP COLUMN email;
ALTER TABLE users RENAME COLUMN email_v2 TO email;
Ch. 13 · Quiz 3 / 4

Multiple choice

A team wants instant rollback for a high-risk release. Capacity cost is acceptable but mid-deploy mixed-version traffic is not. Which strategy fits best?

Ch. 13 · Quiz 4 / 4

Spot the issue

A team commits a feature on Friday but wants to launch it for a Monday marketing event. They hold the merge until Sunday night so the code only reaches production at launch time. What pattern would let them ship Friday safely without coupling deploy to launch?

Ch. 14

Handling Versions

Once you can deploy any service at any time, multiple versions of your code coexist in production at every moment — across rolling-deploy windows, long-running clients, mobile apps that update on the user's schedule. The unifying principle is Postel's Law: be conservative in what you send, liberal in what you accept — operationalized through nonbreaking changes, version negotiation, and explicit contract testing.

Ch. 14

Postel's Law

The Robustness Principle from TCP: producers emit strictly correct, minimal output; consumers accept anything reasonable, ignoring fields they do not understand. This is the single most important rule for surviving version skew. Liberal in; conservative out.

Ch. 14

Nonbreaking Changes Only

Adding optional fields, endpoints, or enum values is safe. Removing fields, renaming them, tightening validation, or changing semantics is not. When you must break, version explicitly. Additive evolution buys decades of headroom.

Ch. 14

Tolerant Reader

A consumer-side pattern: parse only the fields you need, ignore the rest, never fail because the response contains something unexpected. Pairs with Postel's Law. Read what you came for; ignore the rest.
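A tolerant reader can be as small as a parser that pulls out the fields it needs and never inspects the rest. The `User` type and `read_user` function below are illustrative names, not code from the book.

```python
import json
from dataclasses import dataclass


@dataclass
class User:
    id: int
    name: str


def read_user(payload: str) -> User:
    """Extract only the fields this consumer needs; ignore everything else."""
    doc = json.loads(payload)
    return User(id=int(doc["id"]), name=str(doc["name"]))


# A newer producer has added a field this consumer has never heard of.
user = read_user('{"id": 42, "name": "Ada", "newFlag": true}')
```

Because unknown keys are never examined, the producer can keep adding optional fields without breaking this consumer.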

Ch. 14

Version Intolerance

A bug where a service rejects a perfectly valid request just because the version marker is unfamiliar — even though the content would have worked. TLS servers hanging up on TLS 1.3 ClientHellos despite supporting 1.2 is the classic example. Do not fail on novelty.

Ch. 14

Version Negotiation

Client and server explicitly agree on a wire version at session start, rather than guessing. Each side advertises what it speaks; both downshift to the highest common version. An explicit handshake beats a silent assumption.
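The "highest common version" rule reduces to a set intersection. This is a bare sketch assuming integer version numbers; `negotiate` is a hypothetical helper, and real protocols carry the advertised versions in their handshake messages.

```python
from typing import Iterable, Optional


def negotiate(client_versions: Iterable[int],
              server_versions: Iterable[int]) -> Optional[int]:
    """Return the highest version both sides speak, or None if there is none."""
    common = set(client_versions) & set(server_versions)
    return max(common) if common else None


agreed = negotiate([1, 2, 3], [2, 3, 4])
no_overlap = negotiate([1], [2, 3])
```

Returning `None` makes the no-overlap case explicit, so the caller can fail with a clear error instead of guessing a version.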

Ch. 14

Contract Testing

Producers and consumers run automated tests against a shared contract artifact (Pact, OpenAPI spec). Breaking changes are caught in the pipeline, not at 3 a.m. in production. Each side proves it still honors the contract. Catch incompatibility in CI, not in customer logs.

Ch. 14 · Vocab
Postel's Law
Be conservative in what you send, liberal in what you accept — the Robustness Principle.
Tolerant Reader
Consumer-side parsing that extracts only the fields it needs and ignores the rest.
Version Intolerance
A defect where a participant rejects valid messages solely because of an unfamiliar version marker.
Breaking change
A change requiring every caller to update simultaneously (removed field, renamed field, narrowed type).
Ch. 14 · Vocab
Nonbreaking change
A change an unaware client can ignore safely (new optional field, new endpoint).
Contract test
An automated test verifying a service still satisfies the contract its consumers depend on.
Consumer-driven contract
A contract defined by what consumers actually use, captured as executable expectations.
Forward compatibility
Old code can handle data produced by newer code by tolerating unknown fields.
Ch. 14 · Vocab
Semantic versioning
MAJOR.MINOR.PATCH convention; MAJOR bumps signal breaking changes.
Ch. 14 · Quiz 1 / 4

Multiple choice

A service receives JSON like `{"id": 42, "name": "Ada", "newFlag": true}`. The consumer was written before `newFlag` existed and crashes with "unexpected property: newFlag." Which principle is being violated?

Ch. 14 · Quiz 2 / 4

Spot the issue

The team updates the API so `email` is now a required field on every PATCH, returning 400 if it is missing. Mobile clients on older app versions still send PATCHes without `email`. What is wrong with this change?

PATCH /v1/users/42
{ "email": "ada@example.com" }
Ch. 14 · Quiz 3 / 4

True / False

TLS servers that hang up on TLS 1.3 ClientHellos despite supporting TLS 1.2 are an example of version intolerance.

Ch. 14 · Quiz 4 / 4

Multiple choice

Team A owns a producer service. Team B owns three different consumers. They want to catch breaking changes before a deploy reaches production, without coordinating release schedules. Which approach best matches the chapter?

Part 04

Solve Systemic Problems

Ch. 15–17

Ch. 15

Case Study: Trampled by Your Own Customers

A retailer's big-bang brand-and-site launch concentrates risk into a single irreversible event. Months of QA and load testing fail to predict the real-world traffic shape, session lifetimes, and database contention that hit on day one. Sessions bloat, heaps run out, search and DB connections saturate, and the team scrambles through stabilization under public scrutiny.

Ch. 15

Aiming for Quality Assurance

QA validates features against requirements. It does not reveal how the system behaves under production-shaped load, traffic mix, or session lifetimes. Passing QA is not the same as production-ready.

Ch. 15

Load Testing Has Limits

In the case study, the load tests used unrealistic scripts: synthetic browsing patterns, no think-time variance, small session counts. They never exposed the memory and locking pathologies real users would cause. Test the traffic shape, not just the traffic volume.

Ch. 15

Murder by the Masses

When real traffic hit, sessions exploded in memory, the cluster ran out of heap, search and database connections saturated, and one tier's failure cascaded across the others. Session bloat is the silent killer of launches.
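A back-of-envelope estimate shows how quickly session memory adds up. By Little's Law, the number of live sessions is roughly the arrival rate times how long each session is retained. All figures below are hypothetical and chosen only to make the arithmetic visible; they are not from the case study.

```python
def session_heap_bytes(new_sessions_per_sec: int,
                       session_timeout_sec: int,
                       bytes_per_session: int) -> int:
    """Estimate steady-state heap held by sessions.

    Little's Law: live sessions ~= arrival rate * retention time.
    This treats the timeout as the retention time, so it is a lower
    bound for sessions that stay active before idling out.
    """
    live_sessions = new_sessions_per_sec * session_timeout_sec
    return live_sessions * bytes_per_session


# Hypothetical launch-day numbers: 50 new sessions per second,
# a 30-minute timeout, and 100 KB of state per session.
estimate = session_heap_bytes(50, 30 * 60, 100_000)
```

With these inputs the estimate is 90,000 live sessions and 9 GB of heap, before counting any other allocation, which is why session size and timeout deserve explicit capacity targets.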

Ch. 15

The Testing Gap

There is always a delta between what testing covers and what production demands. Closing the gap requires production-like data volumes, realistic session models, longevity tests, and explicit capacity targets. Production is always the test you did not run.

Ch. 15

Systemic Failure

The root cause was not a single bug; it was a chain of decisions across architecture, organization, marketing, and operations. No one component "broke," but the system as a whole could not survive its own success. The bug is not always in the code.

Ch. 15

Cold Start Is the Most Dangerous Moment

The first traffic after launch — caches empty, pools unwarmed, system most fragile — is when load hits hardest. A staged rollout that warms the system beats a midnight flip every time. Warm before you welcome.

Ch. 15 · Vocab
Session bloat
Uncontrolled growth of per-user state held in server memory, eventually consuming heap.
Cascading failure
A failure in one subsystem that triggers failures in callers and dependents.
Load test
A test that drives a system with synthetic concurrent traffic to measure throughput, latency, and breaking points.
Longevity test
A long-duration test designed to reveal slow leaks and accumulating state.
Ch. 15 · Vocab
Capacity
The maximum sustainable throughput a system can serve while meeting latency and error targets.
Synthetic traffic
Scripted, machine-generated requests; rarely matches the variance of real users.
Think time
The pause between user actions in a session; realistic variance is essential for accurate load modeling.
Cold start
The first traffic after launch, when caches are empty and pools unwarmed.
Ch. 15 · Quiz 1 / 4

Multiple choice

A site's load tests ran for two hours with scripted browsing patterns, zero think-time variance, and a few hundred virtual users, and all targets were green. On launch day, real users blow out the heap within an hour. What is the most likely gap?

Ch. 15 · Quiz 2 / 4

Spot the issue

A retailer plans a brand relaunch as a midnight cutover: DNS flips at 00:00, a marketing email blast goes out at 00:05, and TV ads run at 00:30. Caches are cold, JVMs just started, and connection pools are empty when the first real users arrive. What is the core risk?

Ch. 15 · Quiz 3 / 4

Multiple choice

The postmortem for a failed launch concludes that no single component "broke," but architectural choices, marketing timing, organizational handoffs, and operational gaps combined to produce the outage. Which framing best captures this?

Ch. 15 · Quiz 4 / 4

True / False

If a system passes a comprehensive QA suite covering all functional and integration requirements, it can be considered production-ready for launch.

Ch. 16

Adaptation

Production-ready is not a fixed state. Every system that survives must continuously adapt to new business needs, scale, and surrounding context, or it dies of accumulated rigidity. Adaptation spans three layers — process and organization, system architecture, and information architecture — and long-lived systems are designed for optionality and change, not for a finished state.

Ch. 16

Convex Returns and Optionality

Borrowed from Taleb: prefer designs where the upside is large and unbounded but the downside is capped. Small reversible experiments compound; big irreversible bets crush teams when they fail. Adaptable architectures preserve future optionality.

Ch. 16

Conway's Law

"Team assignments are the first draft of the architecture." Org structure shapes the system, and changing one forces changes in the other. Design teams — autonomy plus alignment, two-pizza-sized — so the architecture you want is the architecture the org naturally produces. The org chart is the architecture diagram.

Ch. 16

Evolutionary Architecture (No End State)

There is no "finished" architecture. Build for incremental migration: loose coupling, service boundaries, no shared databases, the ability to replace any piece without a big-bang rewrite. Plan for the architecture to change without planning the change.

Ch. 16

Dependency Management

Treat every dependency — libraries, services, schemas — as a liability with a version, a contract, and a deprecation path. Semantic versioning, tolerant readers, and consumer-driven contracts let producers and consumers evolve at different rates. A pinned dependency is a paid debt; an unpinned one is a surprise.

Ch. 16

Releases Without Downtime

Decouple deploy from release. Ship dark, switch behavior with feature toggles, use canary and blue-green rollouts, roll forward not back. Schema changes use expand/contract. Continuous deployment is a chain of small reversible steps.

Ch. 16

Learning from Production

Production is the only environment that tells the truth. Run blameless postmortems that focus on systemic and contributing factors, not individuals. Capture corrective actions; feed learnings back into design and runbooks. Each incident is a free education paid for in pain.

Ch. 16

System-Wide Changes

Changes crossing service or team boundaries need explicit orchestration: backward-compatible rollouts, coordinated migrations, and the discipline that no service ever requires a simultaneous deploy with another. Lockstep coordination is a smell.

Ch. 16 · Vocab
Conway's Law
Organizations design systems that mirror their communication structure.
Two-pizza team
A team small enough to be fed by two pizzas; small enough to retain autonomy and ownership.
Autonomy and alignment
Teams choose how to ship (autonomy) within shared goals and platform constraints (alignment).
Feature toggle
A runtime switch decoupling code deployment from feature release.
Ch. 16 · Vocab
Canary deployment
Rolling a new version to a small slice of traffic first, then expanding if health holds.
Blue/green deployment
Running two production environments and shifting traffic atomically.
Expand/contract
A schema/API evolution pattern: expand additively, migrate, then contract.
Tolerant reader
A consumer that ignores unknown and missing optional fields.
Ch. 16 · Vocab
Blameless postmortem
A post-incident review focusing on systemic causes, not individuals.
Antifragile
A system that gains from disorder, strengthening from incidents rather than merely surviving them.
Ch. 16 · Quiz 1 / 4

Multiple choice

A leadership team assembles three large cross-functional teams to build a target architecture. Two years in, the system has three sprawling shared databases, every release requires coordinating all three teams, and architectural seams keep blurring. Which concept best diagnoses this?

Ch. 16 · Quiz 2 / 4

Spot the issue

After a Sev-1 outage, leadership runs a postmortem that opens with "who broke production?" and ends with a write-up naming the on-call engineer and recommending more careful review. What is wrong from a Release It! adaptation perspective?

Ch. 16 · Quiz 3 / 4

Multiple choice

A platform team treats their current architecture as "done" and resists any migration, arguing that the design is finished and any change is failure. Which concept does this thinking violate?

Ch. 16 · Quiz 4 / 4

Multiple choice

A team finds that every release of service A requires service B to deploy the same code at the same moment, or B errors out. According to the chapter, what is this a sign of?

Ch. 17

Chaos Engineering

Complex distributed systems fail in ways no one can fully predict from design. The only honest way to know your system is resilient is to break it on purpose, in production, under controlled conditions. Nygard traces the discipline from Netflix's Chaos Monkey through the Simian Army, lays out prerequisites, and frames game days as the practice that converts a fragile org into an antifragile one.

Ch. 17

Breaking Things to Make Them Better

Theoretical resilience claims are worthless until tested. Inject real failures so the system's actual recovery behavior — not its design diagrams — is what you trust. Untested resilience is not resilience.

Ch. 17

The Simian Army

Netflix's family of failure-injection agents: Chaos Monkey kills instances; Latency Monkey injects network delays; Conformity Monkey shuts down instances that do not conform to best practices; Janitor Monkey cleans up unused resources; Chaos Gorilla takes out an entire Availability Zone; Chaos Kong takes out an entire Region. Each tool targets a different failure mode or scale of failure.
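
To make the instance-killing idea concrete, here is a small sketch of a victim selector in the spirit of Chaos Monkey. It is not Netflix's implementation: the function names, the Monday-to-Friday 09:00-17:00 window, and the plain list of instance IDs are all assumptions, and actually terminating the chosen instance is left out.

```python
import random
from datetime import datetime
from typing import Optional, Sequence


def in_business_hours(now: datetime, start_hour: int = 9, end_hour: int = 17) -> bool:
    """True on weekdays between start_hour (inclusive) and end_hour (exclusive)."""
    # weekday() returns 0-4 for Monday through Friday.
    return now.weekday() < 5 and start_hour <= now.hour < end_hour


def pick_victim(
    instances: Sequence[str], now: datetime, rng: random.Random
) -> Optional[str]:
    """Choose one instance to terminate, or None if no kill should happen."""
    # Only act when engineers are around to observe and respond.
    if not instances or not in_business_hours(now):
        return None
    return rng.choice(list(instances))
```

Passing the clock and the random source as parameters keeps the selector deterministic under test, which matters for a tool whose job is to cause outages.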

Ch. 17

Prerequisites for Chaos

Do not break what you cannot observe. Required before injecting failure: production observability (metrics, logs, traces), automated recovery (autoscaling, health checks, circuit breakers), a defined steady-state hypothesis, an abort switch, and leadership consent. Chaos without observability is just damage.
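
The prerequisite list above can be treated as a gate that runs before any experiment. The sketch below is one hypothetical way to encode it; the `Readiness` type and its field names are illustrative, and how each flag gets its value is outside the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Readiness:
    observability: bool         # metrics, logs, traces in production
    automated_recovery: bool    # autoscaling, health checks, circuit breakers
    steady_state_defined: bool  # a measurable definition of normal exists
    abort_switch: bool          # the experiment can be stopped immediately
    leadership_consent: bool    # the experiment is sanctioned


def missing_prerequisites(r: Readiness) -> list[str]:
    """Return the names of every prerequisite that is not yet met."""
    checks = {
        "production observability": r.observability,
        "automated recovery": r.automated_recovery,
        "steady-state hypothesis": r.steady_state_defined,
        "abort switch": r.abort_switch,
        "leadership consent": r.leadership_consent,
    }
    return [name for name, ok in checks.items() if not ok]


def may_run_experiment(r: Readiness) -> bool:
    # Every prerequisite must hold; a single gap blocks the experiment.
    return not missing_prerequisites(r)
```

Returning the list of gaps, and not only a yes or no, gives the team a concrete to-do list before the experiment is rescheduled.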

Ch. 17

Blast Radius

The maximum scope an experiment can damage. Start with the smallest possible radius (one host, one canary, one cohort), verify that the safety mechanisms work, then expand. Run one experiment at a time so that any deviation can be attributed to a single cause. Bound the experiment before you run it.
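
A rough sketch of staged expansion, assuming the radius is measured in hosts. The doubling schedule and the 25% ceiling (`max_fraction`) are arbitrary illustrative choices, not figures from the book.

```python
def next_radius(
    current: int,
    fleet_size: int,
    safety_checks_passed: bool,
    max_fraction: float = 0.25,
) -> int:
    """Return how many hosts the next round may target (0 means stop)."""
    if fleet_size <= 0 or not safety_checks_passed:
        return 0  # never expand after a failed safety check
    # Hard ceiling on the radius, but always allow at least one host.
    cap = max(1, int(fleet_size * max_fraction))
    if current <= 0:
        return 1  # first round: the smallest possible radius
    return min(current * 2, cap)
```

The important property is that the ceiling is fixed before the first round runs, so enthusiasm during the exercise cannot widen the scope.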

Ch. 17

Steady-State Hypothesis

A measurable definition of normal — e.g., "checkout success rate above 99.5%" — that the experiment must preserve. Deviation from steady state is the abort signal. Define success before you run the experiment.
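
The abort decision can be written as a plain comparison against the pre-declared threshold. This sketch uses the checkout success rate from the example above; the function name and the choice to abort when there is no traffic at all are assumptions.

```python
def should_abort(successes: int, total: int, min_success_rate: float = 0.995) -> bool:
    """Return True when the observed success rate violates steady state."""
    if total <= 0:
        # No traffic means steady state cannot be verified, so stop
        # (an assumption made for this sketch).
        return True
    return successes / total < min_success_rate
```

For example, `should_abort(970, 1000)` returns `True` because 97% is below the 99.5% threshold, while `should_abort(999, 1000)` returns `False`.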

Ch. 17

Game Days

Scheduled, scoped exercises where teams intentionally induce a failure — kill an AZ, sever a dependency, exhaust a pool — and exercise the human and technical response together. Game days surface runbook gaps and on-call weaknesses no automated test catches. Practice the failure when it does not cost you.

Ch. 17

The Antifragile Organization

Antifragility is an organizational property, not just a system property. Teams that practice failure, share blameless learnings, and route insight back into design get stronger with each incident. Fragile orgs hide incidents; antifragile orgs schedule them.

Ch. 17 · Vocab
Chaos engineering
Experimenting on a distributed system to build confidence in its ability to withstand turbulent conditions.
Chaos Monkey
Netflix tool that randomly terminates production instances during business hours.
Simian Army
Netflix's broader failure-injection toolkit covering different failure modes and blast radii.
Game day
A planned exercise deliberately failing part of production to test recovery and human response.
Ch. 17 · Vocab
Blast radius
The bounded scope of impact of a chaos experiment.
Steady-state hypothesis
The pre-declared, measurable definition of normal used as success criterion and abort trigger.
Fault injection
Deliberately introducing faults (latency, errors, kills) into a running system.
Antifragility
A property where systems and orgs gain from disorder rather than merely tolerate it.
Ch. 17 · Vocab
Cattle, not pets
A philosophy where instances are interchangeable rather than uniquely nursed.
Resilience vs. robustness
Robustness resists failure; resilience recovers from it.
Ch. 17 · Quiz 1 / 4

Multiple choice

A team's design diagrams claim the service "tolerates the loss of any single instance" because it runs three replicas behind a load balancer. The team has never killed an instance in production. From a chaos-engineering perspective, what is true?

Ch. 17 · Quiz 2 / 4

Spot the issue

A team excited about chaos engineering schedules an experiment for next Friday: at 14:00 they will simultaneously kill one entire Availability Zone, inject 5s latency across all internal RPCs, and drop 30% of database connections. They have no defined steady-state metric, no abort criteria, and no consolidated dashboard. What is wrong?

Ch. 17 · Quiz 3 / 4

Multiple choice

Before running a chaos experiment, an SRE team writes down "checkout success rate stays above 99.5% and p99 latency stays under 800ms for the duration." During the experiment, success rate drops to 97%. What role does that pre-declared metric play?

Ch. 17 · Quiz 4 / 4

True / False

Antifragility is purely a property of the technical system; the org that operates it cannot be made antifragile.

Key Takeaways

01

Software has no value until it runs in production, and production exposes failure modes that no test environment can reproduce.

02

Faults are inevitable, but failures are designable — defend integration points, cap blast radius, and never assume a remote call returns.

03

Deployability is a first-class design property, not an ops afterthought; immutable artifacts and zero-downtime patterns belong in the architecture.

04

Postel's Law plus expand/contract migrations let you evolve schemas and APIs without coordinated lockstep releases.

05

Conway's Law makes org structure load-bearing; teams designed for autonomy produce architectures designed for change.

06

Antifragility is earned by breaking things on purpose under bounded blast radius — chaos engineering converts hope into evidence.