Redr · Study Guide
Designing Data-Intensive Applications
The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
Martin Kleppmann
Unofficial AI-assisted study guide. Not affiliated with or endorsed by the author or publisher. For educational use — supplements, not replaces, the original work.
Contents
- 01Reliable, Scalable, and Maintainable Applications
- 02Data Models and Query Languages
- 03Storage and Retrieval
- 04Encoding and Evolution
- 05Replication
- 06Partitioning
- 07Transactions
- 08The Trouble with Distributed Systems
- 09Consistency and Consensus
- 10Batch Processing
- 11Stream Processing
- 12The Future of Data Systems
- Part 01 · Foundations of Data Systems01Reliable, Scalable, and Maintainable Applications02Data Models and Query Languages03Storage and Retrieval04Encoding and Evolution
- Part 02 · Distributed Data05Replication06Partitioning07Transactions08The Trouble with Distributed Systems09Consistency and Consensus
- Part 03 · Derived Data10Batch Processing11Stream Processing12The Future of Data Systems
Part 01
Foundations of Data Systems
Ch. 1–4
Reliable, Scalable, and Maintainable Applications
Introduces the three core concerns of data-intensive applications: reliability (working under adversity), scalability (coping with growth), and maintainability (productivity over time). Uses Twitter's home timeline as a recurring case study for read/write tradeoffs.
Reliability
A system continues performing its required function correctly even in the face of hardware faults, software bugs, and human error. Reliable systems are fault-tolerant — they anticipate faults and design to keep functioning.
Faults vs. Failures
A fault is one component deviating from its spec; a failure is the system as a whole stopping service. Fault tolerance is about preventing faults from cascading into failures.
Scalability Is Not a Label
Don't ask "is X scalable?" — ask "if load grows in *this particular way*, what are our options?" First describe load with load parameters (RPS, read/write ratio, cache hit rate, fan-out).
Percentiles, Not Averages
Response times follow a long-tail distribution. Track p50, p95, p99, p99.9 — tail latencies disproportionately affect your most valuable users (those with the most data and activity).
Scale Up vs. Scale Out
Vertical scaling uses a more powerful machine; horizontal scaling spreads load across many smaller ones. Real architectures mix both. Stateful systems are far harder to scale out than stateless ones.
Maintainability: Operability, Simplicity, Evolvability
Most cost is in ongoing maintenance, not initial build. Design for operations teams (good monitoring), future engineers (reduce accidental complexity), and changing requirements (evolvability).
Abstraction Beats Accidental Complexity
Essential complexity is inherent in the problem; accidental complexity is everything else. Good abstractions hide implementation details and let engineers reason about the problem rather than fight the tools.
- Fault Tolerance
- Anticipating faults and designing the system so failures do not occur even when components fail.
- Throughput
- Records processed per second, or total time for a job over a dataset of given size.
- Response Time
- The full duration a client waits, including network and queueing — distinct from pure server-side latency.
- Tail Latency
- The high percentiles (p99, p99.9) representing the slowest requests; often the ones felt most.
- Fan-out
- Number of downstream requests triggered per incoming request (e.g., a tweet to all followers).
- SLA / SLO
- Service Level Agreement/Objective — contractual targets for availability and performance.
- Operability
- Making it easy for operators to keep the system running well via monitoring and predictable behavior.
- Evolvability
- The ease of adapting the system to unanticipated future requirements.
Multiple choice
Your dashboard shows the average response time hovering at a healthy 80 ms, yet your top customers keep complaining the app feels slow. Which metric is most likely to reveal what they are experiencing?
True / False
A fault and a failure mean the same thing in Kleppmann's terminology.
Spot the issue
A team proposes: "We'll make the service scalable by buying a bigger box whenever load grows." What's the main weakness of this plan as a long-term scalability story?
Multiple choice
Which of the following is not one of the three pillars of maintainability Kleppmann identifies?
Data Models and Query Languages
Surveys the relational, document, and graph data models. Argues that the choice of data model is among the most important decisions in software design because it shapes how we think about the problem itself.
Relational Model
Proposed by Edgar Codd in 1970. Data lives in relations (tables) of tuples (rows). It won out by cleanly separating logical data from physical access paths — the query planner decides *how*.
Document Model
Self-contained JSON-like documents with nested one-to-many relationships. Good locality (one fetch returns the whole tree) but weaker at joins and many-to-many relationships.
Object-Relational Impedance Mismatch
In-memory objects have nesting and references; relational tables are flat. The translation layer (ORMs) reduces but never eliminates the friction — this motivated the document model.
Schema-on-Read vs. Schema-on-Write
Schema-on-write (relational): structure enforced at insert — like static typing. Schema-on-read (document): structure interpreted at query — like dynamic typing. The latter shines for heterogeneous data.
Declarative vs. Imperative Queries
SQL is declarative (what, not how) — the engine optimizes and parallelizes for free. Imperative code (MapReduce loops) locks in an execution strategy and is harder to optimize automatically.
Graph Data Models
When many-to-many relationships dominate — social graphs, web graphs, road networks — graphs (vertices + edges) are more natural than tables or trees. Queries traverse paths instead of joining.
Property Graphs vs. Triple-Stores
Property graphs (Neo4j): rich attributes on vertices and edges. Triple-stores (RDF): everything is a (subject, predicate, object) statement. SPARQL and Cypher are the respective query languages.
- Normalization
- Storing each fact once with references; reduces duplication, increases joins.
- Denormalization
- Duplicating data to speed reads, at the cost of keeping copies in sync.
- Foreign Key
- A field referencing another table's primary key — the join glue of relational schemas.
- MapReduce
- An imperative programming model: map each record, then reduce by key.
- Cypher
- Neo4j's declarative graph query language — pattern syntax like `(a)-[:KNOWS]->(b)`.
- SPARQL
- Query language for RDF triple-stores in the Semantic Web ecosystem.
- Datalog
- Older logic-based declarative language using facts and rules; basis for Datomic's queries.
- Locality
- Storage layout that keeps related data physically close, so one fetch returns it all.
Multiple choice
What is the central advantage of declarative query languages like SQL over imperative code that loops through records?
Spot the issue
A startup picks a document database because their domain has lots of one-to-many nesting and they like schema-on-read flexibility. Six months in, they need to render a social graph of "friends-of-friends-of-friends." What problem will they hit?
Multiple choice
You are storing data with widely varying optional fields per record — every row has a different mix of attributes. Which model is the most natural fit?
True / False
The property-graph model and the RDF triple-store model are essentially identical, just with different syntax.
Storage and Retrieval
Explores how databases store data on disk and find it again. Two families dominate: **log-structured** (LSM-trees, Cassandra, LevelDB) and **page-oriented** (B-trees, traditional RDBMS). Then contrasts OLTP with OLAP and column-oriented warehouses.
Log-Structured Storage
Append-only files: writes always go to the end. Sequential writes are dramatically faster than random ones on both HDDs and SSDs. Old data is reclaimed by periodic background compaction.
Hash Indexes
An in-memory hash map from key to file offset. Blazing-fast point lookups, but all keys must fit in RAM and range queries are impossible — you'd have to scan everything.
SSTables and LSM-Trees
SSTables store keys in sorted order on disk. LSM-trees buffer writes in an in-memory memtable, flush to sorted SSTables, and merge them in the background. Excellent write throughput.
B-Trees
The dominant index structure since the 1970s. Fixed-size pages (typically 4 KB) arranged in a balanced tree, updated in place. Requires a write-ahead log to survive crashes mid-write.
LSM vs. B-Tree Tradeoff
LSM: faster writes (sequential), slower reads (check many SSTables, use Bloom filters). B-tree: faster reads, but write amplification from WAL + page rewrites hurts SSD lifetime.
OLTP vs. OLAP
OLTP: many small, latency-sensitive operations for end users. OLAP: huge scans aggregating millions of rows for analysts. Same data, radically different access patterns — hence the data warehouse.
Column-Oriented Storage
For analytics, store each column's values together rather than each row's. Queries read only needed columns, and similar values cluster — enabling huge compression wins. Foundation of modern warehouses.
- Compaction
- Merging segment files to drop duplicates/tombstones and reclaim space.
- Memtable
- In-memory sorted buffer (red-black tree, skip list) holding recent writes before flush.
- Write-Ahead Log (WAL)
- Append-only log of changes written before applying them, for crash recovery.
- Write Amplification
- One logical write causing several physical writes — matters for SSD wear.
- Bloom Filter
- Probabilistic structure that quickly says "definitely not present" — skips disk reads.
- Clustered Index
- Row data stored inside the index itself, avoiding a separate heap lookup.
- Data Warehouse
- Read-optimized analytics database fed by extracts from OLTP systems.
- Star Schema
- A central fact table surrounded by dimension tables — the dominant warehouse pattern.
Multiple choice
An LSM-tree based engine and a B-tree based engine sustain the same workload. Which statement most accurately captures their trade-off?
Spot the issue
A team builds an analytics dashboard on top of their row-oriented OLTP database. Each report scans 200 million rows but uses only 4 of the 50 columns. The reports are unbearably slow. What's the core issue?
Multiple choice
What is the role of a Bloom filter in an LSM-tree?
True / False
B-trees on disk use a write-ahead log (WAL) primarily so they can compress pages more aggressively.
Encoding and Evolution
Examines how data is serialized for storage and transmission, with a sharp focus on how to evolve formats without breaking running systems. Backward and forward compatibility are mandatory for rolling upgrades and independent service deployment.
Encoding and Decoding
Converting in-memory objects into byte sequences for disk or wire, and back. Also called serialization, marshalling, or pickling — Kleppmann prefers "encoding" to avoid clashing with database transactions.
Avoid Language-Specific Formats
Java `Serializable`, Python `pickle`, Ruby Marshal: convenient but lock you into one language, often pose security risks (arbitrary code on deserialize), and lack versioning.
JSON, XML, CSV: Tradeoffs
Human-readable and universal, but ambiguous: JSON can't distinguish ints from floats, no native binary, schema-less by default. Verbose — size matters when you're sending billions of messages.
Schema-Based Binary Encodings
Thrift, Protocol Buffers, Avro: schemas yield compact byte layouts, type safety, and structured evolution. Numeric field tags (Thrift/Protobuf) or schema resolution (Avro) keep old and new code compatible.
Backward and Forward Compatibility
Backward: new code reads old data. Forward: old code reads new data. Both are required for rolling upgrades, where old and new versions of your service run simultaneously during deployment.
Data Outlives Code
A row written by code five years ago may be read by today's code. When new code reads-then-rewrites old data, it must preserve unknown fields — or it silently destroys data its predecessors wrote.
Three Styles of Service Communication
REST/SOAP: synchronous HTTP, resource-oriented. RPC (gRPC, Thrift): make remote calls look local — but they aren't. Async messaging (Kafka, RabbitMQ, actors): decouples sender from receiver in time and identity.
- Schema
- A formal description of field names, types, and optionality used by an encoding system.
- Protocol Buffers
- Google's binary format using numbered field tags; widely used in microservices.
- Apache Avro
- Binary format with reader/writer schema resolution and no tag numbers — very compact.
- REST
- API style built on HTTP verbs and URLs identifying resources.
- RPC
- Remote Procedure Call — making a network call masquerade as a local function call.
- Message Broker
- Async intermediary (Kafka, RabbitMQ) delivering messages from producers to consumers.
- Actor Model
- Concurrency model where isolated actors communicate only by async messages (Erlang, Akka).
- Rolling Upgrade
- Deploying a new version one node at a time while old and new versions coexist.
Multiple choice
During a rolling upgrade, new code is reading rows written by the previous version, and old code is briefly reading data written by the new version. Which compatibility properties are required?
Spot the issue
A microservices team standardizes on Java `Serializable` for inter-service messages because "every service is in Java anyway." Why does Kleppmann argue this is a bad choice even setting language aside?
Multiple choice
Which encoding format uses reader and writer schemas resolved at read time, with no field tag numbers in the encoded bytes?
Spot the issue
A service reads a record, modifies one field, and writes it back. The schema gained a new field last week that this old service doesn't know about. After the round-trip, what likely happens?
Part 02
Distributed Data
Ch. 5–9
Replication
Keeping copies of the same data on multiple machines — for latency, availability, and read throughput. Examines single-leader, multi-leader, and leaderless approaches, and the deep challenges of replication lag and conflict resolution.
Single-Leader Replication
One replica is the leader — all writes go to it. The leader streams a replication log to followers, which serve read-only queries. Simple, but the leader is a write bottleneck and SPOF.
Multi-Leader Replication
Several nodes accept writes and replicate to each other — useful for multi-datacenter or offline clients. The catch: concurrent writes to the same record on different leaders create conflicts that must be resolved.
Leaderless Replication (Dynamo-Style)
Clients write to several replicas in parallel and read from several. Quorums with `w + r > n` guarantee overlap. Used by Cassandra, Riak, DynamoDB. No leader to fail over, but conflict-handling logic moves to the client.
Synchronous vs. Asynchronous Replication
Synchronous: leader waits for follower ack — durable, but blocks on slow followers. Asynchronous: fast, but a leader crash can lose recent writes. Most production setups use semi-synchronous (one sync follower).
Replication Lag and Read-After-Write
Async followers can be stale. Read-your-writes consistency ensures a user sees their own update right after submitting it — usually by routing their next read to the leader for a window of time.
Conflict Resolution
Options: last-write-wins with timestamps (lossy — clock skew bites you), version vectors tracking causal history, or application-level merge functions / CRDTs that combine concurrent values deterministically.
Failover and Split Brain
When the leader dies, a follower is promoted. Done badly, two leaders end up accepting writes simultaneously — split brain — and the data corruption that follows is brutal to recover from.
Read Repair & Anti-Entropy
In leaderless systems, read repair fixes stale replicas during normal reads (when divergence is detected). Anti-entropy is a background process that compares and reconciles replicas independent of reads.
- Replica
- A node storing a full copy of the database.
- Replication Log
- An ordered stream of writes followers apply to stay in sync with the leader.
- Quorum
- Minimum nodes needed for an op to count — `w + r > n` ensures overlap.
- Eventual Consistency
- Given no new writes, all replicas converge — eventually.
- Monotonic Reads
- A user never sees time go backward when reading from different replicas.
- Version Vector
- Structure tracking which writes each replica has seen; detects concurrency.
- Sloppy Quorum
- Writes go to any reachable nodes when home nodes are down; later **hinted handoff** reconciles.
- Split Brain
- Two nodes both believing they are leader, both accepting writes — a corruption disaster.
Spot the issue
A team picks single-leader replication with the leader pinned to us-east-1 to serve users on three continents. European users complain about write latency. What's the core mismatch?
Multiple choice
In Dynamo-style leaderless replication with `n=5`, which configuration fails to guarantee that every read overlaps with the most recent write?
True / False
Last-write-wins conflict resolution using wall-clock timestamps is a safe default because NTP keeps clocks perfectly in sync.
Multiple choice
A user posts a comment, then immediately refreshes the page and the comment is missing. The system uses async replication and the read was served by a follower. Which consistency property would have prevented this confusing experience?
Spot the issue
After a leader crashes, automatic failover promotes a follower. The old leader recovers, can still reach some clients, and keeps accepting writes for a few seconds. What's the technical name for this disaster, and what's a standard mitigation?
Partitioning
Splitting a dataset (a.k.a. **sharding**) across many nodes so each piece lives on exactly one partition, enabling horizontal scaling. Covers key-range vs. hash partitioning, secondary indexes, rebalancing, and request routing.
Partitioning by Key Range
Continuous ranges of keys go to each partition — like volumes of a paper encyclopedia. Range scans are efficient, but ranges can become hot spots if access is skewed (e.g., today's timestamps).
Partitioning by Hash of Key
A hash function spreads keys uniformly across partitions. Solves skew, but you lose range-scan locality — a `WHERE id BETWEEN x AND y` now hits every partition.
Skewed Workloads and Hot Keys
Even hashing can't fix a celebrity problem — one key receiving extreme load. Mitigation is application-level: append a random suffix to spread writes across N pseudo-keys, then merge on read.
Document-Partitioned Secondary Indexes
Each partition keeps its own local index of its own documents. Writes are cheap (one partition), but queries must scatter to every partition and gather the results — expensive at scale.
Term-Partitioned Secondary Indexes
A global index is itself partitioned by the indexed term. Reads hit one partition. But every write may touch multiple partitions of the index, so updates are slower and often asynchronous.
Rebalancing Strategies
When nodes are added or removed, partitions must move. Strategies: fixed number of partitions (simplest), dynamic partitioning (split when too big), or partitions proportional to nodes. All aim to minimize data movement.
Request Routing
A client needs to know which node owns key K. Options: any node forwards as needed, a routing tier sits in front, or clients are partition-aware. Membership is coordinated by services like ZooKeeper.
- Partition (Shard)
- An independent subset of the data; each record lives in exactly one partition.
- Hot Spot
- A partition receiving disproportionate load — defeats the purpose of sharding.
- Consistent Hashing
- Mapping keys and nodes onto a ring so only nearby keys move on membership change.
- Rebalancing
- Moving partitions across nodes to maintain a fair load distribution.
- Local Index
- A secondary index covering only the partition it lives on (document-partitioned).
- Global Index
- A secondary index partitioned by indexed term, independent of primary sharding.
- Scatter/Gather
- Query pattern fanning out to all partitions and merging results.
- ZooKeeper
- Coordination service often used to track partition membership and elect leaders.
Spot the issue
A time-series system shards on a key range over the timestamp column. As soon as it launches, one partition is saturated while the others sit idle. What's wrong?
Multiple choice
Your secondary index is document-partitioned (a local index per partition). What is the cost profile?
True / False
A good hash function eliminates all hot-spot problems, including those caused by celebrity keys.
Multiple choice
You add three nodes to a 12-node cluster using consistent hashing. Roughly what fraction of keys are remapped?
Transactions
Transactions bundle reads and writes into all-or-nothing units, taming error handling. The chapter dissects ACID, the surprising variety of **isolation levels**, and exactly which concurrency anomalies each one prevents.
ACID, Examined
Atomicity (all or nothing), Consistency (app-level invariants — not really a DB property), Isolation (concurrent txns don't interfere), Durability (committed data survives crashes). The meaning varies wildly between databases.
Read Committed
The default in many systems. Prevents dirty reads (no uncommitted writes visible) and dirty writes (no overwriting uncommitted data). Cheap to implement, but allows non-repeatable reads and read skew.
Snapshot Isolation (a.k.a. Repeatable Read)
Each transaction sees a consistent snapshot taken at its start. Implemented with MVCC (multi-version concurrency control). Prevents read skew — great for long-running read-only transactions and backups.
Lost Updates
Two transactions both do read-modify-write on the same row; one update vanishes. Fixes: atomic operations (`UPDATE ... SET x=x+1`), explicit locking (`SELECT FOR UPDATE`), automatic detection, or compare-and-set.
Write Skew & Phantoms
Two transactions read overlapping data, then each writes a *different* object based on what they read — together they break an invariant. Phantoms are rows that don't exist at read time but appear before write. Snapshot isolation does NOT prevent this.
Serializable Isolation
The gold standard: transactions appear to run one at a time. Three real implementations: actual serial execution (single-threaded, e.g., VoltDB), two-phase locking, or serializable snapshot isolation (SSI).
Two-Phase Locking (2PL)
Shared locks for reads, exclusive for writes, all held until commit. Strict serializability, but performance is poor, deadlocks are common, and tail latency is unpredictable. The classical solution — rarely the right modern one.
Serializable Snapshot Isolation (SSI)
Optimistic: run on a snapshot, detect at commit whether any read became stale, and abort if so. Strong guarantees with much better concurrency than 2PL. Used by PostgreSQL's SERIALIZABLE and FoundationDB.
- Atomicity
- All of a transaction's writes apply, or none do.
- Dirty Read
- Reading data another transaction has written but not committed.
- Dirty Write
- Overwriting data another transaction has written but not committed.
- MVCC
- Multi-Version Concurrency Control — readers see snapshots without blocking writers.
- Phantom
- A row that appears (or vanishes) when a query is re-run within a transaction.
- Compare-and-Set
- Atomic op updating only if the prior value still matches expectation.
- Predicate Lock
- A lock on all objects matching a search condition, including future ones.
- Index-Range Lock
- Coarser approximation of a predicate lock — locks an index range.
Spot the issue
Two on-call doctors each have an exclusive-or schedule rule: "at least one must always be available." Both are currently on call. They simultaneously open the app and request to go off-call. The database uses snapshot isolation. Each transaction reads "the other is on call" and proceeds to mark itself off. Result: nobody is on call. What's the name for this anomaly?
Multiple choice
Which technique does PostgreSQL's `SERIALIZABLE` mode use?
Multiple choice
Two clients both run `balance = SELECT balance; UPDATE ... SET balance = balance + 100;`. Each reads $500, each writes $600. Final balance is $600, not $700. Which fix does not correct this lost-update problem?
True / False
The Consistency in ACID is a property the database enforces on your behalf.
The Trouble with Distributed Systems
A grim catalog of all the ways distributed systems go wrong: partial failures, dropped packets, clock skew, process pauses. The lesson: assume any component might be slow, unreliable, or lying — and handle it explicitly at the algorithm level.
Partial Failures
A single computer either works or is clearly broken. In a distributed system, some parts work while others fail — nondeterministically. You often have *no way to know* whether the remote node received your request.
Unreliable Networks
Packets can be lost, delayed, duplicated, or reordered. A timeout never tells you whether the request, the response, or the entire node was lost. This ambiguity is the source of most distributed-system bugs.
Variable Latency from Queueing
Network delay is dominated by queueing — at switches, in TCP stacks, in OS schedulers. There is no theoretical maximum delay. Picking a timeout is always a trade-off between false positives and slow failure detection.
Unreliable Clocks
Time-of-day clocks drift, get jumped backward by NTP, and disagree across machines. Monotonic clocks only go forward and are great for measuring durations on one machine — but are meaningless across machines.
Process Pauses
GC, VM live migration, OS scheduling, paging — any of these can pause your process for seconds or minutes without warning. Long after a node's lease expired, it may still believe it's the leader. The wall clock doesn't care.
The Truth Is Defined by the Majority
A node cannot trust its own perception. The system relies on a quorum of nodes agreeing. Even a node loudly claiming to be the leader may have been deposed long ago — the network just hasn't told it yet.
Byzantine vs. Crash-Stop Faults
Most internal systems assume crash-stop or crash-recovery (nodes fail by stopping). Byzantine faults — where nodes lie or behave maliciously — require much more expensive algorithms and are usually only needed in adversarial settings.
Fencing Tokens
Each lock acquisition hands out a monotonically increasing token. The resource being protected rejects any write carrying an older token. Even a paused, stale leader can no longer corrupt data.
- Partial Failure
- Some parts fail while others keep working, nondeterministically.
- Network Partition
- Network splits into groups that cannot reach each other (netsplit).
- Timeout
- Time limit after which a remote op is assumed failed — no other way to tell.
- Monotonic Clock
- Clock that only advances; for measuring intervals on one machine.
- Time-of-Day Clock
- Wall-clock time, NTP-synchronized — subject to jumps and skew.
- Clock Skew
- Difference between two clocks at the same real instant.
- Lease
- A time-bounded lock; authority expires unless renewed.
- Fencing Token
- Increasing token from a lock service; old tokens get rejected by the resource.
Spot the issue
A distributed lock service hands out leases. A node acquires the lock, then GC-pauses for 90 seconds. Its lease expires; another node acquires the lock and starts writing. Then the original node wakes up and also writes — *still believing it holds the lock*. What mechanism prevents this corruption?
Multiple choice
A network request times out. What can you conclude?
True / False
A monotonic clock can be safely used to compare timestamps across machines in a cluster.
Multiple choice
Why is the assumption of crash-stop / crash-recovery failures, rather than Byzantine failures, usually acceptable for an internal company datacenter?
Consistency and Consensus
What guarantees *can* algorithms provide despite faults? Introduces linearizability, causality, and consensus — showing they're reducible to one another, and underlie everything from leader election to atomic commits.
Linearizability
The illusion that there's one copy of the data and every operation is instantaneous in real-time order. The strongest single-object guarantee — and the most expensive. Generally incompatible with availability during network partitions.
The CAP Theorem (in its proper form)
During a network partition, a system must give up either linearizable consistency or availability. Note: CAP is much narrower than popularly described — it ignores latency, doesn't apply when no partition exists, and isn't a 3-way trade.
Causality and Causal Consistency
A defines a partial order: A happened-before B if B could have known about A. Causal consistency preserves this order without enforcing one global order — weaker than linearizability, stronger than eventual consistency, often the sweet spot.
Total Order Broadcast
A protocol delivering messages to all nodes in the same order, with no loss. Equivalent in difficulty to consensus — together they form the basis of replicated state machines.
Distributed Consensus
Getting nodes to agree on one value — the next leader, the next log entry. Paxos, Raft, Zab, VSR all solve it. All require a majority of nodes to make progress; a minority cannot decide alone.
Two-Phase Commit (2PC)
A coordinator asks all participants to prepare; if all vote yes, it tells them to commit. Atomic across nodes — but blocking: if the coordinator dies after votes, participants are stuck holding locks. Notoriously brittle in production.
The FLP Impossibility Result
Fischer-Lynch-Paterson (1985): no deterministic consensus algorithm can guarantee termination in a fully asynchronous system with even one faulty process. Real algorithms cheat by assuming partial synchrony (timeouts).
Coordination Services (ZooKeeper, etcd)
Tiny, highly-available systems built on a consensus core, offering primitives the rest of your stack can lean on: leader election, distributed locks with fencing tokens, service discovery, config. Don't roll your own.
- Linearizability
- Recency guarantee: after a write completes, every subsequent read sees it (or newer).
- Serializability
- Transactions appear to run serially — orthogonal to linearizability.
- Strict Serializability
- Serializable + linearizable — the strongest possible correctness guarantee.
- Causal Consistency
- Preserves happened-before order without forcing total ordering.
- Total Order Broadcast
- All nodes deliver messages in the same order, none lost (atomic broadcast).
- Consensus
- Agreeing on one value despite faults; equivalent to 2PC, linearizable CAS, total order broadcast.
- Epoch Number
- Monotonically increasing leader-reign ID (Raft term, Paxos ballot). Old epochs are ignored.
- FLP Result
- No deterministic consensus algorithm terminates under pure async + one fault.
Multiple choice
Which set of problems is provably reducible to one another — meaning a solution to one yields a solution to the others?
Spot the issue
A team picks 2PC (two-phase commit) to coordinate writes across three microservices on the hot path of every user request. In production they see locks held forever after coordinator restarts. What's the structural problem with this choice?
True / False
The CAP theorem forces every distributed system to choose between consistency, availability, and partition tolerance at all times.
Multiple choice
What does the FLP impossibility result say, and how do real consensus systems work around it?
Part 03
Derived Data
Ch. 10–12
Batch Processing
Processes a large, bounded input to produce output over minutes to days. Starts with Unix pipelines, examines MapReduce and HDFS, and arrives at modern dataflow engines like Spark and Flink that improve on MapReduce's limits.
The Unix Philosophy
Small composable tools, each doing one thing well, chained via pipes through a uniform interface (text streams). Loose coupling enables powerful ad-hoc processing — the original distributed dataflow system.
The MapReduce Model
Two stages: map extracts key-value pairs from records; reduce aggregates values sharing a key. The framework handles partitioning, sorting, fault tolerance, and the shuffle between — an enormous abstraction win.
Distributed Filesystem (HDFS)
A shared-nothing storage layer spreading files across commodity machines with replication for fault tolerance. Stores raw data with no enforced schema — you interpret structure at read time ("sushi principle").
The Shuffle and Sort-Merge Join
The shuffle sends mapper output to the right reducer based on key. Sort-merge joins have mappers tag records with their source; reducers see all records for a key together and merge them.
Broadcast Hash Joins
When one side fits in memory, ship the whole small dataset to every mapper as a hash table. No shuffle needed. Far faster than sort-merge joins when applicable — the default for star-schema fact-to-dimension joins.
Fault Tolerance via Re-Execution
MapReduce materializes intermediate output to disk and re-runs failed tasks. Deterministic and recoverable — but the I/O between stages is the dominant cost, especially for multi-step jobs.
Dataflow Engines (Spark, Flink, Tez)
Treat the whole workflow as one DAG of operators rather than a chain of independent jobs. Keep intermediate data in memory, skip unneeded materialization, and recover by recomputing lost partitions from lineage.
Graph Processing (Pregel/BSP)
Iterative graph algorithms (PageRank, shortest path) don't fit MapReduce. The Pregel model runs in supersteps: each vertex receives messages, updates state, sends messages, then a synchronization barrier — repeat until convergence.
- Batch Processing
- Bounded-input job optimized for throughput, not latency.
- MapReduce
- Google's programming model: parallel map + reduce across a cluster.
- HDFS
- Hadoop Distributed File System — replicated commodity block storage.
- Shuffle
- The phase that moves mapper outputs across the network to the right reducer.
- Sort-Merge Join
- Sort both inputs by key, then merge in a single sequential pass.
- Broadcast Hash Join
- Ship a small side to every node as a hash table; no shuffle.
- Materialization
- Writing intermediate output to durable storage before the next stage reads it.
- Sushi Principle
- "Raw data is better" — store as-is, interpret at read time.
Multiple choice
A MapReduce job joins a 1 TB clickstream against a 50 MB user-profile dimension table. Which join strategy is dramatically faster?
Spot the issue
A team builds a 12-stage MapReduce pipeline. Each stage's output is written to HDFS before the next stage reads it back. The whole pipeline is slow even though every stage is simple. What did they miss that Spark/Flink/Tez would do differently?
True / False
Iterative graph algorithms like PageRank fit naturally into the classic two-stage MapReduce model.
Multiple choice
What is the shuffle in MapReduce?
Stream Processing
Handles **unbounded**, continuously arriving data with low latency. Covers messaging systems, change data capture, event sourcing, joins, windowing, and the path to exactly-once semantics. Blurs the line between databases and queues.
Events and Streams
An event is a small, immutable record of something that happened at a point in time. Producers emit, consumers process. Unlike batch, the dataset is conceptually infinite — the job never "finishes."
AMQP-Style vs. Log-Based Brokers
AMQP/JMS brokers (RabbitMQ) delete messages after acknowledgment — route-and-forget. Log-based brokers (Kafka) append events to a durable partitioned log; consumers track their own offsets and can replay history.
Change Data Capture (CDC)
Extract writes from a database's transaction log as a stream of events. Downstream search indexes, caches, and warehouses subscribe to stay in sync — eliminating the dual-write inconsistency problem.
Event Sourcing
Don't overwrite state — store an immutable log of events describing every change. Current state is derived by replaying. Provides full audit history and lets you build new derived views any time by reprocessing the log.
Stream Joins
Stream-stream: match events within a time window. Stream-table: enrich events from a slowly-changing reference table (often a local materialized copy). Table-table: keep a materialized join updated by changes on either side.
Windowing
Aggregations must happen over windows. Tumbling: fixed, non-overlapping (every minute). Hopping: overlap on a fixed cadence. Sliding: every event within a duration. Session: bursts of activity separated by gaps.
Event Time vs. Processing Time
Event time: when it actually happened. Processing time: when the stream processor saw it. Network delays, restarts, and out-of-order delivery push them apart — demanding watermarks and late-event policies.
Exactly-Once Semantics
Achieved (effectively) by combining: idempotent operations, atomic commit of output + consumer offset, and checkpointing (Flink-style distributed snapshots) so a failure replays from a consistent point with no duplicated effects.
- Stream
- Unbounded, incrementally available sequence of records (events).
- Producer / Consumer
- Producer publishes events to a stream; consumer subscribes and processes them.
- Kafka
- Distributed log-based broker storing events in partitioned, replicated logs.
- Consumer Offset
- Cursor tracking how far a consumer has progressed in a partition.
- Watermark
- Heuristic timestamp asserting no earlier events will arrive — closes windows.
- Backpressure
- Mechanism for slow consumers to signal upstream to slow down.
- Idempotent Operation
- Applying it many times has the same effect as once — safe to retry.
- Checkpoint
- Periodic durable snapshot of operator state for failure recovery.
Multiple choice
What is the key architectural difference between an AMQP/JMS-style broker (like RabbitMQ) and a log-based broker (like Kafka)?
Spot the issue
A service does dual-writes: it writes to PostgreSQL and to Elasticsearch on every update. Periodically the search index diverges from the database. What's the standard fix?
Multiple choice
A stream processor must aggregate sensor readings by "the minute they were generated," but readings arrive out of order due to network delays. What two concepts must the system separate, and what mechanism handles late events?
True / False
"Exactly-once" stream processing is achieved by guaranteeing each message is physically delivered exactly one time over the network.
The Future of Data Systems
Synthesizes the book's threads into a vision: applications composed of specialized storage and processing systems, kept in sync by event streams with one as the system of record. Closes with a frank discussion of end-to-end correctness, ethics, and engineer responsibility.
Unbundling the Database
Real apps already use many datastores: OLTP DB, search index, cache, warehouse. Rather than hide that under one monolith, embrace it: a durable event log is the "outer join engine," with specialized derived systems subscribing to it.
The Stream as System of Record
The event log is the authoritative source of truth. All other databases (indexes, caches, aggregates) are derived views that can be rebuilt by replaying. This kills the dual-write inconsistency class of bugs.
Lambda & Kappa Architectures
Lambda: a batch layer for accuracy + reprocessing alongside a speed layer for low-latency approximate results, merged at query. Kappa simplifies this: *one* stream pipeline that can also replay history.
The End-to-End Argument
Lower layers (TCP checksums, DB transactions) are not enough. Correctness must be enforced at the application boundary. Idempotency keys, request IDs, end-to-end checksums catch bugs lower layers miss.
Loose Coupling via Async Events
Synchronous service calls create tight coupling and cascading failures. Async event streams decouple producers and consumers in time and identity — letting each component evolve independently.
Reads as Subscriptions
A read against a derived view is logically a subscribe to a long-running query. Treating reads as streams (push updates, materialized views, GraphQL subscriptions) blurs the read/write line and powers responsive UIs.
Correctness Without Distributed Transactions
True distributed serializable transactions (2PC across services) are expensive and fragile. Prefer deterministic single-leader writes to a log, idempotent consumers, async compensation. Get correctness without the brittle coordination.
Ethics, Privacy, and Surveillance
Data systems have real social consequences: bias in algorithmic decisions, surveillance capitalism, predictive feedback loops. Engineers must weigh consent, minimization, transparency, and harm. Data is never neutral.
- System of Record
- The canonical authoritative source of truth for a piece of data.
- Derived Dataset
- Built from a source by transformation; always regeneratable.
- Unbundling
- Decomposing the monolithic database into composable parts (log, index, cache, query).
- Lambda Architecture
- Dual pipeline: batch (accurate) + streaming (timely), merged at query time.
- Kappa Architecture
- Simplification of Lambda using a single stream pipeline for both real-time and reprocessing.
- Idempotency Key
- Unique client-generated request ID so the server can deduplicate retries.
- End-to-End Argument
- Reliability checks belong at the endpoints, not intermediate layers.
- Surveillance Capitalism
- Business model monetizing captured behavioral data — raising data-system ethics stakes.
Multiple choice
In Kleppmann's "unbundled database" vision, what plays the role of the system of record?
Spot the issue
A team layered TCP checksums, DB transactions, and idempotent message-broker delivery — yet a user is still occasionally double-charged. What's the principle they're forgetting?
Multiple choice
Lambda architecture vs. Kappa architecture — what's the simplification Kappa proposes?
True / False
Kleppmann argues that engineers can treat data systems as ethically neutral — the responsibility for misuse lies entirely with business stakeholders.
Key Takeaways
Reliability, scalability, maintainability are the three non-negotiable axes — and percentiles, not averages, tell you the truth.
Storage engines come in two flavors — log-structured (LSM) and page-oriented (B-tree) — with opposite write/read tradeoffs and write amplification profiles.
Schema evolution with backward AND forward compatibility is mandatory for any service deployed continuously; choose binary schema formats (Avro, Protobuf) for this.
Distributed systems fail partially, in nondeterministic ways — clocks lie, networks drop packets, processes pause. Defend with quorums, leases, and fencing tokens.
Linearizability, consensus, total order broadcast, and atomic commit are reducible to one another — and all require a majority of nodes to be live.
The future is a durable event log as the system of record, with specialized derived views (indexes, caches, warehouses) kept in sync via streams — unbundling the database.