Redr
1 / 206

Redr · Study Guide

Database Internals

A Deep Dive into How Distributed Data Systems Work

Alex Petrov

Unofficial AI-assisted study guide. Not affiliated with or endorsed by the author or publisher. For educational use — supplements, not replaces, the original work.

Contents

Part 01
Storage Engines
  • 01Introduction and Overview
  • 02B-Tree Basics
  • 03File Formats
  • 04Implementing B-Trees
  • 05Transaction Processing and Recovery
  • 06B-Tree Variants
  • 07Log-Structured Storage
Part 02
Distributed Systems
  • 08Introduction and Overview
  • 09Failure Detection
  • 10Leader Election
  • 11Replication and Consistency
  • 12Anti-Entropy and Dissemination
  • 13Distributed Transactions
  • 14Consensus

Part 01

Storage Engines

Ch. 1–7

Ch. 01

Introduction and Overview

Establishes the conceptual map of a DBMS, decomposing it into transport, query processor, execution engine, and storage engine, then introduces the taxonomies — memory vs. disk resident, row vs. column, OLTP vs. OLAP, mutable vs. immutable — that classify every storage engine in the book.

Ch. 01

DBMS Architecture Layers

A database is split into a transport subsystem (client and peer communication), a query processor (parser, optimizer), an execution engine, and a storage engine. Understanding these boundaries clarifies where each design decision lives — the storage engine alone owns durability, concurrency, recovery, and on-disk layout.

Ch. 01

Memory-Resident vs. Disk-Resident

In-memory engines treat RAM as primary and disk as a log/backup; disk-based engines treat disk as primary and RAM as a cache. This single choice cascades into data-structure selection: T-trees and skip lists for memory-resident, B-trees and LSM-trees for disk-resident.

Ch. 01

Row-Oriented vs. Column-Oriented

Row stores keep all fields of a record contiguous, which is ideal for OLTP point lookups and full-row reads. Column stores keep values of the same attribute contiguous, which is ideal for OLAP scans, aggregations, and aggressive compression because columns are homogeneous in type.

Ch. 01

OLTP vs. OLAP vs. HTAP

OLTP is many small, latency-sensitive transactions touching few rows; OLAP is long-running aggregate queries over huge data sets; HTAP systems try to serve both from one engine. Workload shape — not the data model — determines which storage engine fits.

Ch. 01

Mutable vs. Immutable Storage

Engines that update records in place (B-trees) optimize for read latency and predictable space use. Engines that only append (LSM-trees) optimize for write throughput by turning random writes into sequential ones, at the cost of background compaction work.

Ch. 01

Comparing Databases Is Hard

Benchmarks, feature lists, and CAP-style labels mislead more than they inform. The only reliable basis for comparison is understanding how an engine actually handles writes, indexes, durability, and concurrency under your specific workload.

Ch. 01 · Vocab
Storage engine
The DBMS layer that organizes data on the underlying medium and exposes read/write/update/delete primitives.
Query optimizer
Transforms a parsed query into an efficient execution plan by choosing join order, access paths, and indexes.
OLTP
Online Transaction Processing — many short, latency-sensitive transactions.
OLAP
Online Analytical Processing — long-running, read-mostly queries over large data sets.
Ch. 01 · Vocab
HTAP
Hybrid Transactional/Analytical Processing — one engine serving both workloads.
Buffer pool
In-memory cache of disk pages managed by the storage engine to amortize I/O.
Page
The fixed-size unit of disk-to-memory transfer, aligned to filesystem or hardware block size.
Transport subsystem
The layer that manages client and inter-node communication.
Ch. 01 · Quiz1 / 4

Multiple choice

Your team is benchmarking three databases for a workload of long-running aggregate queries across billions of rows with very few writes. Which storage-engine property matters most for picking a winner?

Ch. 01 · Quiz2 / 4

Multiple choice

A vendor claims their engine is "the fastest database" based on a generic benchmark and a CAP-style label. According to the chapter, what is the most reliable way to actually compare engines?

Ch. 01 · Quiz3 / 4

Spot the issue

A team picks an LSM-tree engine for a read-latency-critical OLTP workload with very few writes, citing "LSM is the modern choice." What's the main risk?

Ch. 01 · Quiz4 / 4

True / False

The storage engine is the DBMS layer responsible for parsing SQL and choosing an execution plan.

Ch. 02

B-Tree Basics

Motivates why B-trees became the dominant on-disk index by contrasting them with binary search trees and hash indexes, then walks through their core invariants: high fanout, balanced height, and block-aligned nodes. Establishes the vocabulary needed before the operational treatment in Chapter 4.

Ch. 02

Why Not Binary Search Trees on Disk

A BST has fanout 2 and unpredictable balance, so on disk it means deep trees and many random I/Os per lookup. Worse, each node is far smaller than a disk block, wasting almost the entire page on every read.

Ch. 02

Block-Oriented Design

B-trees pack hundreds or thousands of keys per node so a single page read yields many comparisons. Node size is deliberately matched to the underlying page/block size — the unit of physical I/O — so no read is ever wasted.

Ch. 02

High Fanout, Low Height

Because each internal node has hundreds of children, even billion-record B-trees are only 3-4 levels deep. Tree height is logarithmic in the number of records but with an enormous base, which bounds the number of I/Os per lookup.

Ch. 02

Balanced by Construction

B-trees guarantee that all leaves sit at the same depth. Splits propagate upward toward the root rather than letting one path grow long, so worst-case and average-case lookup latencies stay close.

Ch. 02

B+ Tree Variant

In the variant used by virtually every modern database, all data lives in the leaves and internal nodes hold only separator keys. Leaves are linked into a sequence so range scans walk the leaf chain instead of re-traversing the tree.

Ch. 02

Occupancy Invariants

Each node must stay between a minimum (often half full) and maximum capacity. Violations trigger splits or merges, which is what keeps the tree both balanced in height and reasonably full on disk.

Ch. 02 · Vocab
B-tree
A self-balancing, block-oriented search tree with high fanout, designed for block-addressable storage.
B+ tree
A B-tree variant where data lives only in leaves and internal nodes hold only separators; leaves are typically linked.
Fanout
The number of children an internal node can have; high fanout gives B-trees their shallow depth.
Separator key
A key in an internal node used to route searches to the correct child subtree.
Ch. 02 · Vocab
Occupancy
The fraction of a node's capacity in use, bounded above and below.
Tree height
The number of edges from root to leaf; logarithmic in record count with base equal to fanout.
Root, internal node, leaf
The three structural roles — entry point, router, and data holder.
Node (page)
One tree node corresponds to one page on disk; the unit of allocation and I/O.
Ch. 02 · Quiz1 / 4

Multiple choice

A B-tree indexes one billion records with a fanout of roughly 500 per internal node. Approximately how many disk page reads does a single point lookup require in the worst case?

Ch. 02 · Quiz2 / 4

Multiple choice

Why is a balanced binary search tree a poor on-disk index, even though it gives O(log n) lookups in memory?

Ch. 02 · Quiz3 / 4

Multiple choice

A workload runs frequent range scans over sorted ranges of keys. Which B-tree variant trait makes these scans cheap?

Ch. 02 · Quiz4 / 4

Spot the issue

A custom B-tree implementation skips occupancy invariants — nodes are allowed to become arbitrarily empty after deletes and arbitrarily full before splits. What's the main consequence?

Ch. 03

File Formats

Zooms out from the tree algorithm to the bytes on disk: how database files are organized into pages, how pages hold variable-length records, and how slotted pages, headers, cell pointers, checksums, and magic numbers make files self-describing, robust, and version-evolvable.

Ch. 03

Files as Page Arrays

A database file is a sequence of fixed-size pages identified by page IDs. A page ID translates directly to a file offset by multiplication, so any page can be located in O(1) without an index.

Ch. 03

Slotted Page Layout

The canonical solution for variable-length records: an array of slot pointers grows from one end of the page while cell payloads grow from the other, with free space in the middle. This decouples logical order (slot order) from physical layout.

Ch. 03

Cells and Cell Pointers

Variable-length payloads are stored as cells. An indirection array of slot pointers holds the byte offsets of cells, so the page can reorder its logical contents without physically moving bytes — and external pointers can target a slot without breaking on in-page reorganization.

Ch. 03

Page Header

Every page begins with a fixed-size metadata block: page type, free-space pointers, cell count, checksum, and sometimes sibling pointers. The header is what makes a page self-identifying after a crash or upgrade.

Ch. 03

Magic Numbers, Checksums, Versioning

Files embed magic bytes identifying the format, a version number so the engine can evolve the layout, and per-page checksums (often CRC) to detect silent corruption. Together they make the file format both robust and forward-compatible.

Ch. 03

Logical vs. Physical Pointers

Within a page, cells reference one another by slot index (logical), so in-page compaction never invalidates references. Across pages, references use page IDs (physical), which decouples in-page reorganization from cross-page links.

Ch. 03 · Vocab
Page
A fixed-size, contiguous region of a database file; the unit of I/O.
Slotted page
A page layout with a header, slot pointers growing from one end, and cell payloads growing from the other.
Cell
The on-page representation of a single record, variable in size.
Cell pointer (slot)
A small fixed-size entry giving the offset and length of a cell.
Ch. 03 · Vocab
Page header
Fixed-size metadata block at the start of a page.
Checksum
Value computed over a page's bytes, stored in the header to detect corruption.
Magic number
Byte sequence at the start of a file identifying its format and version.
Varint
Variable-length integer encoding using fewer bytes for small values.
Ch. 03 · Vocab
Page ID
Logical identifier that maps to a page's offset in the file.
Ch. 03 · Quiz1 / 4

Multiple choice

A database file is laid out as a sequence of fixed-size pages. How does the engine locate page N within the file?

Ch. 03 · Quiz2 / 4

Multiple choice

Why does the slotted page layout grow the slot pointer array from one end of the page and cell payloads from the other, leaving free space in the middle?

Ch. 03 · Quiz3 / 4

Multiple choice

An on-disk database file silently has a few bits flipped by a failing storage device. Which mechanism described in the chapter detects this corruption?

Ch. 03 · Quiz4 / 4

Spot the issue

An engineer proposes that, within a page, cells should reference each other by absolute byte offset rather than by slot index, "to save one level of indirection on reads." What goes wrong?

Ch. 04

Implementing B-Trees

The operational counterpart to Chapter 2: how the abstract algorithms (search, insert, delete, split, merge) actually run on top of the slotted-page format from Chapter 3, plus the engineering refinements real systems use — sibling links, rightmost-append optimization, key compression, bulk loading, and free-page management.

Ch. 04

Page Splits

When an insert would overflow a node, the node is split in two: roughly half the entries go to each half and a new separator is promoted to the parent. The split can cascade up to the root, which is the only way a B-tree grows taller.

Ch. 04

Page Merges and Redistribution

When a delete leaves a node under-occupied, the engine either redistributes entries with a sibling (cheap) or merges two siblings into one and removes the separator from the parent (more expensive). Like splits, merges can cascade upward and reduce tree height.

Ch. 04

Sibling Links Between Leaves

Leaves carry pointers to the next (and sometimes previous) sibling, so a range scan walks the leaf chain without re-descending the tree. These links must be maintained carefully through splits and merges, especially under concurrent access.

Ch. 04

Rightmost-Leaf Optimization

Monotonically increasing keys — autoincrement IDs, timestamps — always hit the rightmost leaf. Engines special-case this path to append or split sparsely instead of doing 50/50 splits that would leave half-empty pages forever.

Ch. 04

Key Compression

Adjacent keys on a node usually share a prefix. Prefix compression stores only the differing bytes; suffix truncation picks the shortest separator that still distinguishes children. Both pack more keys per page and raise fanout.

Ch. 04

Bulk Loading

Building a B-tree from sorted input by filling leaves to capacity and constructing internal levels bottom-up produces a denser, better-balanced tree than repeated inserts. Used when restoring backups or building a new index.

Ch. 04

Latch Coupling

Operations that change tree structure must coordinate access to multiple pages. Latch coupling (crabbing) acquires a child latch before releasing the parent's, ensuring safe navigation even as splits and merges run concurrently.

Ch. 04 · Vocab
Page split
Structural modification that creates a new sibling when a node overflows, promoting a separator upward.
Page merge
Structural modification that combines two under-occupied siblings into one.
Redistribution
Moving entries between siblings to restore occupancy without merging.
Sibling pointer
A pointer linking a node to its same-level neighbor, used for scans.
Ch. 04 · Vocab
Prefix compression
Storing only the differing suffix of each key against a shared page-wide prefix.
Suffix truncation
Choosing the shortest separator key needed to split left from right.
Bulk loading
Building a B-tree bottom-up from sorted input rather than via insertions.
Latch vs. lock
Latches are short-lived primitives on in-memory pages; locks are transactional and protect logical data.
Ch. 04 · Vocab
Latch coupling (crabbing)
Holding a child latch before releasing the parent's during traversal.
Ch. 04 · Quiz1 / 4

Multiple choice

A table's primary key is a monotonically increasing autoincrement ID, and the engine uses standard 50/50 splits on every insert at the rightmost leaf. What inefficiency does the chapter call out?

Ch. 04 · Quiz2 / 4

Multiple choice

Building a fresh B-tree index from a large already-sorted dataset, which approach produces the densest and most balanced result?

Ch. 04 · Quiz3 / 4

Multiple choice

Concurrent inserts are running on a B-tree while another thread is descending it for a lookup. Which technique ensures the descending thread does not navigate into a page mid-split?

Ch. 04 · Quiz4 / 4

Spot the issue

A team disables prefix compression and suffix truncation on internal nodes, arguing the code is simpler and "modern disks are fast." What's the main downside the chapter highlights?

Ch. 05

Transaction Processing and Recovery

Explores how databases guarantee ACID through the interplay of buffer manager, lock manager, page cache, and log manager. Walks through buffer policies (steal/no-steal, force/no-force), the ARIES recovery algorithm with its three phases, and concurrency control mechanisms from 2PL through OCC to MVCC.

Ch. 05

ACID Properties

Atomicity guarantees all-or-nothing; Consistency preserves invariants; Isolation hides concurrent effects; Durability ensures committed data survives crashes. These four guarantees are the contract a transaction manager must honor.

Ch. 05

Write-Ahead Log

A durable append-only log where every change is recorded before being applied to data pages. Durability falls out of WAL persistence, and recovery falls out of WAL replay — without a WAL there is no ACID.

Ch. 05

Steal and Force Policies

Steal allows dirty (uncommitted) pages to be flushed (requires UNDO logging); No-Force lets committed pages stay in memory at commit (requires REDO logging). Most modern systems pick steal + no-force for performance, which mandates both UNDO and REDO in the WAL.

Ch. 05

ARIES Recovery

A three-phase algorithm: Analysis scans forward from the last checkpoint to rebuild the dirty-page and transaction tables; Redo repeats history by reapplying all logged updates; Undo rolls back losers using compensation log records (CLRs) so undo itself is idempotent and can be redone if recovery is interrupted.

Ch. 05

Concurrency Control Approaches

Pessimistic (lock-based, e.g., 2PL) assumes conflicts and prevents them up front; Optimistic (OCC) lets transactions proceed and validates at commit; Multi-version (MVCC) keeps multiple versions so readers don't block writers. Each trades latency, throughput, and abort rate differently.

Ch. 05

Isolation Levels and Anomalies

Read Uncommitted, Read Committed, Repeatable Read, Snapshot Isolation, and Serializable progressively rule out anomalies — dirty reads, non-repeatable reads, phantom reads, lost updates, and write skew. Snapshot isolation prevents most anomalies but still allows write skew.

Ch. 05

Latches vs. Locks

Locks are logical, protect rows or tables, last for the whole transaction, and live in the lock manager. Latches are physical, protect in-memory page structures, last microseconds, and are essentially mutexes — confusing them is one of the classic database bugs.

Ch. 05 · Vocab
LSN
Log Sequence Number — monotonically increasing WAL record ID; orders operations and tracks which records have been applied.
Checkpoint
A persisted snapshot of dirty-page and active-transaction state that bounds recovery work.
CLR
Compensation Log Record — a record written during undo that makes undo idempotent.
Dirty Page Table
In-memory structure tracking buffer-pool pages with unflushed modifications.
Ch. 05 · Vocab
Snapshot Isolation
MVCC level where each transaction reads from a consistent snapshot taken at its start.
Phantom Read
Range query returns rows inserted by another transaction between executions.
Deadlock detection
Building a waits-for graph and breaking cycles by aborting a victim.
Buffer pool
In-memory page cache whose eviction policy interacts with the WAL via steal.
Ch. 05 · Quiz1 / 4

Multiple choice

A storage engine uses the steal + no-force buffer policy. Which combination of logging is mandatory for correct recovery?

Ch. 05 · Quiz2 / 4

Multiple choice

During ARIES recovery, why are Compensation Log Records (CLRs) written during the Undo phase rather than simply rolling back in memory?

Ch. 05 · Quiz3 / 4

Spot the issue

A team debugging a deadlock decides to hold page latches for the duration of each transaction instead of just for the page operation, hoping this will serialize access. What's wrong?

Ch. 05 · Quiz4 / 4

Multiple choice

A workload runs at Snapshot Isolation. Two concurrent transactions each read the same two on-call rows, then each updates a different row to preserve the invariant "at least one doctor must remain on call." Both commit, and the invariant is violated. Which anomaly occurred?

Ch. 06

B-Tree Variants

Surveys alternative B-tree designs that address shortcomings of the classic B+Tree — concurrency bottlenecks, write amplification from in-place updates, and poor fit for flash. Variants include copy-on-write trees, lazy B-trees, FD-trees, Bw-trees, and the B-link concurrency optimization.

Ch. 06

Copy-on-Write B-Trees

Instead of mutating pages in place, modifications write new copies of affected pages and propagate up to a new root. This yields wait-free reads and crash safety without a WAL (used by LMDB), at the cost of write amplification and shadowing the entire path on every write.

Ch. 06

Shadow Paging

The mechanism underlying copy-on-write: a shadow version of the tree is built, then committed by atomically swapping the root pointer. Old versions remain readable until garbage collected, which is what enables snapshot reads for free.

Ch. 06

B-link Trees

Each node carries a right-sibling link, and node splits become a two-step process: link the new node into the sibling chain first, then register it in the parent. A descending reader that lands on a node that has since split can simply follow the link, allowing lock-free or minimally-locked descent.

Ch. 06

Lazy B-Trees

Each node has an in-memory write buffer that absorbs small random writes; buffers are flushed in bulk. This amortizes I/O cost across many operations and is the strategy used by buffered-tree designs like WiredTiger.

Ch. 06

FD-Trees

Designed for flash storage: a small head B-tree sits over a cascade of sorted runs on flash, using fractional cascading so lookups stay efficient while writes become sequential. This reduces write amplification, which matters because flash wears out with random writes.

Ch. 06

Bw-Trees

A latch-free B-tree variant from Microsoft Research that maps page IDs to memory via a mapping table, with updates installed as delta records chained via compare-and-swap. Periodic consolidation flattens chains; the design is well-suited to multicore in-memory workloads where latches become the bottleneck.

Ch. 06 · Vocab
Copy-on-write
Writes create new pages instead of mutating; the root swap is the commit point.
Shadow paging
The mechanism behind copy-on-write — build a shadow tree, atomically swap the root.
B-link tree
B-tree with right-sibling pointers enabling concurrent descent through in-progress splits.
Delta chain
Bw-tree's linked list of update records prepended to a base page.
Ch. 06 · Vocab
Mapping table
Bw-tree's indirection layer mapping page IDs to in-memory pointers.
Consolidation
Bw-tree process of flattening a delta chain back into a single base page.
Write amplification
Bytes physically written per byte logically written by the application.
Fractional cascading
Inter-level pointer hints that reduce multi-level search cost from O(log²n) to O(log n).
Ch. 06 · Quiz1 / 4

Multiple choice

A copy-on-write B-tree like LMDB modifies a leaf record. Which set of pages physically gets rewritten on disk before the commit becomes visible?

Ch. 06 · Quiz2 / 4

True / False

In a B-link tree, a reader that descends to a node which has since been split by a concurrent writer must restart its traversal from the root.

Ch. 06 · Quiz3 / 4

Multiple choice

What is the role of the mapping table in a Bw-tree, and why does it enable latch-free updates?

Ch. 06 · Quiz4 / 4

Spot the issue

An engineer picks a copy-on-write B-tree for a workload of many small random updates, expecting "no WAL means faster writes." A month later they're seeing terrible write throughput and far more bytes hitting disk than expected. What's the chapter-level diagnosis?

Ch. 07

Log-Structured Storage

Introduces log-structured merge trees as an alternative to in-place B-tree updates, optimizing for write-heavy workloads by buffering in a memtable and flushing immutable SSTables. Covers compaction strategies, bloom filters, the three amplification factors, and the read-path machinery (tombstones, sparse indexes, merge iterators) that makes LSMs viable.

Ch. 07

LSM Tree Architecture

Writes go to an in-memory memtable (skip list or balanced tree); when full, the memtable is flushed to disk as an immutable SSTable. Reads must consult the memtable and potentially many on-disk SSTables, merged at read time.

Ch. 07

SSTables

Sorted String Tables — immutable on-disk files with key-sorted entries, a sparse index, and a metadata footer. Immutability means reads need no locking, writes are pure sequential I/O, and compaction can merge files cheaply.

Ch. 07

Compaction Strategies

Size-tiered (Cassandra) groups same-size SSTables and merges them — high write throughput, more space amplification. Leveled (RocksDB, LevelDB) keeps non-overlapping SSTables per level — lower space and read amplification, more write amplification.

Ch. 07

The Three Amplifications

Write amplification is bytes written to disk per byte of user write; read amplification is on-disk reads per user lookup; space amplification is disk space per byte of live data. LSM tuning is fundamentally about trading these three off — you cannot minimize all simultaneously.

Ch. 07

Bloom Filters

Per-SSTable probabilistic structures that definitively say a key is absent (no false negatives) and may give "possibly present" for hits. They eliminate disk seeks for misses, which is what keeps point-lookup read amplification manageable.

Ch. 07

Tombstones

Special delete markers that suppress earlier values during merges. A tombstone can only be discarded once compaction guarantees no older shadowed value of the key remains — which is why deletes in LSMs are not free.

Ch. 07

Merge Iterators

A read consults the memtable and SSTables newest to oldest, using a priority-queue-style merge iterator that yields the newest version of each key while skipping shadowed entries and tombstones. The same machinery powers range scans and compaction.

Ch. 07 · Vocab
Memtable
In-memory sorted buffer (skip list) absorbing writes before flush.
SSTable
Sorted String Table — immutable, key-sorted on-disk file with data, sparse index, metadata.
Compaction
Background merge of SSTables that drops tombstoned/superseded keys.
Leveled compaction
Strategy with non-overlapping SSTables per level and size ratios between levels.
Ch. 07 · Vocab
Size-tiered compaction
Strategy that buckets SSTables by size and merges full buckets.
Sparse index
One index entry per data block, scanned linearly within a block.
Tombstone
Delete marker that persists until compaction guarantees no older value remains.
Bloom filter
Probabilistic set with no false negatives, used to skip absent-key SSTables.
Ch. 07 · Vocab
Read amplification
Physical reads per logical read; minimized via bloom filters and sparse indexes.
Ch. 07 · Quiz1 / 4

Multiple choice

On a point lookup for a key that does not exist, what keeps an LSM tree's read amplification from blowing up across many SSTables?

Ch. 07 · Quiz2 / 4

Multiple choice

A team picks size-tiered compaction for an analytics store with frequent overwrites of the same keys. They later complain that disk usage is roughly 2x the live data set. What's the chapter's explanation?

Ch. 07 · Quiz3 / 4

Multiple choice

Why can a tombstone in an LSM not be discarded as soon as compaction merges it with a value?

Ch. 07 · Quiz4 / 4

Spot the issue

A team migrates a heavy-write, point-lookup-dominated workload from B-tree to LSM and disables bloom filters to save the per-SSTable memory overhead. Reads collapse to a few hundred QPS. What's the diagnosis?

Part 02

Distributed Systems

Ch. 8–14

Ch. 08

Introduction and Overview

Opens Part II by framing the fundamental challenges of distributed systems: independent nodes coordinating over unreliable networks with partial failures. Surveys the foundational concepts — concurrency, asynchrony, partial failure, time — and the theoretical bounds (FLP, CAP) that define what is achievable.

Ch. 08

Fallacies of Distributed Computing

Peter Deutsch's classic list of false assumptions: the network is reliable, latency is zero, bandwidth is infinite, the network is secure, topology doesn't change, there is one administrator, transport cost is zero, the network is homogeneous. Every distributed database must be designed assuming each is false.

Ch. 08

Two Generals Problem

Two parties communicating over a lossy channel cannot achieve guaranteed consensus on a coordinated action using any finite number of messages. The proof motivates why perfect agreement is impossible and why real systems settle for acknowledgments, retries, and probabilistic guarantees.

Ch. 08

FLP Impossibility

Fischer, Lynch, and Paterson proved that in a purely asynchronous system with even one crash-faulty process, no deterministic consensus algorithm can guarantee termination. Real systems sidestep FLP with timeouts, randomization, or partial synchrony.

Ch. 08

System Models

Synchronous systems have known bounds on message delay; asynchronous systems have none; partially synchronous systems behave asynchronously sometimes but eventually become synchronous. Most real systems are designed under partial synchrony — it is the assumption that makes consensus protocols terminate.

Ch. 08

Failure Models

Categorize how processes can fail: crash-stop (halts forever), crash-recovery (restarts with state), omission (drops messages), Byzantine (arbitrary or malicious). The chosen model determines algorithm complexity — Byzantine tolerance, for example, requires 3f+1 nodes instead of 2f+1.

Ch. 08

Safety vs. Liveness

Safety means "nothing bad ever happens" (two leaders are never elected); Liveness means "something good eventually happens" (a leader is eventually elected). FLP shows you cannot always have both in asynchrony, which is why every consensus protocol leans on a timing assumption for liveness.

Ch. 08

Logical vs. Physical Time

Physical clocks drift and skew between nodes; logical clocks (Lamport timestamps, vector clocks) capture causal ordering without depending on wall-clock accuracy. Distributed databases rely on logical time for correctness and physical time only as a hint.

Ch. 08 · Vocab
Distributed system
A collection of independent processes communicating by message passing, presented as a single coherent system.
Partial failure
Failure mode where some components fail while others continue, often indistinguishably.
Consensus
The problem of getting processes to agree on a single value despite failures.
Quorum
A subset of nodes whose agreement suffices to make a decision, typically a majority.
Ch. 08 · Vocab
Asynchrony
No upper bound on message delivery or processing time.
Crash fault
A process simply stops; contrast with Byzantine where it behaves arbitrarily.
State machine replication
Replicate a deterministic state machine by applying the same ordered inputs.
Partial synchrony
Bounds on delay eventually hold, even if not always.
Ch. 08 · Quiz1 / 4

Multiple choice

A team designs a consensus protocol and assumes "if a node hasn't replied in 50ms, it has crashed" everywhere in the code, with no notion of timeouts being only hints. Which fallacy of distributed computing are they most directly violating?

Ch. 08 · Quiz2 / 4

Multiple choice

Which statement most accurately summarizes the FLP impossibility result and how real systems cope with it?

Ch. 08 · Quiz3 / 4

Spot the issue

A storage engineer argues: "Our leader election protocol guarantees both safety (never two leaders) and liveness (a leader is always elected) under any network conditions, including fully asynchronous ones." What's wrong with this claim?

Ch. 08 · Quiz4 / 4

Multiple choice

A team needs to tolerate up to f arbitrary (Byzantine) faults in their replicated system. What is the minimum number of nodes they must deploy, and why does it differ from a crash-only design?

Ch. 09

Failure Detection

Explains how nodes detect peer failure despite the impossibility of distinguishing a slow node from a dead one in an asynchronous network. Walks from naive timeouts through adaptive Phi Accrual detectors to gossip-based SWIM, and the fundamental trade-off between completeness and accuracy.

Ch. 09

Heartbeats and Pings

The two basic primitives: a heartbeat is a periodic "I'm alive" pushed by the monitored node; a ping is a request-reply from the monitor. Both rely on timeouts, which conflate slowness with failure — the central problem of detection.

Ch. 09

Completeness vs. Accuracy

Chandra and Toueg's classification: completeness is the guarantee that crashed processes are eventually suspected; accuracy is the guarantee that healthy processes are not wrongly suspected. No detector in an async system has strong accuracy, so practical detectors are eventually perfect (◇P).

Ch. 09

Phi Accrual Failure Detector

Used in Cassandra and Akka: instead of a boolean alive/dead verdict, it outputs a continuous suspicion level φ based on the statistical distribution of past inter-arrival times. Applications choose their own threshold to balance speed against false positives.

Ch. 09

SWIM Protocol

Scalable Weakly-consistent Infection-style process group Membership: each node periodically pings a random peer; if that ping fails, it asks k other peers to probe indirectly before declaring suspicion. Membership changes spread epidemically, giving O(1) load per node regardless of cluster size.

Ch. 09

Indirect Probing

When a direct ping fails, the monitor asks intermediaries to probe on its behalf, distinguishing a failed local link from a failed node. This is the single most effective trick for cutting false-positive rates in real clusters.

Ch. 09

Suspicion vs. Failure

Many systems use a two-stage state machine: a node first becomes "suspected," and only after corroboration or a longer timeout does it transition to "failed." This dampens flapping during transient network issues or GC pauses.

Ch. 09 · Vocab
Failure detector
Subsystem giving each process a list of currently suspected peers.
Eventually perfect detector (◇P)
Complete and eventually accurate; sufficient for consensus under partial synchrony.
False positive
Declaring a live node dead, often from GC pauses or transient network delays.
Inter-arrival time
Interval between consecutive heartbeats; the basis of statistical detectors.
Ch. 09 · Vocab
Suspicion level (φ)
Continuous value derived from the probability a heartbeat is still in flight.
Indirect ping
Probe routed through intermediaries to disambiguate link from node failure.
Membership protocol
Mechanism by which nodes learn the current cluster roster.
Epidemic dissemination
Gossip-style spread of information, achieving full coverage in O(log N) rounds.
Ch. 09 · Quiz1 / 4

Multiple choice

Operators tune their boolean alive/dead timeout aggressively to detect failures faster, but now they get a flood of false positives during GC pauses. Which failure detector design directly addresses this by replacing the boolean verdict with a tunable, statistical signal?

Ch. 09 · Quiz2 / 4

Multiple choice

In Chandra and Toueg's classification, which property does a practical failure detector in an asynchronous network typically achieve, given that strong accuracy is impossible?

Ch. 09 · Quiz3 / 4

Spot the issue

A monitor pings node N and the ping times out. The monitor immediately marks N as failed and removes it from the cluster. What's the most likely problem with this design, and what mechanism would fix it?

Ch. 09 · Quiz4 / 4

True / False

SWIM scales because the per-node load grows with cluster size, since every node must heartbeat every other node every interval.

Ch. 10

Leader Election

How a group agrees on a single coordinator to serialize decisions, drive replication, or commit writes. Contrasts classical algorithms (Bully, Ring) with the practical mechanisms real systems use — leader leases, stable leaders, epoch numbers — and warns about split-brain and fencing.

Ch. 10

Why Elect a Leader

A single leader serializes writes, orders operations, and avoids running consensus per action. The trade-off is that the leader is a bottleneck and a failure point — so re-election must be fast and safe enough to keep the cluster alive.

Ch. 10

Bully Algorithm

When a node notices the leader is gone, it sends an ELECTION message to all higher-ID nodes; if none reply, it declares itself leader. Simple but produces O(N²) messages and can run multiple concurrent elections during partitions.

Ch. 10

Ring-Based Election

Nodes are arranged in a logical ring; election messages carry the highest ID seen so far and circulate until one returns to its originator. Lower message count than Bully but assumes the ring topology is intact, which is fragile under partitions.

Ch. 10

Leader Leases

A leader holds a time-bounded lease and must renew it before expiry; other nodes will not elect a new leader until the lease is known to have expired. Leases bound the window of dual leadership and rely on bounded clock drift rather than perfect synchrony.

Ch. 10

Stable Leader Optimization

Once elected, the leader stays in place across many rounds of operations, amortizing the cost of election. Most real systems — Raft, Multi-Paxos, ZooKeeper's ZAB — keep the leader until it visibly fails.

Ch. 10

Split Brain

A partition causes two subsets to each elect their own leader, both accepting writes. Mitigated by requiring a leader to hold a majority quorum (so at most one partition can elect) and by fencing tokens that invalidate stale leaders' writes downstream.

Ch. 10

Fencing Tokens

A monotonically increasing identifier issued with each new leadership term; downstream systems reject operations bearing stale tokens. This is what ensures a deposed leader cannot corrupt state after a new one takes over.

Ch. 10 · Vocab
Leader / coordinator
A node granted authority to serialize decisions for the group.
Term / epoch
Monotonically increasing integer identifying a leadership tenure.
Lease
Time-bounded grant of authority that must be renewed.
Quorum-based election
Requires a majority vote, ensuring at most one leader per term.
Ch. 10 · Vocab
Split brain
Simultaneous leaders in disjoint partitions.
Fencing token
Monotonic ID that lets downstream resources reject stale leaders.
Election timeout
Duration after which a follower starts a new election.
Stable leader
A leader that persists across operations to amortize election cost.
Ch. 10 · Quiz1 / 4

Multiple choice

A cluster of 5 nodes loses network connectivity between two subgroups of sizes 3 and 2. Each subgroup runs the Bully algorithm and elects a leader; both leaders begin accepting writes. What's the cleanest structural fix to make split brain impossible?

Ch. 10 · Quiz2 / 4

Multiple choice

A leader holds a 10-second lease and crashes 1 second into it. The cluster waits for the lease to expire before electing a new leader. What core safety property does the lease provide, and what physical assumption does it require?

Ch. 10 · Quiz3 / 4

Spot the issue

A new leader is elected, but the deposed previous leader (whose lease has expired in the cluster's view, but whose local clock thinks it's still leader) sends a write to the storage layer. The storage layer accepts it, corrupting state. What mechanism would have prevented this?

Ch. 10 · Quiz4 / 4

Multiple choice

A team complains that their consensus cluster runs an election before every single client write, eating throughput. Which optimization, used by Raft, Multi-Paxos, and ZAB, directly addresses this?

Ch. 11

Replication and Consistency

How multiple replicas hold copies of data and still present a coherent view despite concurrent updates, failures, and partitions. Walks through the consistency hierarchy from linearizability down to eventual consistency, introduces CAP and PACELC, and the practical tools — quorums, read repair, hinted handoff, vector clocks — that real systems use.

Ch. 11

CAP Theorem

Under a network partition, a system must choose between consistency and availability — only two of three are guaranteed. CAP applies only during partitions, which is why PACELC was introduced to capture the normal-operation trade-off too.

Ch. 11

PACELC

Extends CAP: under Partition, choose Availability or Consistency; Else (normal operation), choose Latency or Consistency. Captures the fact that even healthy systems trade strong consistency for lower latency — Dynamo-style stores are PA/EL; HBase is PC/EC.

Ch. 11

Consistency Models

A ladder of guarantees: linearizability (atomic and respecting real-time order globally), sequential (one order consistent with each client's program order), causal (cause-and-effect preserved; concurrent operations may differ across observers), eventual (replicas converge absent new writes).

Ch. 11

Quorums and R+W>N

With N replicas, a write quorum W and read quorum R with R+W>N ensures every read overlaps the latest write, giving strong consistency for single keys. Tuning R and W trades read vs. write latency and durability.

Ch. 11

Read Repair and Anti-Entropy

Read repair fixes stale replicas opportunistically when a quorum read detects divergence. Anti-entropy (e.g., Merkle-tree sync) runs in the background to converge all replicas regardless of read pattern. Together they provide convergence in eventually consistent systems.

Ch. 11

Hinted Handoff

When a target replica is unreachable, the coordinator stores the write locally as a hint and replays it once the target recovers. Improves availability and durability without blocking the writer — though unbounded hint queues become their own problem.

Ch. 11

Vector Clocks

A vector of per-node logical counters attached to each value; comparing vectors classifies updates as causally prior, later, or concurrent. Concurrent versions are siblings that must be reconciled by the application, by CRDTs, or by last-write-wins.

Ch. 11

Session Guarantees

Per-client middle-ground promises proposed by Terry et al.: read-your-writes, monotonic reads, monotonic writes, and writes-follow-reads. They are weaker than linearizability but eliminate the worst surprises in eventually consistent systems.

Ch. 11 · Vocab
Linearizability
Strongest single-object consistency — atomic, respects real-time order globally.
Eventual consistency
Liveness guarantee — replicas converge once writes stop.
Strong eventual consistency
Replicas with the same updates are in the same state (CRDTs).
Quorum (N, R, W)
Total replicas, read quorum, write quorum; R+W>N is the strong-consistency condition.
Ch. 11 · Vocab
Conflict / sibling version
Two concurrent updates incomparable under vector clocks.
Read repair
Synchronous reconciliation triggered by a divergent quorum read.
Anti-entropy
Background reconciliation that converges replicas independent of foreground requests.
Hinted handoff
Local storage of a write for an unreachable replica, replayed when it returns.
Ch. 11 · Vocab
Tunable consistency
Per-operation choice of R and W (Cassandra/Dynamo style).
Ch. 11 · Quiz1 / 4

Spot the issue

A team configures a Dynamo-style store with N=3, R=1, W=1 and expects "strong consistency" because every key has 3 replicas. What's wrong?

Ch. 11 · Quiz2 / 4

Multiple choice

A vendor markets a "CA" database that gives both consistency and availability and "doesn't trade off anything." According to PACELC, what real trade-off does even a healthy, partition-free deployment face?

Ch. 11 · Quiz3 / 4

Multiple choice

Two clients concurrently update the same key on different replicas. A vector clock comparison shows neither version's vector dominates the other. What does this mean and what should the system do?

Ch. 11 · Quiz4 / 4

Multiple choice

A user reports that after writing a comment on their phone, refreshing the page sometimes shows the comment missing, then it reappears on a later refresh. Which session guarantee would, if enforced, eliminate this surprise without requiring full linearizability?

Ch. 12

Anti-Entropy and Dissemination

How distributed databases keep replicas in sync over time, compensating for missed writes, dropped messages, and node failures. Contrasts foreground techniques (read repair, hinted handoff) with background reconciliation (Merkle-tree anti-entropy) and gossip-style epidemic protocols used for cluster metadata.

Ch. 12

Read Repair

When a coordinator reads from multiple replicas and detects divergence, it asynchronously writes the freshest version back to stale replicas. Cheap to bolt onto the read path, but only repairs keys that are actually being read — cold data drifts forever without anti-entropy.

Ch. 12

Digest Reads

To save bandwidth, the coordinator requests a full value from one replica and lightweight hash digests from the others, fetching the full value from a disagreeing replica only when needed. This is the standard optimization layered on top of read repair (used by Cassandra).

Ch. 12

Hinted Handoff

If a replica is unreachable at write time, another node temporarily stores a hint containing the write and replays it once the target recovers. Preserves durability without blocking the writer, but unbounded hint queues become a liability under long outages.

Ch. 12

Merkle Trees

A hash tree over a key range where each parent is the hash of its children. Two replicas can compare just the roots and recursively descend only into subtrees that differ, exchanging O(log n) hashes instead of all keys — the workhorse of anti-entropy.

Ch. 12

Gossip Dissemination

Each node periodically picks a random peer and exchanges state, so updates spread exponentially through the cluster in O(log N) rounds. Used for membership, failure detection, and schema rather than user data, because gossip is eventually consistent and bandwidth-cheap.

Ch. 12

Push, Pull, Push-Pull

In push, infected nodes proactively send updates; in pull, susceptible nodes ask peers for news; push-pull combines both and converges fastest. The choice trades convergence speed against wasted messages once most nodes already know the update.

Ch. 12

Plumtree

Push-Lazy-Push Multicast: builds a spanning tree over a gossip overlay so the bulk of payloads travels along tree edges while only small "I have message X" announcements gossip on the remaining links. Heals the tree when nodes fail and cuts redundant payload traffic dramatically.

Ch. 12 · Vocab
Anti-entropy
Background process reconciling replica divergence.
Entropy (replication)
Accumulated drift between replicas from lost messages or failed writes.
Hint
Locally persisted record of a write intended for an unreachable replica.
Digest
Hash summary of a value used to detect disagreement without transferring the data.
Ch. 12 · Vocab
Merkle tree
Binary hash tree enabling logarithmic-cost reconciliation.
Dotted version vector
Version vector extended with per-event dots for precise concurrency tracking.
Gossip protocol
Randomized peer-to-peer state exchange.
SWIM
Gossip-based failure detector with indirect pings and disseminated state events.
Ch. 12 · Vocab
Hybrid logical clock
Clock combining physical time with a logical counter for causality plus wall-clock proximity.
Ch. 12 · Quiz1 / 4

Multiple choice

Two Cassandra replicas have drifted apart over months because the keys involved are rarely read. The operator wants to reconcile them while exchanging the minimum amount of data over the wire. Which mechanism is designed for exactly this case?

Ch. 12 · Quiz2 / 4

Multiple choice

A gossip protocol is propagating a small membership update through a 10,000-node cluster. Pure "push" wastes messages because infected nodes keep pushing to peers that already know, while pure "pull" is slow at the start when almost no one has the update. Which variant converges fastest by combining both?

Ch. 12 · Quiz3 / 4

Spot the issue

A node went offline for three days during a Black Friday event. The coordinator dutifully accumulated hinted writes for it the entire time, and the hint queue grew to hundreds of gigabytes. When the node returned, replaying the hints overwhelmed it and the cluster destabilized. What's the underlying problem the chapter warns about?

Ch. 12 · Quiz4 / 4

Multiple choice

Plumtree improves on naive gossip by routing the bulk of payload traffic along a particular structure while still using gossip for resilience. What is that structure, and what travels on the non-tree links?

Ch. 13

Distributed Transactions

Generalizes ACID transactions across multiple nodes, where a single commit decision must be reached despite partial failures. Walks through atomic commitment protocols (2PC, 3PC), the systems that make distributed transactions practical (Spanner, Calvin, Percolator), and long-running alternatives like sagas.

Ch. 13

Atomic Commitment

The requirement that all participants either commit or abort, with no participant deciding differently. Unlike consensus, it must respect any single "abort" vote — which is what makes it strictly harder to make non-blocking under failures.

Ch. 13

Two-Phase Commit

The coordinator first asks all participants to prepare (durably promise they can commit); on unanimous yes it sends commit, otherwise abort. Correct under no-failure but blocking: if the coordinator crashes after some participants vote yes, they hold locks indefinitely.

Ch. 13

Three-Phase Commit

Inserts a pre-commit phase between prepare and commit so that a recovering participant can infer the outcome from peers without the coordinator. Solves blocking under fail-stop but is not safe under network partitions and is rarely used in practice.

Ch. 13

Calvin and Deterministic Transactions

Calvin pre-orders transactions through a replicated sequencing layer; all replicas then execute the same ordered batch deterministically. Eliminates 2PC entirely — but requires knowing read/write sets up front, which constrains transactional flexibility.

Ch. 13

Spanner and TrueTime

Google Spanner uses GPS- and atomic-clock-backed TrueTime, which returns a bounded interval [earliest, latest] for "now". Spanner assigns commit timestamps and waits out this uncertainty before releasing locks, giving externally consistent distributed transactions.

Ch. 13

Percolator

Google's snapshot-isolation layer over BigTable using per-row lock columns, a primary lock as the commit point, and a Timestamp Oracle for global ordering. Commit is effectively client-driven 2PC whose atomicity hinges on a single atomic write that releases the primary lock.

Ch. 13

Sagas

A long-lived business transaction is decomposed into local transactions, each with a compensating action. Sagas trade ACID atomicity for availability and progress, accepting that intermediate states are visible and that compensation, not rollback, restores invariants.

Ch. 13 · Vocab
Distributed transaction
Transaction whose operations span multiple nodes and must satisfy ACID as a whole.
Prepare phase
First phase of 2PC; participants durably vote yes/no.
In-doubt transaction
Participant state after voting yes but before learning the outcome — locks held.
Presumed abort / commit
2PC optimization treating missing logs as a default outcome.
Ch. 13 · Vocab
TrueTime
Spanner's clock API exposing uncertainty as an interval.
Commit wait
Stalling at commit until TrueTime guarantees the timestamp is in the past everywhere.
Timestamp oracle
Centralized service handing out monotonic timestamps (Percolator).
Snapshot isolation
Each transaction reads from a consistent start-time snapshot; commits if no write-write conflict.
Ch. 13 · Vocab
Compensating transaction
Operation that semantically undoes a previously committed saga step.
Ch. 13 · Quiz1 / 4

Spot the issue

A team runs 2PC across five participants. Three have voted yes and durably logged their prepare records when the coordinator's machine catches fire and won't be recoverable for hours. What's the practical problem this exposes?

Ch. 13 · Quiz2 / 4

Multiple choice

Google Spanner achieves externally consistent distributed transactions despite running across data centers. The key idea is that TrueTime exposes clock uncertainty as a bounded interval, and Spanner does what at commit time to make timestamps safe?

Ch. 13 · Quiz3 / 4

Multiple choice

A team is designing a long-running booking workflow (reserve flight, reserve hotel, charge card) that may take minutes and span independent services. Holding distributed locks for that long is unacceptable. Which pattern fits, and what does it give up?

Ch. 13 · Quiz4 / 4

Multiple choice

3PC inserts a pre-commit phase between prepare and commit to let participants infer the outcome from peers if the coordinator dies. Why is 3PC still rarely used in real systems?

Ch. 14

Consensus

The final chapter: how a group agrees on a single value (and, by extension, a sequence of values forming a replicated log) despite crashes, message loss, and reordering. Develops Paxos and its variants, Raft, ZAB, and Viewstamped Replication, all under the shadow of FLP.

Ch. 14

FLP Impossibility

In a purely asynchronous system with even one crash-faulty process, no deterministic consensus protocol can guarantee both safety and liveness. Real systems sidestep FLP via partial synchrony, randomization, or failure detectors — every working consensus protocol relies on at least one.

Ch. 14

Basic Paxos

A proposer runs a two-phase ballot: phase 1 (prepare/promise) acquires a quorum's promise not to accept lower-numbered proposals and learns any previously accepted value; phase 2 (accept/accepted) gets a quorum to accept a value. Safety is unconditional; liveness can be lost to dueling proposers.

Ch. 14

Multi-Paxos

Amortizes Paxos over a log by electing a stable leader who skips phase 1 for subsequent slots and runs only phase 2 per command. This is what production "Paxos" deployments actually mean and is structurally close to Raft.

Ch. 14

Fast Paxos

Lets clients send proposals directly to acceptors, skipping the leader on the fast path. The cost is a larger fast quorum (3f+1 acceptors to tolerate f faults instead of 2f+1) and a recovery path on collisions.

Ch. 14

EPaxos

Egalitarian Paxos: any replica can commit non-interfering commands in one round-trip; only conflicting commands need an extra dependency-resolution round. Improves latency in geo-distributed deployments and avoids the single-leader bottleneck entirely.

Ch. 14

Raft

A consensus protocol designed for understandability, decomposed into leader election, log replication, and safety. The leader is the sole point of log appends, terms monotonically increase, and a strict log-matching property plus restricted leader election guarantee committed entries are never lost.

Ch. 14

ZAB

ZooKeeper Atomic Broadcast: a leader assigns monotonically increasing zxids to proposals and broadcasts them in order; after leader failure, a recovery phase synchronizes followers to the new leader's history. Provides primary-order atomic broadcast — a totally ordered, FIFO-per-client log.

Ch. 14

Viewstamped Replication

One of the earliest leader-based consensus protocols (Oki & Liskov), organized around views, each with a designated primary. A view change protocol elects a new primary and transfers state when the previous one is suspected. Structurally equivalent to Paxos and a direct ancestor of Raft.

Ch. 14 · Vocab
Consensus
Agreement on a single value satisfying agreement, validity, and termination.
Quorum
A subset large enough that any two such subsets intersect (typically a majority).
Proposer / acceptor / learner
The three Paxos roles, usually co-located in one process.
Ballot number
Unique, monotonically increasing Paxos proposal ID.
Ch. 14 · Vocab
Term
Raft's analog of a ballot; a logical period with at most one leader.
Leader election
Sub-protocol choosing a node to coordinate within a term/view.
Log replication
Leader propagates ordered command entries to followers.
Replicated state machine
Abstraction that consensus implements — same inputs, same state, everywhere.
Ch. 14 · Vocab
View change
VR/ZAB recovery procedure installing a new primary and reconciling state.
Partial synchrony
Eventually-bounded delay assumption that recovers liveness despite FLP.
Ch. 14 · Quiz1 / 5

Multiple choice

You're explaining to a junior engineer why every working consensus protocol — Paxos, Raft, ZAB, VR — relies on at least one of: timeouts, randomization, or a failure detector. Which theoretical result are you invoking?

Ch. 14 · Quiz2 / 5

Multiple choice

A distributed log service is built on Multi-Paxos. After electing a stable leader, each subsequent log entry is committed with only the second phase of Paxos. What did the leader save by being stable, and why is this safe?

Ch. 14 · Quiz3 / 5

Spot the issue

A team wants the latency benefit of Fast Paxos in a small 3-node cluster (f=1, so 2f+1=3 acceptors). They're confused because their fast path keeps failing. What constraint of Fast Paxos are they missing?

Ch. 14 · Quiz4 / 5

Multiple choice

Raft was designed for understandability and decomposes consensus into three sub-problems. Which decomposition is correct?

Ch. 14 · Quiz5 / 5

True / False

Two proposers in basic Paxos can keep pre-empting each other's ballots forever ("dueling proposers"), and when this happens the protocol's safety guarantee — that no two acceptors decide different values — is violated.

Key Takeaways

01

Every storage engine boils down to one choice — mutate in place (B-tree) or append-only (LSM) — and that single choice shapes its read/write/space amplification profile.

02

On-disk B-trees survive because high fanout collapses height to 3-4 levels, so a billion-row lookup costs only a handful of page reads, but only if the slotted-page layout, free-space management, and structural-modification rules are correct.

03

ACID is implemented by a tight loop between buffer manager, WAL, lock manager, and recovery — and ARIES (steal/no-force + UNDO/REDO + CLRs) is still the canonical recipe decades later.

04

Distributed databases live under FLP, CAP, and the fallacies of distributed computing — you cannot have safety, liveness, and full async tolerance at once, so every real system trades one for partial synchrony, timeouts, or quorums.

05

Replication is a spectrum of consistency models, not a binary — linearizability, sequential, causal, and eventual sit on a ladder, and quorums (R+W>N), read repair, hinted handoff, and vector clocks are the tools that climb it.

06

Consensus — basic Paxos, Multi-Paxos, Raft, ZAB, VR — is the single primitive everything else (leader election, atomic broadcast, distributed transactions, replicated state machines) reduces to.