SIGMOD 2020 (Industry Track) · 2020 ·
My reading notes
This is the canonical industry reference for the geo-distributed SQL layer Arun's MIRROR platform serves on (K8s across regions), and its hybrid-logical-clock plus Raft-per-range design is directly relevant to running stateful, consistency-sensitive services across data centers. It is also a strong system-design-interview anchor for distributed-systems and ML-systems roles.
CockroachDB (CRDB) is a shared-nothing, horizontally scalable relational database built to serve global OLTP workloads while keeping strong consistency and surviving disk, node, availability-zone, and full-region failures. The motivating scenario is a company with users across continents that must satisfy data-residency rules (e.g., GDPR domiciling), keep data physically close to the users touching it, stay available through regional outages, and still expose familiar SQL with serializable transactions. CRDB's stated differentiator is that it achieves all of this on off-the-shelf hardware with ordinary NTP-grade clocks, rather than the GPS/atomic-clock TrueTime infrastructure that Google Spanner depends on.
Architecturally the system is layered: a PostgreSQL-dialect SQL layer with a Cascades-style cost-based optimizer and a distributed (Volcano-style) execution engine on top, a transactional key-value layer, a distribution layer that maps everything into one monolithic ordered keyspace, a consensus-replication layer, and a local RocksDB-backed storage engine. Data is range-partitioned into ~64 MiB contiguous "Ranges" that split, merge, and rebalance automatically (including load-based splits to break up hotspots); each Range is by default replicated three ways across distinct nodes via its own Raft consensus group, with one replica holding a lease to serve fast local reads.
Consistency rests on two ideas. First, a hybrid-logical clock (HLC) per node combines coarse physical time with Lamport logical counters to track causality, enforce disjoint Range leases, and stay monotonic across restarts. Second, each transaction carries an uncertainty interval of width equal to the configured max clock offset (default 500 ms); a transaction that sees a value inside that window performs an uncertainty restart, which together with lease safeguards yields single-key linearizability and serializable isolation even under clock skew (skew beyond bounds can only cause stale reads, and a node that drifts past 80% of the bound relative to its peers self-terminates). A Parallel Commits protocol, formally model-checked in TLA+, lets a transaction commit in a single round of consensus by staging its status concurrently with replicating its writes.
The evaluation shows near-linear TPC-C scaling (about 12.5K tpmC at 10K warehouses up to ~124K tpmC at 100K warehouses, at near-maximum efficiency), high efficiency relative to published Amazon Aurora numbers, and generally higher throughput and lower tail latency than Cloud Spanner on YCSB (except update-heavy, high-contention Workload A). Geo-partitioning and "follow-the-workload" leaseholder placement keep latency low while satisfying residency constraints. A candid lessons-learned section covers the realities of running hundreds of thousands of Raft groups (heartbeat coalescing, pausing idle groups, adopting Joint Consensus for atomic membership changes) and the difficulty of removing SERIALIZABLE-related retries for application developers, which led them to drop standalone snapshot isolation and alias it to serializable.