Readings

A good starting point is Andy Pavlo's survey paper "What's Really New with NewSQL?", which also covers the history of SQL and NoSQL databases. You might also find some lectures from his database internals class useful, especially if you need a crash course on a specific topic like query planning or concurrency control.

Academic papers on Spanner/F1 (and related predecessors) are also a valuable resource, since Spanner was the original inspiration for CockroachDB:

The Spanner paper itself.
The F1 system builds more features on top of Spanner, some of which might make it into CockroachDB.
Online Asynchronous Schema Change in F1 describes how to evolve schemas in F1.
The Spanner paper assumes in many places an understanding of Bigtable and Colossus, so you might want to read up on those (the Bigtable paper at the very least) before attempting to understand Spanner.

Once you have the high-level picture, you might want to focus on some specific building blocks of CockroachDB in more detail. Here are some subtopics, although this section could use more expansion.

Distributed systems lower levels:

Raft provides the lowest-level building block on which we distribute work: establishing consensus.
- The secret lives of data explains Raft through an easy-to-understand animation.
- The extended version of the Raft paper is a good next step.
- Raft's GitHub page has an in-browser cluster that you can play around with, as well as links to publications, talks, and more.
- Many details that are omitted from the Raft paper can be found in Diego Ongaro's PhD thesis.
We use gossip for efficiently transmitting information as well.
- The Promise, and Limitations, of Gossip Protocols is a good survey of gossip. Not everything can (or should) be gossiped, and this paper covers these tradeoffs.
Concurrency control is best covered by Andy Pavlo's 15-721 lectures (lectures 3-5).
Jeff Hodges' notes on distributed systems from 2013.

Database internals:

Trying to understand a complex database concept that is new to you can be very difficult, particularly in a distributed setting. The MySQL documentation can be a good place to start, so that you first get a good grasp of the single-machine setting before extending it to the distributed setting:
- The InnoDB storage engine is the underlying building block of MySQL.
- The MySQL optimization reference is useful for understanding indexing and other query optimizations in the single-machine setting.

Query optimization:

Other database systems:

Apache Calcite. Some slides here
Spark SQL
Apache Trafodion
Vertica (evolved from C-Store)
VoltDB (evolved from H-Store)
ql: an embedded database written in Go, potentially interesting for its implementation details
TiDB: SQL over a distributed transactional KV store, written in Go

Other readings (perhaps non-DB specific):

MTS sketch for cardinality estimation.
