Séminaire REGAL

Strong consistency with low latency random reads & writes on a distributed database: Apache HBase choices and challenges

Пятница, апрель 17, 2015
Nicolas Liochon

HBase is for random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google's Bigtable. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.

Bigtable is used by Google for gmail, Google docs and other products. They store 2.5 exabytes of data, and do 600 million operations per second. HBase is used by Facebook for its messaging (1 million ops/s), Bloomberg, Xiaomi and others. Cluster size goes from 15 to 2000 nodes.

I will cover mostly how HBase is available (not as in CAP) and strongly consistent even in case of failure, and the some of the challenges in multi sites deployments. HBase uses replicated Write-Ahead-Logs, failure detectors and replay the WAL in parallel on failure. Failure detectors bring the strong consistency property, but impact negatively the latency (interestingly, Chandra is a coauthor of many papers on failure detectors and also a coauthor of the Bigtable paper at Google). I will introduce some tricks that HBase added to minimize the Mean-Time-To-Recover. HBase, as Bigtable, is eventually consistent between datacenters. This breaks the initial design goal of HBase, and introduces a dataloss risk. Synchronous replication is not that easy: Google initially added another component (Megastore) on top of Bigtable, then developed Spanner. Spanner supports natively synchronous replication. It seems that adding cross site synchronous replication has been tried at all the possible layers (disk, file system, distributed file system, distributed database, 'meta database', application). Is there a best-of-class approach that one should just copy? Here I will just present the existing asynchronous mechanism in HBase.

Более подробно …
Marc.Shapiro (at) nulllip6.fr