Petr Kuznetsov - Large-Scale Byzantine Fault Tolerance: Prevention and Detection
Speaker(s) : Petr KuznetsovDealing with arbitrary faults in large-scale computing systems is an interesting but difficult task. In this talk, we discuss two ways of tackling this task: prevention and detection.
Byzantine-fault-tolerant replication (BFT) is a popular prevention-oriented technique. It hides the effect of arbitrary faults, assuming that the faults are independent and, thus, can be bounded. This assumption makes sense for simple hardware failures but does not seem to hold water for software bugs or security attacks. Besides, the assumption does not scale with the number of groups of replicas running BFT: the probability of at least one group violates the assumption increases rapidly as the number of groups increases. We address these problems with a simple modification to Castro and Liskov's BFT replication that allows for arbitrary choice of n (number of replicas) and f (failure threshold). The modified BFT protocol does not lose safety even when failures are many, performs very well when failures are few, but may lose liveness in the intermediate cases. The price to pay is a more restrictive liveness requirement, and we discuss the ways how to obviate this problem.
An alternative approach is to detect rather than mask failures. We advocate practicality of this approach by presenting PeerReview, a system that provides accountability in distributed systems. The system ensures that Byzantine faults whose effects are observed by a correct node are eventually detected and irrefutably linked to a faulty node. At the same time, the system ensures that a correct node can always defend itself against false accusations.