Byzantine Fault Tolerance (BFT)

Byzantine fault tolerance (BFT) is defined as the degree of certainty that a distributed computer system has about the status of a node when there is imperfect information on whether the node is correct or not. With Ethereum Proof of Stake, BFT is the network’s ability to defend against a “bad” node that is messaging the network to send funds to the wrong place while allowing other nodes to reach consensus and form a new block. The development of Casper is centered on the question of how fat or slim to make the Ethereum proof-of-stake BFT layer.

Proof-of-work consensus algorithms prevent double-spending with a solution to a nothing-at-stake problem called the “Byzantine Generals Problem” that requires miners to expend enormous amounts of electricity and computer processing power. Casper tries to solve the Byzantine Generals Problem with a proof-of-stake algorithm that replaces Ethereum’s proof-of-work algorithm, thus providing transactional finality using much less electricity and equipment.

The term “Byzantine Generals Problem” is derived from a hypothetical situation where a group of generals from the Byzantine Empire, each commanding a battalion, must decide whether to attack a certain city when the Byzantine army can only win such a battle if all battalions attack in unison, but where messages between the generals may be unreliable. The city’s huge army can only be defeated if the generals and their battalions all attack at once.

Each general and his battalion are far apart from every other general and their battalions, with no central authority, making a coordinated attack possible only if all the generals agree on the validity of a single message, carried by messengers, instructing them to attack. If any number of generals decide to order their battalions to attack the city, and only one general orders a retreat or hesitates to attack either because he received a “bad” message or he is traitorous, there will be an uncoordinated attack that results in the defeat of the attacking Byzantine army.

With Ethereum, the Byzantine Generals Problem is an agreement (consensus) problem solved by Casper PoS. Nodes must reach consensus regarding the addition of the correct block to the blockchain to avoid the potential for double spend. A “Byzantine Fault” in this context is a node presenting different network symptoms to different nodes and causing some nodes to follow non-protocol behavior. A “Byzantine Failure” of the Ethereum network would be a network breakdown due to a Byzantine fault where a node presents different symptoms to different observers, thus appearing both failed and functioning to Ethereum’s failure-detection system.

How does a BFT-style proof-of-stake algorithm work?

To achieve BFT-style PoS, Casper specifies two kinds of rules:
1. Finality rules that determine when the hash is considered finalized.
2. Slashing rules that determine when a validator node, identified by its stable Ethereum address, is deemed beyond a reasonable doubt to have misbehaved (e.g. voting for multiple conflicting blocks at the same time), with a violation resulting in the node’s deposit being deleted.

Proof-of-work algorithms require a block winner to find a nonce for the appropriate hash target. This is computationally intensive and time-consuming, but the process of verifying the result is very simple. BFT-style proof-of-stake algorithms, such as Casper, reward nodes who “vote” correctly on blocks, but verifying the result is difficult. Block validation is done through multiple rounds of voting where every node sends votes for some specific block during each round. All honest and online nodes permanently agree on whether or not any given block is “canonical” (part of the chain).

On a network that has the potential for Byzantine failures to occur, it is difficult for other nodes to declare a node “bad” and shut it out of the network because they must first reach consensus on which node is bad. Imagine that Ethereum validator nodes are the generals, and their network links are the messengers. Byzantine fault tolerance is achieved if “good” nodes (“loyal generals”) come to a majority agreement, where a default null vote value is given to missing messages and messages from bad nodes, and if <Null> votes are in the majority, a pre-assigned default strategy of no consensus (“retreat”) is used.

BFT defends against a bad node when all nodes cannot agree on which node is bad, but where such an agreement must be made for the network to continue functioning. A fork created by >50% of good nodes scores higher than any fork created by the remaining potentially-bad nodes, but nodes cannot be certain that a fork created with 51% participation won’t be reverted because they cannot know whether some of the nodes are Byzantine. Therefore, Casper nodes will only consider a block as finalized if it has the participation of a supermajority of nodes.

Byzantine fault tolerance is 50% in a synchronous PoW network and 33% in an partially-synchronous or asynchronous PoS network. The 50% fault tolerance for PoW assumes zero network latency, with fault tolerances of around 46.0% for Ethereum and 49.5% for Bitcoin being observed under actual conditions. Fault tolerance goes down to 33% when PoS network latency equals block time, so the solution works as long as the number of bad nodes does not equal or exceed one-third of all nodes.