Adam M. Costello, advisors
Randy H. Katz (current) and Steven McCanne
Computer Science Division
University of California at Berkeley
One typical application of peer-to-peer architectures is to enable end users, with their large number of small machines, to provide a large-scale service to themselves by pooling their resources. In such a setting fault tolerance is more important than ever, because there is no paid professional staff to monitor the system and fix problems. At any time some of the nodes will be unreliable, buggy, or flakey in unforseen ways, and the system needs to function in spite of this. One technique that peer-to-peer protocols can use to enhance their fault tolerance is randomized forwarding. Tangle is a distributed hash table that uses this technique. Experiments demonstrate the positive impact of randomized forwarding on Tangle's fault tolerance.
A hash table is a method for storing a set of (key,value) pairs and supporting the operations get(key) and put(key,value). A hash function maps each key to a hash value, where the hash values are uniformly distributed over the space of all possible hash values. The hash values determine where the pairs are stored in memory, allowing the get() and put() operations to be implemented cheaply.
A distributed hash table is a hash table in which the pairs are not all kept on a single node, but are spread across many peer nodes, so that the total table can be much larger than any single node could accomodate. Henceforth we will use the term label instead of hash value, because the hashing of the keys will typically be performed at a higher layer; the distributed hash table will see the labels, not the keys, and will provide the operations get(label) and put(label,value).
In order to allow the number of nodes to scale up, we cannot assume that every node knows about the existence of every other node. Therefore, mapping a label to the corresponding location will involve the cooperation of many nodes, and hence the forwarding of messages between nodes. The primative operation is send(message,label), which conveys the message to whichever node is responsible for that label. This application-level routing layer (not to be confused with the network-level routing layer) lies below the hash table layer, which calls the routing layer to forward “get” and “put” requests to the proper nodes.
The routing layer performs two tasks. It constructs and maintains routing tables (the partial information that nodes have about other nodes) as nodes join and leave the system; and it uses the routing tables to forward messages to their proper destinations.
In Tangle, routing is based on label prefixes. Node addresses are hashed to labels, and send(message,label) attempts to forward the message to the node whose label shares the longest common prefix with the given label. Each node tries to maintain enough information in its routing table so that it always knows a random sample of nodes that are a closer match (if any closer matches exist). The next hop is chosen randomly from among the known closer matches, and the pool of closer matches is continually being replaced by newly discovered random peers. More details are given in the protocol section.
I have written a real implementation of the Tangle protocol in C, on top of abstract transport and timer interfaces. Those interfaces are meant to be glued to the real transport and timer facilities provided by the operating system (like the socket interface), but I did not write that glue. For performing experiments, I wrote a simulated network and clock (also in C), so that all the nodes run on a single machine in a single thread. Because communication in Tangle is connectionless, intermittent, and very low-bandwidth, there is very little to be gained by simulating the network topology. It is sufficient to simulate realistic end-to-end network delays, which is much simpler. With a simple simulated network and with everything written in C, the experiments were able to scale up to 32768 nodes, running at about 40% real-time on one 864 MHz Pentium III, using about 85% of the 512 MB of memory.
The experiments focus on the routing layer, measuring the success rate and completion time of send operations. Each send operation retransmits the message at an average rate of twice per second, until a successful end-to-end acknowledgement is received or until 20 seconds have elapsed. The hash table layer would normally provide additional fault tolerance by storing the (label,value) pairs at several random nodes whose labels are close to the target label, but that additional fault tolerance is mostly due to the replication, not the randomization, and therefore is not studied. The hash table layer is not used for the experiments.
Each send operation is either purely deterministic or a hybrid of determistic and randomized. For a purely deterministic send, the initial message and all the retransmissions are forwarded along the same path, the one estimated to be the fastest (each node independently estimates the best next hop). For a hybrid send, the initial message is forwarded along the estimated fastest path (same as before), but the retransmissions take random paths (each node independently chooses a random next hop that moderately favors faster next hops).
Each experiment injects a different kind of fault, including packet losses, route losses, software bugs, and nodes joining and leaving the system. See the results section for details of the individual experiments and the resulting graphs. The graphs show that deterministic forwarding yields a mix of very good and very bad performance, and that hybrid deterministic/randomized forwarding preserves the good performance while greatly improving the bad performance.
Randomized forwarding is a simple mechanism that yields a substantial benefit for very little cost. The results are consistent across a wide range of contrived faults, which suggests that randomized forwarding would be effective against many unforseen faults.
Distributed hash tables can generally be viewed as having two layers: a routing layer that provides the send(message,label) operation, and a hash table layer that builds the get(label) and put(label,value) operations on top of send(message,label) (although in a real protocol send() is likely to be built in to get() and put(), not a separate interface).
This section summarizes the routing layer of the Tangle protocol. The full story (including both layers) is contained in the Tangle specification.
In Tangle the routing layer is further divided into two separate activities: peer discovery (constructing and maintaining the routing tables), and forwarding (using the routing tables to relay messages). Other distributed hash tables typically combine these two activities; a node sends a message to its own label, and uses the traversed nodes to populate its routing table. That's a clever trick, but the circular dependence between peer discovery and forwarding is difficult to reason about, especially if all the nodes join the system concurrently.
In Tangle, the peer discovery process makes no use of forwarding; instead it uses an adaptation of the Name-Dropper protocol. In Name-Dropper each node starts out knowing a few peers and continually chooses a peer at random and sends its entire list of peers, and soon every node knows every other node. In Tangle nodes want not complete knowledge of their peers but merely a random sample of them; therefore they include only a few peers (chosen randomly) in each sent list, and they forget peers (chosen randomly) to keep their peer sets bounded. Starting from any initial connected graph, the known peer sets soon become random.
Another difference from Name-Dropper is that a Tangle node organizes its peers into two classes: unconfirmed peers (whom it has heard of but not heard from) and confirmed peers (whom it has heard from first-hand). The node continually chooses an unconfirmed peer at random and handshakes with it. The two nodes add each other to their confirmed sets, and each adds a few of the other's confirmed peers to its own unconfirmed set.
Actually, the confirmed and unconfirmed sets are partitioned into levels, one for each prefix length. At level k are peers with labels whose longest common prefix (compared to the node's own label) has length k. Labels in Tangle are bit strings. A node belongs to many communities: the community of all nodes, the community of nodes that share a 1-bit prefix with it, the community of nodes that share a 2-bit prefix with it, and so on. The node effectively participates in peer discovery with each of those communities, although the process is combined into a single protocol. The result is a uniform arrival rate at all levels.
The levels of confirmed peers constitute the routing table, and are used for forwarding. When a node receives a message destined for a particular label, it compares that label to its own label and computes the longest common prefix. Peers at that level have labels that differ from the node's label in the next bit, which means they agree with the message's target label in the next bit, which means they are closer matches. If any confirmed peers are known at that level, one is chosen at random, and the message is forwarded to that peer. Since at least one bit is gained per hop, and the node labels are uniformly distributed among the N nodes, the number of hops is O(log N).
tangle-source.tar.gz (7 MB, version 0.0.0, released 2003-Aug-11-Mon)
Jeff Garzik is continuing development of a Tangle router daemon, adding the glue code that was missing from my source.
Tangle is just one of several scalable distributed hash table protocols all devised independently at about the same time. (Apparently, it was an idea whose time had come.) Other distributed hash tables are CAN (Content-Addressable Network), Chord, Pastry, and Tapestry. The theory underlying the scalability of all these protocols has been called the Small-World Phenomenon.