Tangle

Results

See the experiments section for an overview. This page presents the details of the experiments and the graphs of the results.

Except where otherwise noted, the experiments use the following setup. There are 32768 nodes, initially organized as a linear chain (every node's routing table contains at most one peer, and every node appears in at most one peer's routing table). The network endpoints associated with each node are assigned at random (there is no correlation between the initial logical graph and the network delays). While the simulation runs, send operations are initiated at random, from any node to the label of any other node. At first, the send operations all fail, because the routing tables are nearly empty, but as the system runs the routing tables become full and increasingly random. The system settles into an equilibrium after about 700 to 1100 seconds (depending on the faults that are injected), but the system is warmed up for 4000 seconds, just to be safe. For dynamic faults (like many nodes suddenly joining the system at once), send operations are performed over time (20 per second) as the system reacts, and the response times are recorded. For steady-state faults (like persistent packet loss rates), the routing tables are frozen and 20000 response times are collected. Freezing the routing tables is just a trick for taking many different measurements of the same conditions. Each experiment is run twenty times and the data is aggregated.
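
For concreteness, the procedure can be sketched in Python as follows. The simulator interface here (make_sim, advance, freeze_routing_tables, random_nodes, send, now) is entirely hypothetical, a stand-in for the real implementation; only the shape of the procedure comes from the description above, and sim.send is assumed to return the measured response time.

    N_NODES = 32768     # initial population, organized as a linear chain
    WARMUP  = 4000      # seconds; well past the 700-1100 s settling time
    RUNS    = 20        # each experiment is repeated and aggregated

    def steady_state_experiment(make_sim, inject_fault, n_samples=20000):
        """Freeze the routing tables, then sample many response times
        under the same conditions."""
        samples = []
        for _ in range(RUNS):
            sim = make_sim(N_NODES)             # hypothetical simulator
            inject_fault(sim)                   # e.g. a packet-loss rate
            sim.advance(WARMUP)                 # let the tables fill in
            sim.freeze_routing_tables()
            for _ in range(n_samples):
                src, dst = sim.random_nodes(2)
                samples.append(sim.send(src, dst.label, timeout=20.0))
        return samples

    def dynamic_experiment(make_sim, fault_event, watch=2000):
        """Record response times at 20 sends per second while the
        system reacts to a one-time event."""
        results = []
        for _ in range(RUNS):
            sim = make_sim(N_NODES)
            sim.advance(WARMUP)
            fault_event(sim)                    # e.g. a mass join at t=4000
            for _ in range(watch * 20):
                sim.advance(1 / 20)             # 20 send operations/second
                src, dst = sim.random_nodes(2)
                results.append((sim.now, sim.send(src, dst.label, timeout=20.0)))
        return results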

Steady-state faults

For each steady-state fault, the distribution of response times is graphed for the two forwarding styles (deterministic and hybrid). Response times greater than 20 seconds are not shown, because the implementation gives up after 20 seconds. In order to reveal the details of both ends of the curves, each graph is shown twice, first with the top portion omitted, then with the left portion omitted.
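
For illustration, such a pair of views can be produced with a short matplotlib script like the one below; the exact cut points are arbitrary choices for this sketch, not taken from the real graphs.

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_response_cdfs(deterministic, hybrid, cutoff=20.0):
        """Plot the distribution of response times twice: once with the
        top portion omitted (detail of the head) and once with the left
        portion omitted (detail of the tail)."""
        fig, (head, tail) = plt.subplots(1, 2, figsize=(10, 4))
        for ax in (head, tail):
            for label, data in [("deterministic", deterministic),
                                ("hybrid", hybrid)]:
                xs = np.sort([t for t in data if t <= cutoff])
                ys = np.arange(1, len(xs) + 1) / len(data)  # completed fraction
                ax.step(xs, ys, where="post", label=label)
            ax.set_xlabel("response time (s)")
            ax.legend()
        head.set_ylabel("fraction of send operations completed")
        head.set_ylim(0.0, 0.95)    # omit the top portion
        tail.set_xlim(1.0, cutoff)  # omit the left portion
        plt.show()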

No faults

This experiment injects no faults at all; every packet reaches its destination, and every node is present from the beginning and performs the protocol correctly. The purpose is to test whether hybrid forwarding incurs any cost when everything is working perfectly.

[no-faults graph] [no-faults graph (zoom)]

With no faults, there is almost no difference between deterministic and hybrid forwarding, because the first packet sent almost always reaches its target. In about one of every thousand send operations the packet does not reach its target, because the routing tables are never 100% complete. Tangle is a soft-state protocol that continually adds new entries to the routing tables and discards old entries, so the system is always in a state of flux. This is not considered a problem in Tangle because retransmitting packets along different paths takes care of it, as the hybrid curve shows. Of course, retransmitting the packet along the same path would not help, which is why the deterministic curve never reaches 100%.
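
The difference between the two styles comes down to how a retransmission chooses its next hop, which can be sketched as follows. The (peer, closeness) scoring is a stand-in for Tangle's real routing metric, which is not described on this page.

    import random

    def next_hop(scored_peers, attempt, hybrid):
        """Choose a next hop from (peer, closeness) pairs for the current
        target; higher closeness means a better match.  Deterministic
        forwarding always takes the best entry, so a retransmission
        retraces the same path.  Hybrid forwarding takes the best entry
        on the first attempt but a random alternative on each retry, so
        retransmissions explore different paths."""
        ranked = sorted(scored_peers, key=lambda p: p[1], reverse=True)
        if not hybrid or attempt == 0 or len(ranked) == 1:
            return ranked[0][0]
        return random.choice(ranked[1:])[0]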

Packet losses

These three experiments inject packet losses at various points. In the first experiment, it is the network that is lossy: every packet has a 10% chance of being dropped. In the second experiment, the losses are concentrated at receivers: 20% of the nodes are lossy receivers, where each arriving packet has a 50% chance of being dropped. In the third experiment, the losses are concentrated at senders: 20% of the nodes are lossy senders, where each outgoing packet has a 50% chance of being dropped.
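
Read literally, the three loss models amount to a per-packet drop test like this (self-contained; the fractions are the ones given above):

    import random

    N = 32768
    lossy_nodes = set(random.sample(range(N), int(0.20 * N)))  # 20% of nodes

    def delivered(src, dst, mode):
        """Return True if a packet from src to dst survives the
        injected fault."""
        if mode == "lossy-network":    # every packet: 10% drop chance
            return random.random() >= 0.10
        if mode == "lossy-receivers":  # 50% drop at a lossy receiver
            return dst not in lossy_nodes or random.random() >= 0.50
        if mode == "lossy-senders":    # 50% drop at a lossy sender
            return src not in lossy_nodes or random.random() >= 0.50
        return True                    # no fault injected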

Lossy network:
[lossy-network graph] [lossy-network graph (zoom)]

Lossy receivers:
[lossy-receivers graph] [lossy-receivers graph (zoom)]

Lossy senders:
[lossy-senders graph] [lossy-senders graph (zoom)]

In all three experiments, hybrid forwarding incurs a small performance penalty (around 20-30%) over part of the common case, but it reduces the occurrence of failure by more than an order of magnitude. For deterministic forwarding, the failure rate is about 1/500 for the unconcentrated (network) losses, and about 1/100 for the concentrated (sender and receiver) losses.

Route losses

These two experiments inject route losses. Of the N^2 end-to-end network-layer routes connecting the N nodes, 19% are broken, so that packets trying to traverse them are lost. In the first experiment, the broken routes cover all the sources, but are concentrated at 20% of the destinations, each of which is invisible to (unreachable by) 95% of its peers (hence 0.20 * 0.95 = 19% of all routes). In the second experiment, the broken routes cover all the destinations, but are concentrated at 20% of the sources, each of which is blind to (unable to reach) 95% of its peers.
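
The two patterns of broken routes can be sketched as below. Exactly which 95% of the pairs are cut is assumed here to be a fixed pseudo-random choice per route; the page does not say how the real experiments pick them.

    import random

    N = 32768
    affected = set(random.sample(range(N), int(0.20 * N)))  # 20% of nodes

    def route_broken(src, dst, mode):
        """True if the network-layer route from src to dst is broken.
        Either way, 0.20 * 0.95 = 19% of the N^2 routes are broken."""
        cut = hash((src, dst)) % 100 < 95    # a fixed 95% of the pairs
        if mode == "invisible-peers":   # dst unreachable by 95% of sources
            return dst in affected and cut
        if mode == "blind-peers":       # src unable to reach 95% of dsts
            return src in affected and cut
        return False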

These experiments omit send operations with invisible senders, invisible targets, or blind targets, because such operations would usually fail even if the sender already knew the address of the target and sent the query directly there. Also, blind and invisible nodes are not essential links in the initial chain, because that would provide too many opportunities for the population to become partitioned very early in the run. Instead, some of the normal nodes are initially known by two peers, one of which is normal.

Invisible peers:
[invisible-peers graph] [invisible-peers graph (zoom)]

Blind peers:
[blind-peers graph] [blind-peers graph (zoom)]

In both experiments, hybrid forwarding incurs no penalty at all, and reduces the occurrence of failure by more than four orders of magnitude. For deterministic forwarding, the failure rate is about 1/30 for invisible peers and 1/10 for blind peers.

Software bugs

This experiment injects a software bug into 10% of the nodes, causing them to silently fail to utilize their routing tables to forward messages, although they continue to perform the rest of the protocol (constructing and maintaining the routing tables) perfectly.
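
The injected bug amounts to something like the following; the node and packet interface is hypothetical.

    def handle_packet(node, packet):
        """A buggy node keeps constructing and maintaining its routing
        table (the rest of the protocol runs normally) but silently
        declines to use the table to forward messages."""
        node.update_routing_table(packet)      # maintenance still works
        if node.buggy:
            return                             # the bug: never forward
        peer = node.best_match(packet.target)  # hypothetical table lookup
        node.send(peer, packet)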

[buggy-peers graph] [buggy-peers graph (zoom)]

Hybrid forwarding incurs no penalty at all, and reduces the occurrence of failure by more than four orders of magnitude. For deterministic forwarding, the failure rate is about 1/3.

Membership churn

This experiment injects membership churn; that is, a population of transient nodes, each of which joins the system, stays for a short time (about 500 seconds, just long enough to become well-connected) and then leaves the system, never to return. The rate of arrivals and departures is just enough to maintain an average transient population of 24576, which outnumbers the permanent population of 8192 by a ratio of 3:1. The total population at any moment is about 32768, as usual. Every send operation is from a permanent node to the label of another permanent node, so that the experiment measures not the performance of the transient population but rather its negative impact on the permanent population.
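
One detail worth making explicit: the join/leave rate follows from the population and residence time by Little's law (L = lambda * W).

    # Little's law: to sustain L = 24576 transient nodes that each stay
    # W = 500 seconds, nodes must arrive (and depart) at lambda = L / W.
    L = 3 * 8192        # average transient population (24576 nodes)
    W = 500             # seconds each transient node stays
    print(L / W)        # about 49 joins/s, matched by 49 leaves/s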

[churn graph] [churn graph (zoom)]

Hybrid forwarding incurs no penalty at all, and reduces the occurrence of failure by more than two orders of magnitude. For deterministic forwarding, the failure rate is more than 1/2.

Dynamic faults

For each dynamic fault, the success rate (the percentage of send operations that complete within 20 seconds) is graphed versus time for the two forwarding styles. The dynamic event happens at time 4000 seconds. Note that the axes do not generally start at zero.

Membership burst

In these two experiments, half of the 32768 nodes are permanent; the other half are used to perturb the system. In the first experiment, the non-permanent nodes are present from the beginning and leave en masse. In the second experiment, the non-permanent nodes are not present at the beginning; later they join en masse. As in the membership churn experiment, every send operation is from a permanent node to the label of another permanent node.

Mass leave:
[mass-leave graph]

Mass join:
[mass-join graph]

The mass leave is very disruptive for both forwarding styles, but there is a 30-40% spread between their failure rates during most of the recovery.

The mass join is less disruptive than the mass leave, causing a failure rate of up to about 20% for deterministic forwarding, and only about 0.2% for hybrid forwarding.

Scaling

This experiment injects no faults, but varies the number of nodes in powers of two, and looks at the steady-state mean response time and the settling time. The steady-state mean response time is the mean response time during the last 1000 seconds of a 4000-second run. The settling time is the elapsed time until the filtered (smoothed) response time falls to within a factor of two of the steady-state mean. The following graph illustrates the metrics for a single run.
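
The two metrics can be computed from a single run's trace as sketched below. The exponential filter and its smoothing constant are assumptions; the page does not specify how the response times are smoothed.

    def scaling_metrics(trace, run_length=4000, window=1000, alpha=0.05):
        """trace: (time, response_time) pairs from one run.  Returns the
        steady-state mean (over the last `window` seconds of the run)
        and the settling time (when the smoothed response time first
        falls to within a factor of two of that mean)."""
        tail = [r for t, r in trace if t >= run_length - window]
        steady = sum(tail) / len(tail)
        smooth = trace[0][1]
        for t, r in trace:
            smooth = alpha * r + (1 - alpha) * smooth
            if smooth <= 2 * steady:
                return steady, t
        return steady, None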

[single-run graph]

The next pair of graphs shows how the two metrics change with the number of nodes.

Settling time:
[settling-time scaling graph]

Steady-state mean response time:
[response-time scaling graph]

The settling time scales well, growing only a little faster than log(N). The steady-state mean response time grows roughly as log(N), increasing by about 7 ms for each doubling of the number of nodes.


Prepared by Adam M. Costello
Last modified: 2004-Jan-16-Fri 01:14:25 GMT