Tangle specification 0.3.0 (2003-Jul-29-Tue) Adam M. Costello http://www.nicemice.net/amc/ Introduction ============ Tangle is a peer-to-peer protocol for mapping resource labels to resource locations; it is a kind of distributed hash table. The design of Tangle has been influenced by a particular higher-layer application, Buzz, which is like the Web except that resources are not associated with any particular servers. This specification is not final. Some details have yet to be filled in, and there is not yet a strong commitment to backward compatibility as the specification evolves. Model ===== The same Tangle protocol is run on every node. Each node has a node address, plus a message address for sending Tangle protocol messages to the node, and a connection address used by higher layers (like Buzz, which uses the connection address for transporting entities). The message and connection addresses are relative to the node address, and are useful only in conjunction with it. (For example, on the internet, the node address is an IP address, the message address is a UDP port number, and the connection address is a TCP port number.) A well-known one-way function, applied to the node address, yields the node's label, which is a bit string. Resources also have labels, and a node will tend to know the locations of resources whose labels are "close" to the node's label. The closeness of two labels is measured by the length of their longest common prefix. The one-way function from addresses to labels has not yet been defined. Data structures =============== Historic peers -------------- Each node maintains a list of historic peers, where each peer consists of a node address and a message address. This list is kept in stable storage, and changes relatively slowly while the node runs. Before a node is run for the first time, this list needs to be initialized manually. Peers are removed from the list only as they are added, so that the list never shrinks. Whenever a peer is added to the list, it is added at the front. Then, if the list length exceeds some threshold r, one peer is removed as follows: With probability p, the front peer is removed, otherwise with probability p, the next peer is removed, and so on. If the list is exhausted before the process stops, then no peer is removed. Therefore the list grows over time, and its expected length is O(log(t)). More precisely, after n additions to an initially empty list, the length is n for n <= r, and about ln(cn-cr+exp(cr)) / c for n > r, where c = ln(1/(1-p)). [PARAMETER: The probability p, and the threshold r.] [The goal of this procedure is to achieve the following property: If peers are added at a uniform rate, then the peers in the list will have ages such that log(age) is uniformly distributed. The purpose of the threshold is to grow the list quickly at the beginning.] Unconfirmed peers ----------------- Each node X maintains an array of sets of unconfirmed peers, one set for each bit position in its label. The set corresponding to the first bit position is called the "top" set, and the set corresponding to the last bit position is called the "bottom" set. Each unconfirmed peer consists of its node address and its message address. The set corresponding to particular bit position contains peers whose labels disagree with X's label in that position, but agree with X's label in all earlier positions. Notice that for any label, there is exactly one set that is appropriate for that label. (Exception: there is no set appropriate for the node's own label.) Sets do not contain duplicate node addresses; if a peer is being added with the same node address as a peer already in the set, the new peer replaces the old one (the message addresses might be different). There is an upper limit on the size of a set; whenever a set momentarily exceeds this limit (which should be finite), a member is chosen uniformly at random and removed, until the set size is back down to the limit. [PARAMETER: Unconfirmed peer set limit.] Syntactically invalid addresses should not be admitted into the sets. Confirmed peers --------------- Each node X maintains an array of sets of confirmed peers, one set for each bit position in its label, similar to the unconfirmed array, but a confirmed peer contains more data: node and message and connection addresses of the peer label of the peer (computable from addresses) round-trip time between X and the peer node parameters of the peer The node parameters are: uptime (how long the peer had been up when it was added to the set) bandwidth between the peer and the "core" of the network free space on the peer (available storage) Confirmed peers are removed from the confirmed set whenever they've been in the set too long. [PARAMETER: Confirmed peer age limit, which should vary in proportion to sqrt(max(n,1)), where n is the number of nonempty unconfirmed peer sets above the uppermost empty one.] Implementations may also limit the size of confirmed sets in the same way they limit the size of unconfirmed sets (and should do so if operations on the set take O(n) time). [PARAMETER: Confirmed peer set limit.] Like unconfirmed sets, confirmed sets do not contain duplicate node addresses; if a peer is being added with the same node address as a peer already in the set, the new peer replaces the old one (and the age is reset to zero). A collection of node & messages addresses of recently confirmed peers should occasionally be written to stable storage, to be used to initialize the unconfirmed peers when the node starts up next time. Location map ------------ Each node maintains a map from resource labels to sets of locations. A location consists of: node and message and connection addresses of a node caching the resource resource parameters The resource parameters are: timestamp of the resource (which version is cached) completeness of the resource (how much of it is cached) old flag (is this version thought to be out of date?) Locations are removed from the map whenever they reach a certain age (counted from the time they were added to the map). Implementations should not impose any limit on the number of resources in the map. They may impose a limit on the number of locations in the set corresponding to a single resource label (and should do so if operations on the set take O(n) time). When a set momentarily exceeds this limit, remove one location, either the oldest or one chosen uniformly at random. [PARAMETER: Location map set limit.] Among the locations corresponding to a single resource label, each pair appears at most once. Adding a new location that would collide with an existing location causes the old location to be replaced by the new location (whose age is then zero). There is an invariant regarding the old flags: Within the set of locations corresponding to a single resource label, a location's flag is turned on if the set contains another location with a strictly newer timestamp. Whenever a location is added, flags are turned on in order to preserve the invariant. Flags are never turned off. For resource parameters appearing outside the location map (in advertisements and local resources, see below), the flags are not automatically changed, and there is no invariant. Advertisement buffers --------------------- Each node X maintains an array of buffers of advertisements, one buffer for each bit position in X's label. Each advertisement consists of: label of a resource location of the resource The buffer for a particular bit position contains advertisements whose labels disagree with X's label in that position, but agree with X's label in all earlier positions (the same rule as for the peer sets). There is a limit on the occupancy of a buffer, and a limit on how much time an advertisement may stay in the buffer. When either limit is reached, all advertisements in the buffer are removed and sent in a post message. [PARAMETER: Advertisement buffer occupancy and age limits. The occupancy cannot exceed 255, because a post message cannot hold more than 255 advertisements. An implementation can additionally impose another limit to keep the message size below some maximum.] Local resources --------------- Each node maintains a set of local resources. For each resource located at that node, the set contains an element consisting of: label of the resource resource parameters (See the location map section above for the resource parameters.) Message types ============= Every message contains a group identifier, which indicates which population of nodes the message belongs to. Messages are not intentionally sent from one group to another. Each group agrees on a single version of Tangle, and on a higher-layer protocol to use on top of Tangle. There are five types of messages, with the following meanings and contents (in addition to the group identifier): post: "Store these resource locations in the distributed hash table." list of advertisements query: "Tell me locations for this resource." node and message addresses of the requestor label of the resource cookie (opaque data to be copied blindly) report: "This resource can be found at these locations." node and message addresses of the final destination (optional) cookie (if evoked by a query) label of the resource list of locations (empty means not found) ping: "Are you there?" node and message addresses of the sender cookie (opaque data to be copied into a pong) pong: "Yes, I'm here, and here are some other nodes to try." cookie (optional, to be copied into another pong) cookie (copied) node and message and connection addresses of the sender list of node and message addresses of other nodes node parameters of the sender Protocol activities =================== Quasi-periodicity ----------------- Something is said to be quasi-periodic with period T if it happens repeatedly and the time between successive occurrences is T on average. It is recommended that each inter-occurrence time be an independent random variable uniformly distributed between T/2 and 3T/2. Choosing confirmed peers ------------------------ Several actions described below require confirmed peers to be chosen for various purposes. Such choices are made as follows. First, if a particular confirmed set has not been specified then choose a confirmed set. Second, choose a peer from the designated set. If the specified set is empty, or if no set could be chosen because all sets are empty, then no peer can be chosen, and the action that required a confirmed peer cannot be performed. If a set needs to be chosen, start at the specified starting set (or at the top set if no starting set is specified), then flip a coin repeatedly and move down one set for each head, until a tail comes up. Then, if that set is empty, move up until a non-empty set is found; if you run past the top, move down until a non-empty set is found. When choosing a peer from a set, use a uniform random distribution unless some other distribution is explicitly invited. If the distribution to be used is non-random and no confirmed set has been specified (whenever this happens no starting set is specified either) then don't choose a set, but instead choose the peer from among the union of all the sets. Peer discovery -------------- Quasi-periodically send a ping to a peer. Choose the peer as follows: Let n be the number of nonempty unconfirmed sets. For the purposes of this activity, treat all sets beyond the first empty set as being part of last nonempty set (or of the first set, if it is empty). Flip a coin repeatedly, and if n heads come up before the first tails, choose a historic peer uniformly at random and send the ping to it (notice that if there are no unconfirmed peers, we are guaranteed to use a historic peer). Otherwise, choose an unconfirmed set uniformly at random from among the n nonempty sets (or cycle through them in a round-robin fashion), then choose a peer from that set uniformly at random, remove it from the set, and send the ping to it. [PARAMETER: Unconfirmed peer ping period, which is suggested to vary in proportion to 1/sqrt(max(n,1)).] [The use of historic peers allows probable recovery from network partitions.] When a ping arrives, reply with a pong containing two cookies (one newly created, and a second copied from the ping). When sending a pong, choose confirmed peers at random to form the list of other peers. For each selection, the starting set is determined as follows: From the set appropriate for the label of the pong-destination, flip coins and move up one set for each head until a tail comes up (or until the top is reached). [PARAMETER: How many peer selections should be performed for one pong? Tentative recommendation: 2.] When a pong arrives, use the copied cookie to validate the pong and obtain a round-trip time (see the message formats section for suggestions how to do this). If the pong is valid, add the peer to the appropriate confirmed set. Add the other peers to the appropriate unconfirmed sets. If the pong contains two cookies, reply with a pong containing one cookie (copied from the first cookie of the received pong). When a confirmed peer reaches its age limit, remove it from the set. Quasi-periodically choose a confirmed peer and add it to the list of historic peers. [PARAMETER: Historic peer addition period.] Hash table maintenance ---------------------- When a post message arrives, add each advertisement to the appropriate buffer, and add the location to the label's set in the map. But don't add the advertisement to the buffer if enough locations for this label are already known in the map. Only locations at least as complete *and* at least as new count toward inhibiting the addition. [PARAMETER: How many locations is enough?] For each local resource, quasi-periodically create an advertisement for it and process it as if it had arrived in a post message, except choose the buffer randomly instead of using the buffer appropriate for the label. Start with the top buffer, then flip coins and move down one buffer for each head until a tail comes up. [PARAMETER: Advertisement creation period.] When an advertisement buffer reaches its limit on the number of advertisements or the age of the oldest advertisement, remove all the advertisements and send them in a post message to a peer chosen uniformly at random from the confirmed set corresponding to the same bit position as the buffer. When a location is added to the map, old flags might change from off to on. Send cookieless indirect reports (see below) to the locations whose flags changed. There are two cases: If the location that was added gets flagged (because the set contains a newer timestamp), then send the report to that location informing it of all the newer locations. If locations already in the set get flagged (because they are older than the location that was added), then send one report to each location that got flagged (unless its node address matches the location that was added) informing it of the location that was added. Both cases cannot happen at once because of the invariant on the flags. Hash table lookups ------------------ To look up locations for a resource label, send queries to yourself repeatedly for as long as additional reports are desired. [PARAMETER: Query period, which should perhaps adapt to the average response time.] Reports can deliver additional locations, or already-learned locations, or indications that the resource appears not to exist. When a query arrives compare its resource label to your own node label and find the first bit position in which they differ. The confirmed peer set corresponding to that bit position is the "closer-set"; it contains all the confirmed peers who are "closer" to the resource label than you are. If that confirmed set is empty, or if any locations of the requested resource are known, send to the requestor an indirect report (see below) containing all known locations of the resource (even if there are none) and containing the cookie copied from the query. If the query arrived from yourself then choose a confirmed peer (and if using a non-random distribution then restrict the choice to the closer-set) and forward the query to that peer. If the query did not arrive from yourself and the number of known locations is not "enough" then forward the query to a peer chosen from the closer-set. Locations that are incomplete or flagged as old don't count for inhibiting the forwarding. [PARAMETER: How many locations is enough? Same as for advertisements.] The distribution for choosing the peer can be uniform random, or skewed to favor lower-cost nodes (but too much skewing can impair fault tolerance), or any distribution invited by options in the query message. The cost of a node might be proportional to its round-trip time and/or inversely proportional to the number of additional label bits it has in common with the requested label. Reports ------- When composing a report, the list of locations can be split across multiple report messages in order to bound the message size. When sending an indirect report, put the final destination in the message, choose a confirmed peer, and send the message to it. For reports sent in response to queries, the distribution may be skewed to favor peers with shorter round-trip times. But when the final destination is yourself, don't choose a peer at all, send it to yourself instead (to avoid unnecessary network usage and to allow a one-node community to function) (and in this case you can optionally bypass the indirection and send a direct report instead). When an indirect report (one containing a message address) arrives, remove the message address and forward the resulting report to that address. [Without the indirection it would be trivial for any node to discover other nodes whose labels are close to any given label. That ability would be much more useful to attackers than to legitimate users of the system.] When a direct report (one lacking a message address) arrives with a cookie, and the resource label is one of interest to the higher layer, deliver the locations up to the higher layer (or if the report contains no locations, deliver a non-definitive indication that the resource appears not to exist). When a direct report arrives without a cookie, compare its locations to the local resources. For any local resource whose label matches the report, and whose old flag is off, and whose timestamp is older than any of the timestamps in the report, inform the higher layer of the locations of the newer versions. Message formats =============== message ::= body options type group_identifier group_identifier ::= byte*N length_byte (N == length_byte) length_byte ::= byte type ::= byte options ::= byte* body ::= ping (type==1==001) | pong_one_cookie (type==2==010) | pong_two_cookies (type==3==011) | report_no_cookie (type==4==100) | report_cookie (type==5==101) | post (type==6==110) | query (type==7==111) ping ::= node_address message_address cookie pong_one_cookie ::= cookie node_address message_address connection_address node_list node_parameters node_list ::= length_byte node_message_address*N (N == length_byte) node_message_address ::= node_address message_address node_parameters ::= uptime bandwidth free_space pong_two_cookies ::= cookie pong_one_cookie report_no_cookie ::= destination label location_list report_cookie ::= destination cookie label location_list destination ::= direct | indirect direct ::= byte==0 indirect ::= byte==1 node_address message_address byte==0 location_list ::= length_byte location*N (N == length_byte) location ::= node_address message_address connection_address resource_parameters resource_parameters ::= timestamp_flag completeness post ::= length_byte advertisement*N (N == length_byte) advertisement ::= label location query ::= node_address message_address label cookie node_address ::= length_byte byte*N (N == length_byte) (A length_byte of 4 means IPv4. A length_byte of 16 means IPv6. Other address formats may be added in the future.) message_address ::= length_byte byte*N (N == length_byte) (A length_byte of 2 means UDP. Other address formats may be added in the future.) connection_address ::= length_byte byte*N (N == length_byte) (A length_byte of 2 means TCP. Other address formats may be added in the future.) label ::= byte*N (N is implied by the group identifier) cookie ::= length_byte byte*N (N == length_byte) (For pings, it is suggested that the cookie contain a timestamp and a keyed one-way hash covering the timestamp and the peer's node & message addresses. Then when the cookie comes back in a pong, both the timestamp and the peer's addresses can be verified. For queries, the cookie could contain a timestamp and a keyed one-way hash covering the timestamp and the label being sought. Then when the cookie comes back in a report, the timestamp can be verified and used to estimate the average response time. The inclusion of the peer address or label in the hash reduces (but does not eliminate) the vulnerability to replay attacks. The representations can be platform-specific, since the cookie is interpreted only by its creator.) uptime ::= usf (in seconds) bandwidth ::= usf (in bits per second) free_space ::= usf (in bytes) usf ::= byte*2 (Unsigned short float. For bytes B1 B2, the value is B2 if B1 is zero, otherwise (256 + B2) << (B1 - 1).) timestamp_flag ::= byte*5 (The most significant bit of the first byte is the old flag. The rest is a POSIX timestamp, in network byte order (big-endian, most significant first).) completeness ::= byte (A value of N means that at least the first N/255 of the resource is present. A value of zero means that a certain minimal amount is present, specified by the upper layer.) Message options =============== Options are defined for query and report messages; for other message types, options must not be sent, and must be ignored if received. For report messages, only the first option byte is defined; any more must not be sent, and must be ignored if received. Within this byte, the least significant seven bits are reserved, and must be sent as zeroes, and ignored on receipt. If the most significant bit is set, it indicates an exact match between the label in the message and the label of its sender (the original sender, not the relayer). This option can be useful for testing the reachability of a particular node: do a lookup on the node's label and watch for a report with the option bit set. For query messages, only the first option byte is defined; any more must not be sent, and must be ignored if received. Within this byte, the least significant six bits are reserved, and must be sent as zeros, and ignored on receipt. The two most significant bits provide a forwarding hint, indicating the kind of distribution to be used when choosing the next hop for the query and for any indirect reports evoked by the query. 00: any random distribution designed to reduce the total time 01: any random distribution designed to reduce the total hop count 10: uniform random 11: any deterministic choice designed to minimize the total time If the option is not present, the default forwarding hint is 00. The hint should not be changed when the query is relayed. Nodes are not required to respect the hint, but note that non-random next-hops are not allowed except when hint 11 is present. Implementations are encouraged to set and respect the defined options, but cannot assume that their peers do so. Group identifiers ================= Group identifiers with length 0 to 4 are reserved. Group identifiers with length greater than 4 are available for private use. They are not registered, so collisions are possible, but the choice of identifier can make them arbitrarily unlikely.