TCP congestion control (RFC 2581), continued: TCP-Reno (1990) included fast retransmit and fast recovery (also developed by Van Jacobson).

fast retransmit: Interpret three duplicate acks (dupacks) as an early indication of loss, before the timeout. (Duplicate acks could also be caused by reordering or duplication in the network.) Retransmit the packet and set the slow-start threshold to half the amount of unacked data, just as would be done after a timeout, but don't do slow start; just set the congestion window to the threshold.

fast recovery: After a fast retransmit, as long as the acks are duplicates, inflate the congestion window by one packet for each dupack (because each dupack presumably means that a packet has left the network). Once the missing data is acknowledged, deflate the congestion window back to the threshold.

These work well for one loss within a window, but not for more. NewReno (RFC 2582) interprets partial acks (acks that acknowledge the retransmitted packet, but not all the data sent before it) as indications of more losses, and sends more fast retransmissions. SACK (selective acknowledgment) (RFC 2018) is an extension to the TCP header that lets the receiver say exactly what has arrived and what hasn't. This gets quite complicated.

TCP-Vegas (not recommended) tries to use a constant amount of space in router buffers, rather than divide the entire available amount among contenders. It compares the round-trip time to the minimum round-trip time to infer the queueing time, which it tries to keep small. One potential problem is that the source might never see an undelayed packet. Also, it can't compete against Tahoe or Reno, so who would be willing to use it?

Ack clock: This is only tangentially related to congestion control. In any sliding window protocol, the source typically sends a new data packet whenever it receives an ack. Therefore, even if the first link is much faster than the slowest link on the path, making it possible for the source to send its data in bursts, the data packets will become spread out in time on the slow link, causing the acks to be spread out, causing the subsequent data packets to be spread out. Congestion is less severe when data flows are steady than when they are bursty.

Fairness: When network resources are shared by multiple TCP connections, there is a question of whether each connection gets a fair share of the resources.

bandwidth vs. storage: Users generally care about how much bandwidth they get, so one definition of fairness would be that each connection should get the same bandwidth. On the other hand, a path of ten 10 Mbps links can support either a single 10 Mbps connection spanning all ten links, or ten 10 Mbps connections each spanning one link. This suggests another notion of what resources are being shared: storage in the network (on links and in queues). If storage is divided equally, then each TCP connection should get the same bandwidth-delay product, which is roughly proportional to the window size. (The window size is the product of bandwidth and round-trip time, rather than one-way delay.) There is no consensus on what fairness should mean. TCP's congestion control algorithms tend to give roughly equal windows to competing connections.

additive increase multiplicative decrease: TCP's congestion control increases the window additively (adding one packet at a time in the absence of loss) and decreases it multiplicatively (cutting it in half when a loss is detected).
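To tie the window adjustments above together, here is a rough sketch in C of a Reno-style sender's reaction to acks, duplicate acks, and timeouts, counting the window in packets as these notes do. The structure and names (struct reno, on_new_ack, and so on) are made up for illustration; a real implementation keeps the window in bytes and tracks considerably more state (see RFC 2581).

    /* Sketch of Reno-style window adjustment.  The window is counted in
     * packets; "flight" is the amount of unacknowledged data, also in
     * packets.  This is an illustration, not a real implementation. */

    struct reno {
        double cwnd;       /* congestion window */
        double ssthresh;   /* slow-start threshold */
        int    dupacks;    /* consecutive duplicate acks seen */
        int    recovering; /* between a fast retransmit and the ack that ends it */
    };

    /* An ack for new data arrived. */
    void on_new_ack(struct reno *c)
    {
        if (c->recovering) {
            c->cwnd = c->ssthresh;     /* fast recovery ends: deflate the window */
            c->recovering = 0;
        } else if (c->cwnd < c->ssthresh) {
            c->cwnd += 1;              /* slow start: one packet per ack */
        } else {
            c->cwnd += 1.0 / c->cwnd;  /* congestion avoidance: about one packet per RTT */
        }
        c->dupacks = 0;
    }

    /* A duplicate ack arrived. */
    void on_dupack(struct reno *c, double flight)
    {
        c->dupacks++;
        if (c->dupacks == 3 && !c->recovering) {
            /* fast retransmit: treat three dupacks as a loss, halve the
             * threshold, and skip slow start */
            c->ssthresh = flight / 2;
            c->cwnd = c->ssthresh;
            c->recovering = 1;
            /* ...retransmit the missing packet here... */
        } else if (c->recovering) {
            c->cwnd += 1;              /* inflate: each dupack means a packet left the network */
        }
    }

    /* The retransmission timer expired. */
    void on_timeout(struct reno *c, double flight)
    {
        c->ssthresh = flight / 2;
        c->cwnd = 1;                   /* fall all the way back to slow start */
        c->dupacks = 0;
        c->recovering = 0;
    }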
One consequence of additive increase multiplicative decrease is that if the resources are allocated unfairly to start with, all connections increase their allocation at the same rate, but a connection with a larger share will decrease more quickly than one with a smaller share. Therefore an additive increase multiplicative decrease policy tends to make the allocation more fair over time.

different round-trip times: Actually, if competing TCP connections have different round-trip times (RTTs), one with a larger RTT will increase its window more slowly, because they all add one packet per RTT. On the other hand, a connection with a larger window is more likely to experience a loss (because it has more packets in the network that could be lost). Different RTTs complicate analysis of TCP, but to first order, competing connections get roughly equal window sizes on average.

RTT estimation: If TCP's retransmission timeout is too small, a slow packet will be misinterpreted as being lost, and an unnecessary retransmission will be sent. Before the introduction of congestion control in 1988, this was a minor annoyance--if 5% of the packets were mistakenly thought to be lost, then the source would send at 105% of the rate it needed to. But with congestion control, every window of packets that experiences at least one loss causes the congestion window to be halved. This will cause a connection that mistakenly thinks 5% of its packets are lost to have an average window size of about four packets, even if the network is uncongested. (For a packet loss probability of p, the average window size will be roughly sqrt(1/p) packets.)

The best predictor of future round-trip times is past round-trip times, so the addition of congestion control was accompanied by an improvement in round-trip time estimation. The old method used an exponentially weighted moving average to estimate only the mean RTT. Each ack provides one sample RTT, which is used to update the estimated mean, which is used to set the retransmission timeout:

    error = sampleRTT - mean
    mean += a * error      /* 0 <= a <= 1 */
    timeout = d * mean     /* d >= 1 */

The rationale behind exponentially weighted moving averages is that they are easy to compute (as shown above), they approach the true mean (because the error is more likely to push the estimate toward the true mean than away from it), and they track changes in the true mean (because more recent samples contribute more to the estimate than older samples). RFC 793 suggested values of 0.1 to 0.2 for a, and 1.3 to 2.0 for d.

The problem with estimating only the mean is that if d is too small, the random variation in the round-trip time may cause a significant number of packets to be mistakenly thought to be lost, while if d is too large, retransmissions will be delayed unnecessarily. The 1988 improvement estimates the variation in round-trip time as well as the mean. Technically, it estimates the "mean deviation", which is the average absolute difference between the samples and the mean. The mean deviation is estimated using an exponentially weighted moving average just like the one used to estimate the mean:

    error = sampleRTT - mean
    mean += a * error                /* 0 <= a <= 1 */
    sample_dev = |error|
    dev_error = sample_dev - mean_dev
    mean_dev += b * dev_error        /* 0 <= b <= 1 */
    timeout = mean + c * mean_dev    /* c >= 0 */

The usual values for the constants are a = 1/8, b = 1/4, c = 4. In practice this does a good job of adapting to the round-trip times of the channel, setting timeouts that are long enough but not much too long.
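The pseudocode above can be transcribed directly into C. The struct and function names below are mine, and the initialization of mean and mean_dev from the first sample (as well as the fixed-point arithmetic a real TCP would use) is omitted:

    /* The improved retransmission-timeout estimator, with the usual
     * constants a = 1/8, b = 1/4, c = 4.  rtt_update() takes one measured
     * round-trip time and returns the new timeout. */

    struct rtt_estimator {
        double mean;      /* estimated mean RTT */
        double mean_dev;  /* estimated mean deviation of the RTT */
    };

    double rtt_update(struct rtt_estimator *e, double sample_rtt)
    {
        double error = sample_rtt - e->mean;

        e->mean += error / 8;                      /* a = 1/8 */
        if (error < 0)
            error = -error;                        /* |error| */
        e->mean_dev += (error - e->mean_dev) / 4;  /* b = 1/4 */

        return e->mean + 4 * e->mean_dev;          /* timeout = mean + c * mean_dev, c = 4 */
    }

Choosing a and b to be powers of two is deliberate: in a fixed-point implementation the multiplications become shifts.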
Congestion control challenges: In recent years, the amount of UDP traffic has increased. It used to be used mainly for DNS, but now it is also used for real-time audio and video, since the reliability and delays of TCP are not suited to such applications. Since UDP currently has no congestion control, this poses a risk of congestion collapse if the majority of traffic on the Internet becomes UDP traffic. There is research being done to factor the congestion control out of TCP into a separate layer, so that it could be used by both UDP and TCP.

Another risk comes from TCP implementations that, either accidentally or intentionally, do not back off as much as they are supposed to. If too many TCP connections try to get more than their fair share of the network, we will have congestion collapse. One proposed solution is to add mechanisms to routers to detect high-bandwidth sources that don't respond to congestion signals and put their connections into "penalty boxes", so their packets get lower quality service than normal packets. This would provide an incentive for implementors to do congestion control properly.

One disadvantage of TCP's congestion control is that it always tries to keep the router queues full. A proposed method for keeping the queues mostly empty is called Random Early Detection (RED). Whenever a packet arrives, the router could randomly decide whether to drop a packet from the queue, with a higher probability if the queue is more full (a simplified sketch of this decision appears at the end of these notes). At low loads, the queue would usually be nearly empty, so very few packets would be dropped, while at higher loads, enough packets would be dropped to cause the sources to back off, without requiring the queue to fill up. The packet to be dropped would not necessarily be the one that just arrived, but would be selected at random, so that a source sending at a higher rate (one with more packets in the queue) is more likely to be signalled to slow down. If the packet to be dropped has the ECN-capable bit set, then the router can set the congestion-experienced bit rather than drop it (see the previous lecture notes).

With the growth of the web, more and more traffic takes the form of short-lived connections. Often the entire amount of data to be sent fits on the network path (within the bandwidth-delay product), meaning it could be transferred in one round-trip time. But the source has no way of knowing the capacity of the network beforehand, so to avoid congestion it gradually increases its window from one packet over several round-trip times. (For example, with an initial window of one packet, slow start takes four round-trip times to send 15 packets--1, then 2, then 4, then 8--even if all 15 would have fit in the pipe at once.) This is a frustrating underuse of the network, and it is a hard problem.
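As a rough illustration of the RED idea (the sketch referred to above), the drop decision might look like the following. The thresholds and maximum drop probability are made-up example values, the real algorithm works on an exponentially weighted average of the queue length rather than the instantaneous length, and this sketch only decides whether to drop: choosing which queued packet to drop (or to mark instead, if it is ECN-capable) is left out.

    #include <stdlib.h>

    /* Simplified RED-style drop decision.  MIN_THRESH, MAX_THRESH, and
     * MAX_PROB are illustrative values, not recommendations. */

    #define MIN_THRESH 5     /* below this queue length (packets), never drop */
    #define MAX_THRESH 15    /* at or above this, always drop (or mark) */
    #define MAX_PROB   0.1   /* drop probability as the queue nears MAX_THRESH */

    /* Return 1 if the arrival of a packet should trigger a drop (or an ECN
     * mark), given the current queue length in packets. */
    int red_should_drop(int queue_len)
    {
        double p;

        if (queue_len < MIN_THRESH)
            return 0;
        if (queue_len >= MAX_THRESH)
            return 1;

        /* probability rises linearly from 0 to MAX_PROB between the thresholds */
        p = MAX_PROB * (queue_len - MIN_THRESH) / (MAX_THRESH - MIN_THRESH);
        return ((double)rand() / RAND_MAX) < p;
    }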