SPEAKER NOTES

TUNING HOSTS FOR NETWORK PERFORMANCE
Glen Turner
2008-01-29 1055
Sysadmin miniconf of linux.conf.au
Theatre A, Old Arts Building, University of Melbourne

SOFT INTRO

I promise
- No OSI seven-layer protocol model
- No protocol header diagrams

TCP TUNING INTRODUCTION

To date most factors inhibiting end-to-end network performance have been in the network. The reason a network engineer is speaking at a sysadmin miniconf is that the performance issues have now moved into computers and their operating systems.

BUGS

TSO, on some hardware. Ensure your hardware choice has a NAPI driver.

The Linux networking implementation is the best there is, and researchers in the field have moved from BSD UNIX to Linux, so new ideas appear in Linux earlier. But these two factors place more pressure on the Linux networking implementation. No one uses Windows XP to transfer files at 10Gbps, and anyone who does have such a machine doesn't expect it to work. Linux is expected to work.

MOTIVATION

Why is a network engineer at a systems administration miniconf? The performance of well-designed networks has reached a peak.

1. Bandwidth is either cheap, or nearly unobtainable and extortionately priced.

2. ASIC-based routers forward packets with no avoidable latency, and thus low jitter.

3. The remaining problems we cannot fix with engineering:
   a. Extortionate pricing is due to political or market competition factors.
   b. Reducing latency requires either increasing the speed of light in fiber, or reducing the diameter of the globe. Light in fiber speeds up by 1% per year; Earth's diameter reduces by about 1mm per year.

But the average user can't fill a 1Gbps link across or out of Australia. The reasons for this are found in
- applications
- operating systems
- fundamental algorithms
- host hardware

PART 1. FUNDAMENTAL ALGORITHMS

TCP -- TRANSMISSION CONTROL PROTOCOL

[screen shows packet capture with TCP details exploded]

The user's view is that TCP implements a "connection", providing multiplexing, reliable and in-order delivery, and flow control between the two hosts' applications.

The network designer's view is that TCP provides cooperative sharing of link bandwidths and avoids the congestion collapse of the Internet.

TCP builds on IP, whose job is to get packets from one host to another through the network. Applications build on TCP, such as HTTP for web browsing. In 1984 we would have called TCP the "transport layer".

The strength of TCP is that it implements a lot of services using one mechanism -- the "window".

TCP'S WINDOW

Every byte has a sequence number. The sender tracks bytes sent and bytes acked, holding un-acked bytes in a buffer. The receiver buffers incoming segments and Acks every second segment.

The receiver implements flow control by lowering the advertised window in its Acks as receiver buffer is consumed. If the application stops reading, the buffer will fill, the advertised window will drop to zero, and the sender will stall.

If the sender does not see an Ack within the expected variability of Acks then it assumes the sent packets were discarded at a congested link. The "expected variability" is determined by using the returning Acks to maintain an estimate of the round-trip time and its variance.

The amount of data to re-send must be less than the advertised window, since sending that amount of data caused congestion. So the sender maintains a "congestion window" -- an estimate of the number of bytes which can be sent without causing congestion (ie, losing packets). A sketch of the sender's bookkeeping follows.
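For the notes, a minimal C sketch of the sender-side bookkeeping just described. This is illustrative only, not the kernel's implementation; the struct and function names are invented.

  /* Illustrative sketch of TCP sender window bookkeeping. */
  #include <stdint.h>

  struct sender {
      uint32_t snd_una; /* oldest un-acked sequence number */
      uint32_t snd_nxt; /* next sequence number to send */
      uint32_t rwnd;    /* receiver's advertised window (flow control) */
      uint32_t cwnd;    /* congestion window (congestion control) */
  };

  /* Bytes we may put on the wire right now: the lesser of the
   * two windows, minus what is already in flight and un-acked. */
  static uint32_t send_allowance(const struct sender *s)
  {
      uint32_t window = s->rwnd < s->cwnd ? s->rwnd : s->cwnd;
      uint32_t in_flight = s->snd_nxt - s->snd_una;
      return in_flight < window ? window - in_flight : 0;
  }

Note how flow control (rwnd) and congestion control (cwnd) are enforced by the same min() -- this is the "one mechanism" point above.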
When we have a late or duplicate Ack the sender enters "congestion avoidance" mode, which linearly approaches the expected congestion bandwidth.

SLOW START

We don't want to cause congestion collapse when starting a connection. We have no estimate of the available bandwidth, and this limits how much data we can safely transmit. At the same time, a linear approach to the congestion bandwidth might take a very long time. So slow start does an exponential approach, a backoff when congestion is experienced, and then a linear approach.

- congestion window (cwnd) = two segments (twice the MSS advertised by the receiver)
- transmit min(cwnd, receiver's advertised window)
- for each Ack, increase cwnd by one segment [this doubles the window size per round-trip time]

When an Ack is too late, we know that congestion has occurred and that cwnd was increased too far. The "slow start threshold" (ssthresh) is set to half the cwnd. We resume slow start from the initial window until ssthresh is reached, then we enter congestion avoidance mode.

CONGESTION AVOIDANCE

Increment cwnd by one segment per round-trip time, which gives a linear approach to the suspected congestion bandwidth. If an Ack is late, reduce cwnd to one segment and re-enter slow start. An improvement is to reduce cwnd to ssthresh instead of immediately dropping all the way to one segment.

OUT OF ORDER SEGMENTS (RENO)

This algorithm is very sensitive to re-ordered packets. So rather than immediately lower cwnd on an out-of-order Ack we wait. Hopefully we will get an Ack from the receiver which incorporates the late packet. When the third duplicate Ack arrives we know the late packet has been lost: retransmit it and enter congestion avoidance.

HOST BUFFER SIZING

Both the sender and receiver need to buffer data. The sender keeps un-acked data; the receiver keeps received data which is yet to be delivered to the application. Both buffer sizes work out to be the bandwidth-delay product (BDP), with the sender's buffer being critical. Note that the worst case of the BDP is easy to compute, since the bandwidth of the interface is obviously the maximum bandwidth, and a guess at the maximum delay can be verified using ping.

PART 2. OPERATING SYSTEMS

BUFFER SIZING IN LINUX

Linux makes one change to the TCP algorithm: it caches ssthresh between connections. For testing this can be disabled with

  net.ipv4.tcp_no_metrics_save = 1

The kernel attempts to autotune the buffer sizes, up to 4MB. If a quick calculation shows an expected BDP of less than 4MB you need do nothing. This is the case for ADSL and 802.11g in Australia. For GbE interfaces you need to raise

  net.ipv4.tcp_rmem
  net.ipv4.tcp_wmem

These are vectors of actual kernel memory usage, not packet counts, so inflate the BDP to cover the kernel's data structures. (A worked example follows at the end of this part.) You need to watch the initial value, as it can be used to leverage a DoS attack if set high.

APPLICATION BUFFER SIZING

An application can use

  setsockopt(..., SO_SNDBUF, ...)
  setsockopt(..., SO_RCVBUF, ...)

to explicitly set buffer sizes. The requested value is trimmed to

  net.core.rmem_max
  net.core.wmem_max

Using these system calls disables autotuning for the connection. So iperf, which always sets a buffer size, doesn't reflect Linux networking performance. (A sketch follows at the end of this part.)

DISTRIBUTIONS

Some distributions disable useful TCP features; turn them back on.

  net.ipv4.tcp_timestamps = 1
  net.ipv4.tcp_window_scaling = 1
  net.ipv4.tcp_sack = 1
  net.ipv4.tcp_ecn = 1
  net.ipv4.tcp_syncookies = 1
  net.ipv4.tcp_moderate_rcvbuf = 1
  net.ipv4.tcp_adv_win_scale = 7

ECN and window scaling expose bugs in some network equipment.
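The worked example promised above. The path figures are illustrative: a 1Gbps path with a 50ms round-trip time has a BDP of 1e9/8 x 0.05, about 6.25MB, above the 4MB autotuning ceiling, so the maxima must be raised. The sysctls are real; the values are suggestions only, inflated to 16MB to leave headroom for kernel data structures.

  # Illustrative sysctl.conf entries for a ~6MB BDP path
  # (1Gbps x 50ms RTT), inflated for kernel overheads.
  net.core.rmem_max = 16777216
  net.core.wmem_max = 16777216
  # min, default, max (bytes). Keep the default modest --
  # a large initial value can be leveraged for a DoS.
  net.ipv4.tcp_rmem = 4096 87380 16777216
  net.ipv4.tcp_wmem = 4096 65536 16777216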
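And the promised sketch of an application setting its own buffer sizes, as iperf does. The helper name is invented and error handling is minimal; remember that making these calls disables autotuning for the connection.

  /* Sketch: explicitly size socket buffers (disables autotuning). */
  #include <sys/socket.h>

  int set_buffers(int fd, int bytes)
  {
      /* The kernel trims each request to net.core.{wmem,rmem}_max. */
      if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bytes, sizeof(bytes)) < 0)
          return -1;
      if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) < 0)
          return -1;
      return 0;
  }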
TCP ALGORITHM VARIATIONS

There are a lot of modifications to the basic TCP algorithm. Unfortunately, there is also a large degree of academic hype, since there is a lot of competition among academics to have their algorithm chosen as the next-generation TCP algorithm.

The algorithms try to achieve:
- shorter slow start time
- less "hunting" around the congestion bandwidth
- greater resilience to non-congestion loss

Linux has pluggable TCP kernel modules. The interesting ones are:
- CUBIC, the Linux default, which features a nice slow start algorithm
- Westwood+, for wireless networks with link losses
- Hamilton TCP, nice fairness (fairness is important if you want multiple connections with similar performance)

Now for the bad news. The TCP variant is set by the *sender* and can't be negotiated. So for the typical case of a web server sending to a WLAN ADSL router it is the web server which needs to use Westwood for there to be any effect.

MTU -- MAXIMUM TRANSMISSION UNIT

The largest packet size passed by a path, measured in bytes of link layer payload. The larger the packet, the less per-packet overhead for the operating system. This was the original motivation for an MTU of 9000 bytes for gigabit ethernet jumbo frames and 64KB for 10GE super jumbo frames. There is also a theoretical consideration above 1Gbps: TCP cannot fill a long link with an MTU of 1500 bytes.

LOW MEMORY

TCP and disk buffers both use low memory. If you run both hard then low memory fragments and the kernel dies. This happens all the time on our networked backup server, usually around the 12TB mark. There is finally work happening in the kernel to address this. The ability to force such work upon the vendors is a major reason for paying for support, so you might want to read your support contract through that lens. Contracts which list events seem to leave a lot of wriggle room for network performance.

An easy way to get more low memory is to run a 64-bit kernel. 2.6.24 has merged anti-fragmentation patches.

VIRTUALISATION

Don't.

IPTABLES

One of the major advances in TCP throughput came when Van Jacobson recognised that buffer copying was consuming a fair part of the time taken to process a packet. Unfortunately iptables does a fair bit of buffer copying. Turn it off if you dare. But note that compromised Linux machines are the major source of attacks against network infrastructure and vendors (ie, the nasty people steal accounts from Windows servers, but then prefer the stability of Linux for doing their evil work).

INSTRUMENTATION

WIRESHARK

A packet sniffer, and a good one. Use a passive optical tap if you are worried about the effect of running Wireshark on the platform you are monitoring.

KERNEL NETLINK TCP STATE CHANGES

A new API.

WEB100

Instruments everything. Some very nice GUI tools for pointing the finger at the component which is slowing things down. Web100 servers. A per-connection snapshot can also be had from a stock kernel, as sketched below.
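An aside for the notes, not from the slides: a stock Linux kernel already exposes per-connection TCP state through the TCP_INFO socket option, which is handy when Web100 isn't available. A minimal sketch:

  /* Sketch: query the kernel's per-connection TCP state. */
  #include <stdio.h>
  #include <netinet/in.h>
  #include <netinet/tcp.h>
  #include <sys/socket.h>

  void print_tcp_state(int fd)
  {
      struct tcp_info ti;
      socklen_t len = sizeof(ti);

      /* Reports cwnd, ssthresh and the RTT estimate discussed in Part 1. */
      if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) == 0)
          printf("cwnd=%u segs ssthresh=%u rtt=%uus rttvar=%uus\n",
                 ti.tcpi_snd_cwnd, ti.tcpi_snd_ssthresh,
                 ti.tcpi_rtt, ti.tcpi_rttvar);
  }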
PART 3. APPLICATIONS

SSH

Ssh implements its own windowing, and that window size has been inadequate. This has been fixed in recent ssh.

Ssh places a lot of load on a machine by encrypting the payload. Most science data is sensor white noise, and the file transfer is to a supercomputer which can find the small signal in the output of this maxed-out sensor. Encryption isn't necessary; it has already been done by nature.

LATENCY

We can always improve bandwidth, by replacing active components. Improving latency requires new fibre paths. Creating a more direct path takes a lot of time and effort and may not be financially possible. If there is already a direct-ish path then the speed of light in fiber improves by about 10% per decade.

Applications are incredibly wasteful of round-trips, and thus consume heaps of latency, which is the fundamentally limited resource. Examples:
- CSS stylesheets
- HTML images
- extensive negotiations at protocol start

NFS AND DELAYED ACKS

Every second packet triggers an Ack. But an 8KB disk block occupies three packets, so every disk block's final packet waits out the receiver's delayed-Ack timer.

PART 4. NETWORKS

LOSS

Loss is an indicator of congestion. But in some media it is also a fact of life. Reduce loss to the minimum possible:
- use wired networks rather than wireless
- record CRC errors on interfaces
- use ethernet auto-negotiation, as manual configuration is widely misunderstood

FIREWALLS

The claims of firewall vendors are slightly more believable than those of anti-virus scanners. Most firewalls are PCs running Linux, so they simply move the problem to a less transparent platform.

Firewalls traditionally have TCP-affecting bugs:
- SACK corruption in PIX (Cisco have fixed this; Ilpo Järvinen's fix in 2.6.24 should stop it locking up all connections)
- discarding packets with ECN

So firewall code needs to be kept up to date, but network security officers are oddly resistant to keeping their firewall code up to date.

ASYMMETRIC PATHS

- Timer effects.
- Congestion of the returning Acks -- this often happens on ADSL links. If you run ADSL2+ at maximum rate you want Annex M to increase the capacity available for Acks.

SOURCES OF NON-SIGNALLING PACKET LOSS

PATHS

Some paths are bad news:
- satellite
- undersea cable, older is worse
- DWDM -- there's a fundamental opposition between bandwidth and loss. We want less bandwidth and less loss; the network operator gets paid for bandwidth, so they want more bandwidth, whatever the effect on goodput.

PART 5. HARDWARE

VALIDATION

Largely a myth. There is a lot of faulty hardware out there. It is difficult even to select hardware which should work well:
- eg, which ethernet cards support NAPI, jumbos, checksumming, TSO, LRO?
- and do you choose the ethernet card, or do you live with the motherboard manufacturer's choice (ie, the cheapest)?

TCP OFFLOAD

There's a certain point in the hardware development cycle where it makes sense; outside of that temporary situation it is a nightmare. Imagine yourself as the proud owner of a 1Gbps TCP offload card today.

10GE equipment should support a 64KB MTU. There are no standard jumbo frames, so be careful when you buy.

PART 6. LINUX AS A ROUTER

ROUTER BUFFER SIZING