SPEAKER NOTES

TUNING HOSTS FOR NETWORK PERFORMANCE
Glen Turner
2008-01-29 1055
Sysadmin miniconf of linux.conf.au
Theatre A, Old Arts Building, University of Melbourne

SOFT INTRO

I promise
- No OSI seven-layer protocol model
- No protocol header diagrams

TCP TUNING INTRODUCTION

To date most factors inhibiting end-to-end network performance have been in the network. The reason a network engineer is speaking at a sysadmin miniconf is that the performance issues have now moved into computers and their operating systems.

BUGS

TSO, on some hardware. Ensure your hardware choice has a NAPI driver.

The Linux networking implementation is the best there is, and researchers in the field have moved from BSD UNIX to Linux, so new ideas appear in Linux earlier. But these two factors place more pressure on the Linux networking implementation. No one uses Windows XP to transfer files at 10Gbps, and anyone who does have such a machine doesn't expect it to work. Linux is expected to work.

MOTIVATION

Why is a network engineer at a systems administration miniconf? The performance of well-designed networks has reached a peak.

1. Bandwidth is either cheap, or nearly unobtainable and extortionately priced.

2. ASIC-based routers forward packets with no avoidable latency, and thus low jitter.

3. The remaining problems we cannot fix with engineering:
   a. Extortionate pricing is due to political or market competition factors.
   b. Reducing latency requires either increasing the speed of light in fiber, or reducing the diameter of the globe. Light in fiber speeds up by 1% per year; Earth's diameter reduces by about 1mm per year.

But the average user can't fill a 1Gbps link across or out of Australia. The reasons for this are found in
- applications
- operating systems
- fundamental algorithms
- host hardware

PART 1. FUNDAMENTAL ALGORITHMS

TCP -- TRANSMISSION CONTROL PROTOCOL

[screen shows packet capture with TCP details exploded]

The user's view is that TCP implements a "connection", providing multiplexing, reliable and in-order delivery, and flow control between the two hosts' applications.

The network designer's view is that TCP provides cooperative sharing of link bandwidths and avoids the congestion collapse of the Internet.

TCP builds on IP, whose job is to get packets from one host to another through the network. Applications build on TCP, such as HTTP for web browsing. In 1984 we would have called TCP the "transport layer".

The strength of TCP is that it implements a lot of services using one mechanism -- the "window".

TCP'S WINDOW

Every byte has a sequence number. The sender tracks bytes sent and bytes acked, holding un-acked bytes in a buffer. The receiver buffers incoming segments and Acks every second segment.

The receiver implements flow control by lowering the advertised window in its Acks as receiver buffer is consumed. If the application stops reading, the buffer will fill, the advertised window will drop to zero, and the sender will stall.

If the sender does not see an Ack within the expected variability of Acks then it assumes the sent packets were discarded at a congested link. The "expected variability" is determined by using the returning Acks to maintain an estimate of the round-trip time and its variance.

The amount of data to re-send must be less than the advertised window, since sending that amount of data caused congestion. So the sender maintains a "congestion window" -- an estimate of the number of bytes which can be sent without causing congestion (ie, losing packets). A sketch of the sender's bookkeeping follows.
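For the notes, a minimal C sketch of the sender-side bookkeeping just described. This is illustrative only, not the kernel's implementation; the struct and function names are invented.

  /* Illustrative sketch of TCP sender window bookkeeping. */
  #include <stdint.h>

  struct sender {
      uint32_t snd_una; /* oldest un-acked sequence number */
      uint32_t snd_nxt; /* next sequence number to send */
      uint32_t rwnd;    /* receiver's advertised window (flow control) */
      uint32_t cwnd;    /* congestion window (congestion control) */
  };

  /* Bytes we may put on the wire right now: the lesser of the
   * two windows, minus what is already in flight and un-acked. */
  static uint32_t send_allowance(const struct sender *s)
  {
      uint32_t window = s->rwnd < s->cwnd ? s->rwnd : s->cwnd;
      uint32_t in_flight = s->snd_nxt - s->snd_una;
      return in_flight < window ? window - in_flight : 0;
  }

Note how flow control (rwnd) and congestion control (cwnd) are enforced by the same min() -- this is the "one mechanism" point above.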
When we have a late or duplicate Ack the sender enters "congestion avoidance" mode, which linearly approaches the expected congestion bandwidth.

SLOW START

We don't want to cause congestion collapse when starting a connection. We have no estimate of the available bandwidth, and this limits how much data we can safely transmit. At the same time, a linear approach to the congestion bandwidth might take a very long time. So slow start does an exponential approach, a backoff when congestion is experienced, and then a linear approach.

- congestion window (cwnd) = two segments (twice the MSS advertised by the receiver)
- transmit min(cwnd, receiver's advertised window)
- for each Ack, increase cwnd by one segment [this doubles the window size per round-trip time]

When an Ack is too late, we know that congestion has occurred and that cwnd was increased too far. The "slow start threshold" (ssthresh) is set to half the cwnd. We resume slow start from the initial window until ssthresh is reached, then we enter congestion avoidance mode.

CONGESTION AVOIDANCE

Increment cwnd by one segment per round-trip time, which gives a linear approach to the suspected congestion bandwidth. If an Ack is late, reduce cwnd to one segment and re-enter slow start. An improvement is to reduce cwnd to ssthresh instead of immediately dropping all the way to one segment.

OUT OF ORDER SEGMENTS (RENO)

This algorithm is very sensitive to re-ordered packets. So rather than immediately lower cwnd on an out-of-order Ack we wait. Hopefully we will get an Ack from the receiver which incorporates the late packet. When the third duplicate Ack arrives we know the late packet has been lost: retransmit it and enter congestion avoidance.

HOST BUFFER SIZING

Both the sender and receiver need to buffer data. The sender keeps un-acked data; the receiver keeps received data which is yet to be delivered to the application. Both buffer sizes work out to be the bandwidth-delay product (BDP), with the sender's buffer being critical. Note that the worst case of the BDP is easy to compute, since the bandwidth of the interface is obviously the maximum bandwidth, and a guess at the maximum delay can be verified using ping.

PART 2. OPERATING SYSTEMS

BUFFER SIZING IN LINUX

Linux makes one change to the TCP algorithm: it caches ssthresh between connections. For testing this can be disabled with

  net.ipv4.tcp_no_metrics_save = 1

The kernel attempts to autotune the buffer sizes, up to 4MB. If a quick calculation shows an expected BDP of less than 4MB you need do nothing. This is the case for ADSL and 802.11g in Australia. For GbE interfaces you need to raise

  net.ipv4.tcp_rmem
  net.ipv4.tcp_wmem

These are vectors of actual kernel memory usage, not packet counts, so inflate the BDP to cover the kernel's data structures. (A worked example follows at the end of this part.) You need to watch the initial value, as it can be used to leverage a DoS attack if set high.

APPLICATION BUFFER SIZING

An application can use

  setsockopt(..., SO_SNDBUF, ...)
  setsockopt(..., SO_RCVBUF, ...)

to explicitly set buffer sizes. The requested value is trimmed to

  net.core.rmem_max
  net.core.wmem_max

Using these system calls disables autotuning for the connection. So iperf, which always sets a buffer size, doesn't reflect Linux networking performance. (A sketch follows at the end of this part.)

DISTRIBUTIONS

Some distributions disable useful TCP features; turn them back on.

  net.ipv4.tcp_timestamps = 1
  net.ipv4.tcp_window_scaling = 1
  net.ipv4.tcp_sack = 1
  net.ipv4.tcp_ecn = 1
  net.ipv4.tcp_syncookies = 1
  net.ipv4.tcp_moderate_rcvbuf = 1
  net.ipv4.tcp_adv_win_scale = 7

ECN and window scaling expose bugs in some network equipment.
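The worked example promised above. The path figures are illustrative: a 1Gbps path with a 50ms round-trip time has a BDP of 1e9/8 x 0.05, about 6.25MB, above the 4MB autotuning ceiling, so the maxima must be raised. The sysctls are real; the values are suggestions only, inflated to 16MB to leave headroom for kernel data structures.

  # Illustrative sysctl.conf entries for a ~6MB BDP path
  # (1Gbps x 50ms RTT), inflated for kernel overheads.
  net.core.rmem_max = 16777216
  net.core.wmem_max = 16777216
  # min, default, max (bytes). Keep the default modest --
  # a large initial value can be leveraged for a DoS.
  net.ipv4.tcp_rmem = 4096 87380 16777216
  net.ipv4.tcp_wmem = 4096 65536 16777216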
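And the promised sketch of an application setting its own buffer sizes, as iperf does. The helper name is invented and error handling is minimal; remember that making these calls disables autotuning for the connection.

  /* Sketch: explicitly size socket buffers (disables autotuning). */
  #include <sys/socket.h>

  int set_buffers(int fd, int bytes)
  {
      /* The kernel trims each request to net.core.{wmem,rmem}_max. */
      if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bytes, sizeof(bytes)) < 0)
          return -1;
      if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) < 0)
          return -1;
      return 0;
  }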
TCP ALGORITHM VARIATIONS

There are a lot of modifications to the basic TCP algorithm. Unfortunately, there is also a large degree of academic hype, since there is a lot of competition among academics to have their algorithm chosen as the next-generation TCP algorithm.

The algorithms try to achieve:
- shorter slow start time
- less "hunting" around the congestion bandwidth
- greater resilience to non-congestion loss

Linux has pluggable TCP kernel modules. The interesting ones are:
- CUBIC, the Linux default, which features a nice slow start algorithm
- Westwood+, for wireless networks with link losses
- Hamilton TCP, nice fairness (fairness is important if you want multiple connections with similar performance)

Now for the bad news. The TCP variant is set by the *sender* and can't be negotiated. So for the typical case of a web server sending to a WLAN ADSL router it is the web server which needs to use Westwood for there to be any effect.

MTU -- MAXIMUM TRANSMISSION UNIT

The largest packet size passed by a path, measured in bytes of link layer payload. The larger the packet, the less per-packet overhead for the operating system. This was the original motivation for an MTU of 9000 bytes for gigabit ethernet jumbo frames and 64KB for 10GE super jumbo frames. There is also a theoretical consideration above 1Gbps: TCP cannot fill a long link with an MTU of 1500 bytes.

LOW MEMORY

TCP and disk buffers both use low memory. If you run both hard then low memory fragments and the kernel dies. This happens all the time on our networked backup server, usually around the 12TB mark. There is finally work happening in the kernel to address this. The ability to force such work upon the vendors is a major reason for paying for support, so you might want to read your support contract through that lens. Contracts which list events seem to leave a lot of wriggle room for network performance.

An easy way to get more low memory is to run a 64-bit kernel. 2.6.24 has merged anti-fragmentation patches.

VIRTUALISATION

Don't.

IPTABLES

One of the major advances in TCP throughput came when Van Jacobson recognised that buffer copying was consuming a fair part of the time taken to process a packet. Unfortunately iptables does a fair bit of buffer copying. Turn it off if you dare. But note that compromised Linux machines are the major source of attacks against network infrastructure and vendors (ie, the nasty people steal accounts from Windows servers, but then prefer the stability of Linux for doing their evil work).

INSTRUMENTATION

WIRESHARK

A packet sniffer, and a good one. Use a passive optical tap if you are worried about the effect of running Wireshark on the platform you are monitoring.

KERNEL NETLINK TCP STATE CHANGES

A new API.

WEB100

Instruments everything. Some very nice GUI tools for pointing the finger at the component which is slowing things down. Web100 servers. A per-connection snapshot can also be had from a stock kernel, as sketched below.
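An aside for the notes, not from the slides: a stock Linux kernel already exposes per-connection TCP state through the TCP_INFO socket option, which is handy when Web100 isn't available. A minimal sketch:

  /* Sketch: query the kernel's per-connection TCP state. */
  #include <stdio.h>
  #include <netinet/in.h>
  #include <netinet/tcp.h>
  #include <sys/socket.h>

  void print_tcp_state(int fd)
  {
      struct tcp_info ti;
      socklen_t len = sizeof(ti);

      /* Reports cwnd, ssthresh and the RTT estimate discussed in Part 1. */
      if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) == 0)
          printf("cwnd=%u segs ssthresh=%u rtt=%uus rttvar=%uus\n",
                 ti.tcpi_snd_cwnd, ti.tcpi_snd_ssthresh,
                 ti.tcpi_rtt, ti.tcpi_rttvar);
  }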
PART 3. APPLICATIONS

SSH

Ssh implements its own windowing, and that window size has been inadequate. This has been fixed in recent ssh.

Ssh places a lot of load on a machine by encrypting the payload. Most science data is sensor white noise, and the file transfer is to a supercomputer which can find the small signal in the output of this maxed-out sensor. Encryption isn't necessary; it has already been done by nature.

LATENCY

We can always improve bandwidth, by replacing active components. Improving latency requires new fibre paths. Creating a more direct path takes a lot of time and effort and may not be financially possible. If there is already a direct-ish path then the speed of light in fiber improves by about 10% per decade.

Applications are incredibly wasteful of round-trips, and thus consume heaps of latency, which is the fundamentally limited resource. Examples:
- CSS stylesheets
- HTML images
- extensive negotiations at protocol start

NFS AND DELAYED ACKS

Every second packet triggers an Ack. But an 8KB disk block occupies three packets, so every disk block's final packet waits out the receiver's delayed-Ack timer.

PART 4. NETWORKS

LOSS

Loss is an indicator of congestion. But in some media it is also a fact of life. Reduce loss to the minimum possible:
- use wired networks rather than wireless
- record CRC errors on interfaces
- use ethernet auto-negotiation, as manual configuration is widely misunderstood

FIREWALLS

The claims of firewall vendors are slightly more believable than those of anti-virus scanners. Most firewalls are PCs running Linux, so they simply move the problem to a less transparent platform.

Firewalls traditionally have TCP-affecting bugs:
- SACK corruption in PIX (Cisco have fixed this; Ilpo Järvinen's fix in 2.6.24 should stop it locking up all connections)
- discarding packets with ECN

So firewall code needs to be kept up to date, but network security officers are oddly resistant to keeping their firewall code up to date.

ASYMMETRIC PATHS

- Timer effects.
- Congestion of the returning Acks -- this often happens on ADSL links. If you run ADSL2+ at maximum rate you want Annex M to increase the capacity available for Acks.

SOURCES OF NON-SIGNALLING PACKET LOSS

PATHS

Some paths are bad news:
- satellite
- undersea cable, older is worse
- DWDM -- there's a fundamental opposition between bandwidth and loss. We want less bandwidth and less loss; the network operator gets paid for bandwidth, so they want more bandwidth, whatever the effect on goodput.

PART 5. HARDWARE

VALIDATION

Largely a myth. There is a lot of faulty hardware out there. It is difficult even to select hardware which should work well:
- eg, which ethernet cards support NAPI, jumbos, checksumming, TSO, LRO?
- and do you choose the ethernet card, or do you live with the motherboard manufacturer's choice (ie, the cheapest)?

TCP OFFLOAD

There's a certain point in the hardware development cycle where it makes sense; outside of that temporary situation it is a nightmare. Imagine yourself as the proud owner of a 1Gbps TCP offload card today.

10GE equipment should support a 64KB MTU. There are no standard jumbo frames, so be careful when you buy.

PART 6. LINUX AS A ROUTER

ROUTER BUFFER SIZING