Environment

  • Red Hat Enterprise Linux
  • Network connection with known performance characteristics, such as a WAN connection with high throughput and high latency (though it does not necessarily have to be a WAN)
  • TCP networking

Issue

  • We are transmitting a large amount of data over a WAN link and wish to tune the transfer to be as fast as possible
  • We are trying to backhaul traffic from one datacenter to another and appear to be running into small TCP window sizes that are impacting the transmission speed
  • We are looking for sysctl parameters which can be tuned to allow better performance over our network
  • How do I accurately tune TCP socket buffer sizes?

Resolution

What you are looking to calculate is called the ​​Bandwidth Delay Product​​ (BDP). This is the product of a link's capacity and its round-trip latency, i.e. how many bits can actually be in flight on the wire at any given time.
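As a sketch, the BDP for a hypothetical link can be computed like this; the 100 Mbit/s capacity and 70 ms round-trip time are example assumptions, so substitute your own measured values:

```shell
# Bandwidth Delay Product for an assumed 100 Mbit/s link with 70 ms RTT.
BANDWIDTH_BITS=100000000   # link capacity in bits per second (assumption)
RTT_SECONDS=0.070          # round-trip time in seconds (assumption)

# BDP in bytes = capacity (bits/s) * RTT (s) / 8 bits per byte
BDP_BYTES=$(awk -v bw="$BANDWIDTH_BITS" -v rtt="$RTT_SECONDS" \
    'BEGIN { printf "%d", bw * rtt / 8 }')
echo "BDP: $BDP_BYTES bytes"
# prints: BDP: 875000 bytes
```

Round-trip time can be measured with ​​ping​​ to the remote host; link capacity is whatever your provider or interface reports.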

Socket Buffers

Once this is calculated, you should then tune your network buffers to accommodate this amount of traffic, plus some extra. Tune write buffers on the sender (​​net.core.wmem_max​​, ​​net.core.wmem_default​​, ​​net.ipv4.tcp_wmem​​, ​​net.ipv4.tcp_mem​​) and read buffers on the receiver (​​net.core.rmem_max​​, ​​net.core.rmem_default​​, ​​net.ipv4.tcp_rmem​​, ​​net.ipv4.tcp_mem​​).

We make these buffer changes by entering lines into ​​/etc/sysctl.conf​​ file and running:

​Raw​

[root@host]# sysctl -p

You may also choose to make the change temporarily by running:

​Raw​

[root@host]# sysctl -w key.name=value

For example:

​Raw​

[root@host]# sysctl -w net.core.rmem_max=16777216

To prepare for this, we will increase the maximum amount of memory available for network sockets:

​Raw​

net.core.rmem_max = 16777216
net.core.wmem_max = 16777216

net.ipv4.tcp_mem = 8388608 12582912 16777216
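Before changing anything, you may wish to record what the kernel is currently using. This is a read-only check (​​sysctl​​ accepts multiple variable names in one invocation):

```shell
# Query the current core socket buffer maximums and TCP memory limits
sysctl net.core.rmem_max net.core.wmem_max net.ipv4.tcp_mem
```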

We will also need to make sure that TCP Window Scaling is on. Run the command:

​Raw​

[root@host]# sysctl -a | grep window_scaling

and ensure the returned result is:

​Raw​

[root@host]# sysctl -a | grep window_scaling
net.ipv4.tcp_window_scaling = 1

If window scaling is set to 0, change it to 1.
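Window scaling can be turned on at runtime with ​​sysctl -w​​, and made persistent via ​​/etc/sysctl.conf​​ as shown above:

```shell
# Enable TCP window scaling immediately (requires root)
sysctl -w net.ipv4.tcp_window_scaling=1
```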

Your action here will be to tune socket buffers. After you make each change, test the speed to a client and record the result.

The tunables here are:

​Raw​

net.ipv4.tcp_rmem = 8192 X 4194304
net.ipv4.tcp_wmem = 8192 Y 4194304

The first value is the smallest buffer size. We recommend keeping this at 8192 bytes so that it can hold two memory pages of data.

The last value is the largest buffer size. You could probably safely leave this at 4194304 (4 MB).

The middle values X and Y are the default buffer sizes. These are the most important values. You might wish to start at 524288 (512 KB) and move up from there. You will generally wish to try small increments of your Bandwidth Delay Product: try BDP x1, then BDP x1.25, then BDP x1.5, and so on. Once you start to get increased speeds, you may wish to refine your testing with smaller steps, for example BDP x2.5 then BDP x2.6 and so on. It is unlikely you will need a value larger than BDP x5.
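The sequence of candidate default buffer sizes can be generated from a measured BDP. The 875000-byte BDP below is an assumed example value (a 100 Mbit/s link with 70 ms RTT); substitute your own:

```shell
# Print candidate default buffer sizes as multiples of the BDP
BDP=875000   # assumed example BDP in bytes (assumption)
for MULT in 1 1.25 1.5 1.75 2; do
    awk -v bdp="$BDP" -v m="$MULT" \
        'BEGIN { printf "BDP x%-4s = %d bytes\n", m, bdp * m }'
done
```

Each printed value is a candidate for the middle field of ​​net.ipv4.tcp_rmem​​ / ​​net.ipv4.tcp_wmem​​ in one test run.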

Unfortunately there's no "silver bullet" buffer value which is perfect, or can be calculated definitely. Each individual link will be different and requires individual testing to attain the best throughput.

We expect you will come up with a table of results something like:

​Raw​

-----------------------------------------------------------
Buffer | BDP*1.5   | BDP*1.75  | BDP*2     | ... and so on
Size   | (1312500) | (1531250) | (1750000) | ...
-----------------------------------------------------------
Client | A kbps    | B kbps    | C kbps    | ...
Speed  |           |           |           | ...
-----------------------------------------------------------

This will help you to see the best buffer size for the best client speed. You would then set this buffer permanently.

TCP Window

If you run a packet capture at the same time, you'll see the TCP window size grow. The window size will never reach the maximum value of your buffers, as the kernel reserves part of the buffer space as overhead for its own bookkeeping while automatically tuning the window.
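A minimal capture sketch, assuming the transfer runs over port 5001 on interface eth0 (both are placeholder assumptions; adjust to your environment):

```shell
# Capture the transfer's TCP traffic to a file for later window-size
# inspection in a tool such as wireshark (requires root)
tcpdump -i eth0 -n -w transfer.pcap 'tcp port 5001'
```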

TCP does not use the full window size as soon as a new connection is established. Rather, TCP uses a "slow start" algorithm which gradually increases the amount of data sent over the life of the connection. The purpose of this gradual increase is to allow TCP to determine how much data can be sent over the network without dropping packets.

By default, TCP resets this calculation (called the congestion window) back to its initial value after a period of idle time equal to the retransmission timeout between the two hosts (i.e. roughly double the one-way network latency). This can significantly reduce the transfer data rate of a connection if there are idle periods of application processing.

If the application alternates between a "data transmit" period and a "processing" period during which no data is sent, and the processing period is greater than double the WAN latency, there may be some advantage to disabling the slow start algorithm for established connections. This is done by changing the ​​net.ipv4.tcp_slow_start_after_idle​​ tunable:

​Raw​

net.ipv4.tcp_slow_start_after_idle = 0

Testing

If you're using file copies as tests, ensure you drop caches between tests (​​echo 3 > /proc/sys/vm/drop_caches​​). We would suggest using direct I/O (such as ​​dd oflag=direct​​) or forcing writes to disk (such as ​​dd conv=fsync​​) so that cached data does not produce artificial test results.
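A sketch of a single test iteration, assuming a test file at /data/testfile.bin and a reachable host named remote-host (both hypothetical names; substitute your own):

```shell
# Flush dirty pages and drop the page cache, then time an uncached
# transfer so cached data cannot inflate the measured speed (requires root)
sync
echo 3 > /proc/sys/vm/drop_caches
time scp /data/testfile.bin user@remote-host:/tmp/
```

Repeat this after each buffer change and record the elapsed time in your results table.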

You are always better off testing with your actual production workload, or a simulation of the production workload crafted with a tool such as ​​iozone​​. Tuning a system for artificial bulk-transfer benchmarks when your application sends small amounts of data and requires low latency (or vice versa) will only result in incorrect tuning and will hurt overall application performance.