Why upgrading your Linux Kernel will make your customers much happier

about 12 years ago

Sometimes we hear that crazy developer talk about some magical thing you can do that will increase performance everywhere by 30% (feel free to replace that percentage with whatever sits right for you).

In the past week or so I have been playing the role of “that guy”. The ranting lunatic. Sometimes this crazy guy throws all sorts of other terms around that make him sound even crazier. Like, say: TCP or Slow Start or Latency … and so on.

So we ignore that guy. He is clearly crazy.

##Why the web is slow?

Turns out that when it comes to a fast web, the odds were always stacked against us. The root of the problem is TCP, and in particular the congestion control mechanism we all use: Slow Start. It happens to be very problematic in the context of the web and HTTP.

Whenever I download a web page from a site there is a series of underlying events that need to happen:

A connection needs to be established to the web server (1 round trip).
A request needs to be transmitted to the server.
The server needs to send us the data.

Simple.

I am able to download stuff at about 1 Meg a second. It follows that if I need to download a 30k web page I only need two round trips: the first to establish a connection, and the second to ask for the data and get it. Since my connection is SO fast I can grab the data at lightning speed, even if my latency is bad.

My round trip to New York (from Sydney, Australia) takes about 310ms (give or take a few ms).

Pinging stackoverflow.com [64.34.119.12] with 32 bytes of data:
Reply from 64.34.119.12: bytes=32 time=316ms TTL=43

It may get a bit faster as routers are upgraded and new fibre is laid, but it is ultimately governed by the speed of light. Sydney to New York is 15,988 km. The speed of light is approximately 299,792 km per second. So the shortest time in which I could possibly reach New York and back is about 106ms. At least until superluminal communication becomes reality.
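
The lower bound above is easy to check. The following is a rough sketch using only the distance and speed quoted in this post; the function name `min_round_trip_ms` is made up for illustration, and the calculation assumes a straight-line path in vacuum, which no real cable follows.

```python
# Rough lower bound on round-trip time imposed by the speed of light.
SYDNEY_TO_NEW_YORK_KM = 15_988      # distance quoted in the post
SPEED_OF_LIGHT_KM_PER_S = 299_792   # approximate speed of light in vacuum


def min_round_trip_ms(distance_km: float,
                      speed_km_per_s: float = SPEED_OF_LIGHT_KM_PER_S) -> float:
    """Return the fastest possible there-and-back time in milliseconds."""
    return 2 * distance_km / speed_km_per_s * 1000


if __name__ == "__main__":
    # Lands between 106 and 107 ms, in line with the figure quoted above.
    print(f"{min_round_trip_ms(SYDNEY_TO_NEW_YORK_KM):.1f} ms")
```

Passing a lower `speed_km_per_s` gives the bound for a slower medium.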

Back to reality, two round trips to grab a 30k page is not that bad. However, once you start measuring … the results do not agree with the unsound theory.

reality

The reality is that downloading 34k of data often takes upwards of a second. What is going on? Am I on dial-up? Is my Internet broken? Is Australia broken?

Nope.

The reality is that to reach my maximal transfer speed, TCP needs to ramp up the number of segments that are allowed to be in transit, a.k.a. the congestion window. RFC 5681 says that once a connection starts up you are allowed a maximum of 4 segments initially in transit and unacknowledged. Once they are acknowledged the window grows exponentially. In general the initial congestion window (IW) on Linux and Windows is set to 2 or 3, depending on various factors. The algorithm used to adjust the congestion window may also differ (Vegas vs CUBIC etc.), but it usually follows a pattern of exponential growth, compensating for certain factors.

Say you have an initial congestion window of 2 and you can fit 1452 bytes of data in a segment. Assuming you have an established connection, infinite bandwidth and 0% packet loss, it takes:

1 round trip to get 2904 bytes, Initial Window (IW) = 2
2 round trips to get 8712 bytes, Congestion Window (CW) = 4
3 round trips to get 20328 bytes, CW = 8
4 round trips to get 43560 bytes, CW = 16

In reality we do get packet loss, and we sometimes only send ACKs for pairs of segments, so the real numbers may be worse.

Transferring 34k of data from NY to Sydney takes 4 round trips with an initial window of 2, which explains the image above. It makes sense that I would be waiting over a second for 34k.
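
The idealised numbers in the list above can be reproduced with a short simulation. This is a minimal sketch, assuming the window simply doubles every round trip with no loss and no delayed ACKs; `slow_start_table` is an illustrative name, not part of any TCP stack.

```python
SEGMENT_BYTES = 1452  # payload per segment, as assumed in the post


def slow_start_table(initial_window: int, rounds: int,
                     segment_bytes: int = SEGMENT_BYTES):
    """Return (round trip, window, cumulative bytes) tuples.

    Idealised model: the congestion window doubles after every round
    trip, bandwidth is unlimited and nothing is ever lost.
    """
    rows = []
    window = initial_window
    total = 0
    for round_trip in range(1, rounds + 1):
        total += window * segment_bytes   # everything in flight arrives
        rows.append((round_trip, window, total))
        window *= 2                       # exponential growth phase
    return rows


if __name__ == "__main__":
    for round_trip, window, total in slow_start_table(initial_window=2, rounds=4):
        print(f"{round_trip} round trip(s): window = {window}, {total} bytes so far")
```

Changing `initial_window` shows how strongly the starting value affects how much data has arrived after each round trip.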

You may think that Http Keepalive helps a lot, but it does not. The congestion window is reset to the initial value quite aggressively.

TCP Slow Start is there to protect us from a flooded Internet. However, all the parameters were defined tens of years ago in a totally different context. Way before broadband and HTTP were pervasive.

Recently, Google have been pushing a change that would allow us to increase this number to 10. This change is going to be ratified. How do I know? There are 3 reasons.

Google and Microsoft already implemented it on their servers.
More importantly, the Linux Kernel has adopted it.
Google needs this change ratified if SPDY is to be successful.

This change drastically cuts down the number of round trips you need to transfer data:

1 round trip to get 14520 bytes, IW = 10
2 round trips to get 43560 bytes, CW = 20

In concrete terms, the same page that took 1.3 seconds to download could take 650ms to download. Furthermore, we will have a much larger amount of useful data after the first round trip.

That is not the only issue that makes the web slow. SPDY tries to solve some of the others, such as poor connection utilization, the inability to perform multiple requests concurrently over a single connection (like HTTP pipelining without FIFO ordering) and so on.
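
A rough way to see where the halving comes from is to count data round trips for a 34k page under both initial windows. The sketch below assumes the same idealised doubling model as before and a 325ms round trip (an assumed value close to the Sydney to New York ping shown earlier); `round_trips_needed` is an illustrative helper, and connection setup is not counted.

```python
def round_trips_needed(page_bytes: int, initial_window: int,
                       segment_bytes: int = 1452) -> int:
    """Count data round trips until page_bytes have been delivered.

    Assumes the window doubles each round trip and nothing is lost.
    """
    window = initial_window
    delivered = 0
    trips = 0
    while delivered < page_bytes:
        delivered += window * segment_bytes
        window *= 2
        trips += 1
    return trips


PAGE_BYTES = 34 * 1024   # the 34k page discussed in the post
RTT_MS = 325             # assumption: roughly Sydney <-> New York

if __name__ == "__main__":
    for iw in (2, 10):
        trips = round_trips_needed(PAGE_BYTES, iw)
        print(f"IW = {iw}: {trips} round trips, about {trips * RTT_MS} ms")
```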

Unfortunately, even if SPDY is adopted we are still going to be stuck with 2 round trips for a single page. In some theoretic magical world we could get page transfer over SCTP which would allow us to cut down on a connection round trip (and probably introduce another 99 problems).

Show me some pretty pictures

Enough with theory. I went ahead and set up a small demonstration of this phenomenon.

I host my blog on a VM. I updated this VM to the 3.2.0 Linux kernel using Debian backports. I happen to have a second VM running on the same metal, which runs a Windows Server release.

I created a simple web page that allows me to simulate the effect of round trips:

<!DOCTYPE html>
<html>
  <head>
      <title>35k Page</title>
      <style type="text/css">
        div {display: block; width: 7px; height: 12px; background-color: #aaa; float: left; border-bottom: 14px solid #ddd;}
        div.cp {background-color:#777;clear:both;}
      </style>
  </head>
  <body><div class='cp'></div><div></div><div></div><div></div><div></div><div></div><div></div><div></div><div></div><div></div><div></div><div></div><div></div><div></div><div></div><div></div><div></div><div></div><div></div><div></div><div></div><div></div><div></div><div> ... etc

The cp class is repeated approximately every 1452 bytes to help approximate segments.

Then I used the awesome webpagetest.org to test for downloading the page. The results speak louder than anything else I wrote here (you can view it here):

starting

Both of the requests start off the same way: 1 round trip to set up the TCP connection and a second before any data appears. This places us at T+700ms. Then things diverge. The fast Linux kernel (top row) is able to get significantly more data through in the first pass. The Windows box delivers 2 rows (which is approximately 2 segments), the Linux one about 6.

continue

At 1.1 seconds the Windows box catches up temporarily but then 1.3 seconds in the Linux box delivers the second chunk of packets.

done

At 1.5 seconds the Linux box is done and the Windows box is only half way through.

done2

At 1.9 seconds the Windows box is done.

Translated into figures, the Linux box with the IW of 10 is 21 percent faster when you look at total time. If you discount the connection round trip it is about 25 percent faster. All of this without a single change to the application.
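
The percentages can be rechecked from the timings in the screenshots. This is a small sketch; the 350ms connection round trip is an assumption (half of the T+700ms mark mentioned earlier), and `percent_faster` is an illustrative helper.

```python
def percent_faster(slow_seconds: float, fast_seconds: float) -> float:
    """Time saved by the faster run, as a percentage of the slower run."""
    return (slow_seconds - fast_seconds) / slow_seconds * 100


WINDOWS_DONE_S = 1.9      # from the timeline above
LINUX_DONE_S = 1.5        # from the timeline above
CONNECTION_RTT_S = 0.35   # assumption: half of the T+700ms starting point

if __name__ == "__main__":
    total = percent_faster(WINDOWS_DONE_S, LINUX_DONE_S)
    after_connect = percent_faster(WINDOWS_DONE_S - CONNECTION_RTT_S,
                                   LINUX_DONE_S - CONNECTION_RTT_S)
    print(f"total time: {total:.0f} percent faster")
    print(f"excluding connection setup: {after_connect:.0f} percent faster")
```

With these inputs the first figure rounds to 21 percent and the second lands in the mid twenties, in the same ballpark as the numbers quoted above.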

What does this mean to me?

Enterprise Linux distributions are slow to adopt the latest stable kernel. Enterprisey software likes to play it safe, for very good reasons. SUSE enterprise is possibly the first enterprise distro to ship a 3.0 kernel. Debian, CentOS, Red Hat and so on are all still on 2.6 kernels. This leaves you with a few options:

Play the waiting game, wait till your enterprise distro backports the changes into the 2.6 line or upgrades to a 3.0 Kernel.
Install a backported kernel.
Install a separate machine running say nginx and a 3.0 kernel and have it proxy your web traffic.

What about Windows?

Provided you are on Windows 2008 R2 and have the hotfix from the Microsoft Support article “You cannot customize some TCP configurations by using the netsh command in Windows Server 2008 R2” installed, you can update your initial congestion window using the commands:

c:\> netsh interface tcp set supplemental template=custom icw=10
c:\> netsh interface tcp set supplemental template=custom

See Andy’s blog for detailed instructions.

##Summary

At Stack Overflow, we see very global trends in traffic, with huge amounts of visits coming from India, China and Australia - places that are geographically very far from New York. We need to cut down on round trips if we want to perform well.

Sure, CDNs help, but our core dynamic content is yet to be CDN accelerated. This simple change can give us a 20-30% performance edge.

Every 100ms latency you have costs you 1% of your sales. Here is a free way to get a very significant speed boost.

Posted by: Sam

Comments

Buffer_Bloat about 12 years ago

BufferBloat is a bigger factor (see Jim Gettys work on the subject)

Rob_Mueller about 12 years ago

You may think that Http Keepalive helps a lot, but it does not. The congestion window is reset to the initial value quite aggressively.

Linux has the option net.ipv4.tcp_slow_start_after_idle which can control this. Normally Linux resets to slow start settings after 3 seconds of idle time. As noted, that hurts performance on most keepalive connections, so it really helps to turn it off. We have:

net.ipv4.tcp_slow_start_after_idle = 0

In our /etc/sysctl.conf on our servers. This is especially helpful if your visitors tend to view many pages on your site rather than just one or two and you can afford many long lived keepalive connections.

http://blog.fastmail.fm/2011/06/28/http-keep-alive-connection-timeouts/
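
To check what a given Linux box is currently doing, the setting can be read back from procfs. This is a minimal sketch: the path is the standard procfs location for the sysctl named above, while the helper name `slow_start_after_idle_enabled` is made up for illustration.

```python
from pathlib import Path

# procfs view of the net.ipv4.tcp_slow_start_after_idle sysctl (Linux only)
SLOW_START_AFTER_IDLE = Path("/proc/sys/net/ipv4/tcp_slow_start_after_idle")


def slow_start_after_idle_enabled(path: Path = SLOW_START_AFTER_IDLE) -> bool:
    """Return True if idle connections fall back to slow start."""
    return path.read_text().strip() != "0"


if __name__ == "__main__":
    if SLOW_START_AFTER_IDLE.exists():
        state = "on" if slow_start_after_idle_enabled() else "off"
        print(f"slow start after idle is {state}")
    else:
        print("sysctl not available on this system")
```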

Sam Saffron about 12 years ago

excellent point, though setting it to never slow start open connections may be a bit aggressive

Frank_Denis about 12 years ago

If you're running OpenBSD, you can achieve the same result with this patch: http://download.pureftpd.org/misc/OpenBSD/patches/increase-initcwnd.patch

Sam Saffron about 12 years ago

thank you for the link!

Herv_Commowick about 12 years ago

Why are you talking about upgrading the kernel, when you can simply do :

ip route change default via MYGATEWAY dev MYDEVICE initcwnd 10

Sam Saffron about 12 years ago

my understanding was that the route trick is problematic, see: Linux-Kernel Archive: Re: Raise initial congestion window size / speedup slow start?

Ahmet_Alp_Balkan about 12 years ago

I wonder if Apple implemented this in OS X Mach kernel.

Andrew_Rowson about 12 years ago

Are the benefits of this only realised on the internet-facing component? E.g. if you've got a farm of 10 IIS boxes behind a couple of HaProxy instances, is it pointless implementing this on the Windows boxes, given they'll never directly receive a request from the net? Or are there benefits to be had within the datacenter as well?

Nanne about 12 years ago

@andrew: I suppose it's of lesser importance as the round trip times are quite low if the boxes are next to each other? You save a couple of round trips if I read it correctly, which isn't that much if it's internal to a datacentre.

Dynamike about 12 years ago

It looks like the initial congestion window commit was placed in 2.6.39 (http://kernelnewbies.org/Linux_2_6_39). The 3.2 kernel (http://kernelnewbies.org/Linux_3.2) that you tested with also implemented TCP Proportional Rate Reduction, which improves fast recovery when there's packet loss.

It seems like both changes could be helping out your test, but it's hard to say since the other instance you are running is windows.

Sam Saffron about 12 years ago

my understanding was that 2.6.39 was very short lived and quickly became 3

Tk about 12 years ago

In OS X all you need to do is set a sysctl setting. No need for even a restart. Check out net.inet.tcp.slowstart_flightsize (set it to 10) if you are interested.

Sam Saffron about 12 years ago

in bsd you also need to disable net.inet.tcp.rfc3390

Mateus about 12 years ago

The real problem is that since you are sitting in Australia, all transmitted bits must be turned upside-down, so your browser can understand them.

Sam Saffron about 12 years ago

ʎʃǝʇnʃosqɐ

Thomas_Kj_R about 12 years ago

The speed of light is 300kkm/s in vacuum. In fiber it's only 2/3 of that, namely 200kkm/s.

Ollie_Jones about 12 years ago

I sure wish we could hear an analysis of this change from Van Jacobson or the other designers of slow-start. Slow-start's purpose is to get TCP endpoints to restrain their traffic voluntarily to avoid congesting various routers in the network.

Another commenter pointed out that buffer bloat is also a huge factor. This may be especially true in the equipment that drives long oceanic fiber-optic lines. Is jacking up the initial window just going to make the buffer bloat problem worse?

Google, Microsoft and Stack Overflow have access to top-notch traffic engineers to sort all this out. I know that I don't have that access. It's possible that this change, widely and indiscriminately applied, might actually slow things down. So, for my part, I'm going to hold off until it's approved by the IETF.

Sam Saffron about 12 years ago

For Stack Overflow’s case we have a very non-abusive web page structure, effectively the only traffic served from NY to our customers is the page payload, the rest of the supporting resources come from a CDN.

The main objection to IW10 is abusive web page structuring that will do pretty much every trick in the book to open as many connections possible from the client to a single server. (domain sharding being a really nasty one imho, image sprites are a much more sane approach)

It is possible that in some cases the change will cause harm, however the cat is already out of the bag.

Tim_Post about 12 years ago

Just a note, it is trivial to build a kernel against your existing configuration and get it into place, often without conflicting with your package manager. Build, install, then hold kernel packages back (easy to do with apt/rpm). When your distro finally ships the kernel you want, don't hold the kernel packages back any longer.

The 'funk' case here (no matter which way you go) can be security updates, if another update depends on a kernel fix, however that seldom happens. Still, all you have to do is back up your grub config, let it apply the update, then restore your grub config.

In my not so humble opinion, EL distros stay a little too far back, especially with interpreted languages. I agree with having a buffer to ease deprecation, but I feel the gap is too big.

Nick_Storm about 12 years ago

Good article! That was something I never knew about, and now I feel like doing some more research. Thanks

Guizzmo about 12 years ago

Just an offside comment: even if quantum teleportation kicks in someday, the maximum speed of communication remains the speed of light.

I know a lot of people are confused about the word “teleportation” and believe it to be instantaneous. Even if it is somehow true, no information can be transmitted faster than light in vacuum. Source: http://en.wikipedia.org/wiki/Faster-than-light#Quantum_mechanics

Robin about 12 years ago

@Guizzmo but Quantum Teleportation could, as I understand it, significantly increase (double?) the amount of information that could be communicated per round-trip.

Andy_Davies about 12 years ago

If you need to know how to do it on Windows: http://www.andysnotebook.com/2011/11/increasing-the-tcp-initial-congestion-window-on-windows-2008-server-r2.html

Mike_Kale about 12 years ago

Upside down bits notwithstanding, you do have a higher latency connection than many folks and the benefits may be lower for others. Still, sounds like a reasonable change.

Pyry about 12 years ago

We have been using the already mentioned:

ip route change default via MYGATEWAY dev MYDEVICE initcwnd 10

We haven't seen any issues with it, but we have also increased the tcp_wmem[1] value from 16k to 64k. I didn't do any tests with 16k. Also, due to a bug, “ip route initcwnd” only works for kernels >= 2.6.30

Dave_T_Ht about 12 years ago

bufferbloat, which has some major fixes for it in the upcoming linux 3.3 kernel, is mostly a problem on congested networks.

Some info on bufferbloat

http://cacm.acm.org/magazines/2012/2/145415-bufferbloat-whats-wrong-with-the-internet/fulltext

For a different take on iw10 as to its effect on latency and response time, see:

http://tools.ietf.org/html/draft-gettys-iw10-considered-harmful-00

Sam Saffron about 12 years ago

Thanks for the links, interesting. In Stack Overflow’s case none of the concerns apply. The only round trip to NY for our pages is grabbing the HTML, rest of the content is shipped from CDNs.

Michael about 12 years ago

Not quantum teleportation, rather quantum non-locality and enabling super luminal communications.

Sam Saffron about 12 years ago

Thanks, amending my link to Faster-than-light - Wikipedia

Jonathan about 12 years ago

Is this resolved for Windows 7 and if not does a patch exist for that version of Windows?

Sam Saffron about 12 years ago

this is not something you would even consider on non Internet facing web servers.

All Windows releases to date use an IW of 2-3.

Ethan about 12 years ago

I would like to suggest another way to cut down wait time and improve performance: Websocket.

Diq about 12 years ago

How about you just deploy more POPs? Seems like the sensible thing to do from a lot of standpoints.

I don't think providing a better experience for a few million people in Australia is worth the dangers of blowing out buffers for everyone in the US and EU with quick RTTs. Even if you say Stack Overflow isn't running the risk of that, your pages are tiny, etc etc, you're publishing it on the Internet and people who don't understand the risks are going to look at this and say “Well Stack Overflow does it so I should too.”

I might be stretching here a little bit, but your posting of this is somewhat dangerous. Also, I guess you don't care about mobile users at all?

Sam Saffron about 12 years ago

I very much object to this, if you believe we have 3 stable releases out there of the Linux kernel that are going to break the Internet, you should raise this with the Linux kernel team on the mailing list.

Diq about 12 years ago

Well, the Linux kernel team is not the end-all be-all of what's good for everyone. Sometimes they do things for altruistic reasons. Sometimes they do things for selfish reasons. Sometimes they do things just to see what will happen. Let's be honest here. Your saying “it's in Linux it must be safe” is exactly what I mentioned earlier: “Stack Overflow is doing it, it must be good for what we're doing.”

I'm not alone in thinking that this isn't a great idea for all traffic. It might work OK for HTTP, but not everything is HTTP (contrary to Google's view of the world). These things need to be configurable and adaptive to different use cases. Can we store the metrics in the route cache and re-use it for existing connections? IWND of 4 for this /24, 10 for this /8, etc? I haven't seen any references to functionality like this.

Also, more geographically diverse POPs would obviate the need for this change and be beneficial for a lot of other reasons.

Sam Saffron about 12 years ago

I agree that an adaptive auto-tuning per IP approach may be helpful. I also can think of a few particular use cases where this may be problematic, domain sharding with HTTP being the most dangerous of them.

I also encourage people to test. Everything. Don’t make changes blindly to your stack without having a way to measure that the change helped things. We test everything, when we implement this change we will have reporting that shows the effect of it across all geographic regions. We sample performance on a percentage of our traffic. And yes, we do care about mobile as well.

IW10 cuts a 30k page to two round trips. This can be a material improvement even if a round trip is 40ms. People perceive delays larger than 100ms.

We already have a CDN which takes care of distributing our content globally, but it does not cover dynamic pages. An HTTP accelerator for dynamic pages is a complex and expensive thing to set up. See Akamai’s offerings.

Opinions are fine, but please, back them up with some empirical and scientific research that demonstrates where this is problematic and how problematic it becomes.

In the age of bittorrent the “IW is now 3-5 times bigger” issue is minute compared to having millions of people running a constant flood of packets saturating pretty much every buffer out there. The main traffic to lose out to-date has been HTTP.

I still object that you are raising this issue with me, it needs to be discussed on the Linux Kernel mailing list, there may be a ton of extra protections in place already for all I know. By discussing it there you will have more of a chance to make a positive and constructive contribution to this discussion.

Felipe about 12 years ago

Has anyone tried updating the kernel and doing some real-world tests ?

I upgraded our Linode's Kernel from 2.6 to 3.0.18 and ran webpagetest.org (before and after the upgrade).

I wasn't able to get even a 1% improvement, and I ran 5 tests (first load and repeat) each time so I could get a good average.

Sam Saffron about 12 years ago

2.6.39 introduced this change, so the kernel you upgraded from would have to be an earlier version for you to see a difference. Also, if you are testing really close to your geographic location it is possible you would not notice the change that much. Care to link to the tests on webpagetest? I would not mind running a few myself against your site.

Quintin about 12 years ago

Sam, I have a similar issue like Felipe. Question posted at serverfault

http://serverfault.com/questions/365975/linux-slow-start-changing-ip-route-does-not-have-any-effect-on-initial-window

I've also tagged all my slow start related questions under tcp-slow-start

Can you help?

Sam Saffron about 12 years ago

on holidays at the moment, will respond when I get back, strongly recommend you test this from a windows client.

Bryan_Livingston about 12 years ago

So I uploaded your test file to my windows static content server: http://mob0.com/35k%20Page.htm

And then benchmarked (via webpagetest and firebug) with original config and the window changed and received similar performance boost! About 25%! Confirmed.

Cbp about 12 years ago

The distance from Sydney to New York is much shorter if you go directly through the Earth's core. Neutrinos can travel straight through the Earth at almost the speed of light, so it seems that it may be theoretically possible to beat your figure of 106ms.

Jeff_Blaine about 12 years ago

Sam, RHEL 6.2 comes with TCP initial congestion window of 10. Since Dec 6, 2011.

Sam Saffron about 12 years ago

nice one, I did not know that

Sam Saffron over 10 years ago

Also there is the voodoo quantum entanglement which could possibly make stuff faster.

Sam Saffron