Testing 3 million hyperlinks, lessons learned
There are over 3 million distinct links in the Stack Exchange network. Over time many of these links rot and stop working.
Recently, I spent some time writing tools to determine which links are broken and assist the community in fixing them.
## How do we do it?
First things first, we try to be respectful of other people’s websites.
### Being a good web citizen
- Throttle requests per domain
We use the automatically expiring set below to ensure we do not hit a domain more than once every ten seconds. We make a handful of exceptions where we feel we need to test links a bit more aggressively:
public class AutoExpireSet<T>
{
    Dictionary<T, DateTime> items = new Dictionary<T, DateTime>();
    Dictionary<T, TimeSpan> expireOverride = new Dictionary<T, TimeSpan>();
    int defaultDurationSeconds;

    public AutoExpireSet(int defaultDurationSeconds)
    {
        this.defaultDurationSeconds = defaultDurationSeconds;
    }

    public bool TryReserve(T t)
    {
        bool reserved = false;
        lock (this)
        {
            DateTime dt;
            if (!items.TryGetValue(t, out dt))
            {
                dt = DateTime.MinValue;
            }

            if (dt < DateTime.UtcNow)
            {
                TimeSpan span;
                if (!expireOverride.TryGetValue(t, out span))
                {
                    span = TimeSpan.FromSeconds(defaultDurationSeconds);
                }

                items[t] = DateTime.UtcNow.Add(span);
                reserved = true;
            }
        }
        return reserved;
    }

    public void ExpireOverride(T t, TimeSpan span)
    {
        lock (this)
        {
            expireOverride[t] = span;
        }
    }
}
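As a rough usage sketch (the domain, URL and variable names below are illustrative, not the real crawler's configuration), each worker asks the set for a slot on the link's host before testing, and simply defers the link if the domain is still cooling down:

```csharp
// Hypothetical wiring: a ten second default per domain, with a more
// aggressive override for a site we are comfortable hitting harder.
var throttle = new AutoExpireSet<string>(10);
throttle.ExpireOverride("meta.stackoverflow.com", TimeSpan.FromSeconds(1));

var uri = new Uri("http://example.com/some/page");
if (throttle.TryReserve(uri.Host))
{
    // we won the slot for this host: test the link now
    Console.WriteLine("testing " + uri);
}
else
{
    // the host was hit within the last ten seconds;
    // re-queue the link and move on to the next one
    Console.WriteLine("deferring " + uri);
}
```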
- A robust validation function:
Our validation function captures many concepts I feel are very important.
public ValidateResult Validate(
    bool useHeadMethod = true,
    bool enableKeepAlive = false,
    int timeoutSeconds = 30)
{
    ValidateResult result = new ValidateResult();

    HttpWebRequest request = WebRequest.Create(Uri) as HttpWebRequest;

    if (useHeadMethod)
    {
        request.Method = "HEAD";
    }
    else
    {
        request.Method = "GET";
    }

    // always compress, if you get back a 404 from a HEAD
    // it can be quite big.
    request.AutomaticDecompression = DecompressionMethods.GZip;
    request.AllowAutoRedirect = false;
    request.UserAgent = UserAgentString;
    request.Timeout = timeoutSeconds * 1000;
    request.KeepAlive = enableKeepAlive;

    HttpWebResponse response = null;
    try
    {
        response = request.GetResponse() as HttpWebResponse;
        result.StatusCode = response.StatusCode;

        if (response.StatusCode == HttpStatusCode.Redirect ||
            response.StatusCode == HttpStatusCode.MovedPermanently ||
            response.StatusCode == HttpStatusCode.SeeOther ||
            response.StatusCode == HttpStatusCode.TemporaryRedirect)
        {
            try
            {
                Uri targetUri = new Uri(Uri, response.Headers["Location"]);
                var scheme = targetUri.Scheme.ToLower();
                if (scheme == "http" || scheme == "https")
                {
                    result.RedirectResult = new ExternalUrl(targetUri);
                }
                else
                {
                    // this little gem was born out of
                    // http://tinyurl.com/18r
                    // redirecting to about:blank
                    result.StatusCode = HttpStatusCode.SwitchingProtocols;
                    result.WebExceptionStatus = null;
                }
            }
            catch (UriFormatException)
            {
                // another gem ... people sometimes redirect to
                // http://nonsense:port/yay
                result.StatusCode = HttpStatusCode.SwitchingProtocols;
                result.WebExceptionStatus = WebExceptionStatus.NameResolutionFailure;
            }
        }
    }
    catch (WebException ex)
    {
        result.WebExceptionStatus = ex.Status;
        response = ex.Response as HttpWebResponse;
        if (response != null)
        {
            result.StatusCode = response.StatusCode;
        }
    }
    finally
    {
        try
        {
            request.Abort();
        }
        catch
        {
            /* ignore in case already aborted or failure to abort */
        }

        if (response != null)
        {
            response.Close();
        }
    }
    return result;
}
- From day 0, set yourself up with a proper user agent string.
If anything goes wrong, you want people to be able to contact you and let you know. Our link crawler has the user agent string of Mozilla/5.0 (compatible; stackexchangebot/1.0; +http://meta.stackoverflow.com/q/130398).
- Handle 302s, 303s and 307s
Even though the 302 and 303 status codes are fairly common, the 307 redirect is less common. It was introduced as a hack to work around misbehaving browsers, as explained here.
A prime example of a 307 is http://www.haskell.org. I strongly disagree with a redirect on a home page; URL rewriting and many other tools can deal with this use case without the extra round trip. Nonetheless, it exists.
When you get a redirect, you need to continue testing. Our link tester will only check up to 5 levels deep; you MUST have some depth limit set, otherwise you can easily find yourself in an infinite loop (a sketch of this follows below).
Redirects are odd beasts: web sites can redirect you to about:config or to an invalid URL. It is important to validate the information you got from the redirect.
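A minimal sketch of what depth-limited redirect chasing can look like, assuming the ExternalUrl and ValidateResult types from the validation function above (the loop shape and the maxDepth parameter are illustrative, not the exact production code):

```csharp
// Follow RedirectResult at most maxDepth times; if we are still being
// redirected after that, give up and treat the link as broken.
ValidateResult FollowRedirects(ExternalUrl url, int maxDepth = 5)
{
    ValidateResult result = url.Validate();
    int depth = 0;

    while (result.RedirectResult != null && depth < maxDepth)
    {
        // Validate has already rejected non-http(s) schemes and malformed
        // Location headers, so chasing the redirect target here is safe.
        result = result.RedirectResult.Validate();
        depth++;
    }

    return result;
}
```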
- Always abort your request once you have the information you need.
In the TCP protocol, packets carry status flags; if the client sends the server a packet with the FIN flag set, the connection is terminated early. By calling request.Abort you can avoid downloading a large, unneeded payload from the server in the case of a 404.
When testing links, you often want to avoid HTTP keepalive as well. There is no reason to burden the servers with additional connection maintenance when our tests are far apart.
A functioning abort also diminishes the importance of compression; however, I still recommend enabling compression anyway.
- Always try HEAD requests first, then fall back to GET requests
Some web servers disallow the HEAD verb. Amazon, for example, totally bans it, returning a 405 on HEAD requests. In ASP.NET MVC, people often explicitly set the verbs a route responds to, and developers frequently overlook adding HttpVerbs.Head when restricting a route to HttpVerbs.Get. The result is that if you fail (you don't get a redirect or a 200), you need to retry your test with the GET verb, roughly as sketched below.
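In code, the fallback is roughly the sketch below; treating "a 200 or a redirect" as success is a simplification of the real rules, and url stands in for the ExternalUrl being tested:

```csharp
// Try the cheap HEAD request first; if the server refuses to cooperate,
// repeat the whole test with GET before declaring the link broken.
ValidateResult result = url.Validate(useHeadMethod: true);

bool headLookedOk = result.StatusCode == HttpStatusCode.OK ||
                    result.RedirectResult != null;

if (!headLookedOk)
{
    result = url.Validate(useHeadMethod: false);
}
```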
- Ignore robots.txt
Initially I planned on being a good netizen and parsing all robots.txt files, respecting exclusions and crawl rates. The reality is that many sites, such as GitHub, Delicious and Facebook, take a white-list approach to crawling: all crawlers are banned except the ones they explicitly allow (usually Google, Yahoo and Bing). Since a link checker is not spidering a web site, and it is impractical to respect robots.txt, I recommend ignoring it, with one caveat: you should still respect the crawl rate. This was also discussed on Meta Stack Overflow.
- Have a sane timeout
When testing links we allow sites 30 seconds to respond; some sites may take longer … much longer. You do not want your link tester heavily blocked by a malfunctioning site, and I would consider a 30 second response time a malfunction.
- Use lots and lots of threads to test links
I run the link validator from my dev machine in Sydney; clearly, serializing 3 million web requests that each take an undetermined amount of time is not going to progress at any sane rate. When I run the link validator I use 30 threads.
Concurrency also raises a fair technical challenge given the above constraints: you do not want to block a thread just because you are waiting for a slot on a domain to free up.
I use my Async class to manage the queue. I prefer it over the Microsoft Task Parallel Library for this use case because the semantics for restricting the number of threads in a pool are trivial and the API is simple and lean.
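I will not reproduce the Async class here, but the overall shape is something like the sketch below (built on BlockingCollection purely for illustration): a fixed pool of 30 worker threads pulls links off a shared queue, and a link whose domain is still throttled goes back on the queue instead of blocking its worker.

```csharp
// Illustrative worker pool, not the actual Async class.
var queue = new BlockingCollection<ExternalUrl>();
var throttle = new AutoExpireSet<string>(10);

for (int i = 0; i < 30; i++)
{
    new Thread(() =>
    {
        foreach (var url in queue.GetConsumingEnumerable())
        {
            // url.Uri is assumed to be the System.Uri the ExternalUrl wraps
            if (!throttle.TryReserve(url.Uri.Host))
            {
                // the domain is cooling down; re-queue instead of blocking
                queue.Add(url);
                continue;
            }

            var result = url.Validate();
            // ... record the result ...
        }
    }) { IsBackground = true }.Start();
}
```

The re-queueing is crude, but it keeps every thread busy testing links rather than sleeping on a per-domain timer.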
- Broken once does not mean broken forever
I am still adjusting the algorithm that determines if a link is broken or not. One failure can always be a fluke. A couple of failures in a week could be a bad server crash or an unlucky coincidence.
At the moment two failures a day apart do seem to be correct most of the time, so instead of finding the perfect algorithm we will allow users to tell us when we made a mistake and assume a small margin of error.
In a similar vein, we still need to determine how often we should re-test links after a successful test. I think once every 3 months should suffice.
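Expressed as code, the current rules are roughly the sketch below; the names and signatures are made up for illustration, and the real logic lives in the database:

```csharp
// Two failures at least a day apart mark the link as broken.
bool LooksBroken(int failureCount, DateTime firstFailureUtc, DateTime lastFailureUtc)
{
    return failureCount >= 2 &&
           lastFailureUtc - firstFailureUtc >= TimeSpan.FromDays(1);
}

// After a successful test, leave the link alone for roughly three months.
bool DueForRetest(DateTime lastSuccessUtc)
{
    return DateTime.UtcNow - lastSuccessUtc >= TimeSpan.FromDays(90);
}
```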
## Some interesting observations from my link testing
### Kernel.org was hacked
On the 1st of September 2011, Kernel.org was hacked. What does this have to do with testing links, you may ask?
It turns out that the incident broke a whole bunch of documentation links, and these links remain broken today. For example, http://www.kernel.org/pub/software/scm/git/docs/git-svn.html appeared in 150 or so posts on Stack Overflow, yet it now takes you to an unkind 404 page instead of its new home at http://git-scm.com/docs/git-svn. Of all the broken links I came across, the broken git documentation is the worst failure; overall it affected over 6000 posts on Stack Overflow. Fixing it with an Apache rewrite rule would be trivial.
### Some sites like giving you no information in the URL
The link http://www.microsoft.com/downloads/details.aspx?familyid=e59c3964-672d-4511-bb3e-2d5e1db91038&displaylang=en is broken in 60 or so posts. Imagine if the link had been http://www.microsoft.com/downloads/ie-developer-toolbar-beta-3 instead. Even when Microsoft decided to nuke this link from the Internet, we could still make a sane guess as to where it was meant to take us.
### Make your 404 page special and useful - lessons from GitHub
Of all the 404 pages I came across, the one on GitHub enraged me most.
Why, you ask?
It looks AWESOME; there is even an AMAZING parallax effect. Haters gonna hate.
Well, actually.
https://github.com/dbalatero/typhoeus is linked from 50 or so posts; it has moved to https://github.com/typhoeus. GitHub put no redirect in place and simply takes you to a naked 404.
It would be trivial to do some rudimentary parsing of the URL string to determine where you really wanted to go:
I am sorry, we could not find the page you have linked to. Often users rename their accounts causing links to break. The “typhoeus” repository also exists at:
There you go: no smug message telling me I made a mistake, no Jedi mind tricks attempted on me. GitHub should take ownership of its 404 pages and make them useful. What bothers me most about the GitHub 404 is the disproportionate effort invested: instead of giving me pretty graphics, can I have some useful information, please?
You could also take this one step further and properly redirect repositories to their new homes. I understand that account renaming is a tricky business; however, it seems to be an incredibly common reason for 404 errors on GitHub.
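To illustrate the kind of rudimentary parsing I mean: a GitHub URL is just user/repository, so the repository name is the second path segment, and a 404 page could at least offer a search for it. This is a hypothetical sketch, not an existing GitHub feature:

```csharp
// Pull the repository name out of a dead GitHub link so the 404 page
// could suggest a search for it. Purely illustrative.
string SuggestSearchFor404(Uri deadLink)
{
    // e.g. https://github.com/dbalatero/typhoeus -> "dbalatero", "typhoeus"
    var segments = deadLink.AbsolutePath.Trim('/').Split('/');
    if (segments.Length < 2)
        return null;

    string repositoryName = segments[1];
    return "https://github.com/search?q=" + Uri.EscapeDataString(repositoryName);
}
```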
At Stack Overflow we spent a fair amount of time optimising for cases like this. For example, take “What is your favourite programmer joke?”. The community decided this question does not belong; we do our best to explain that it was removed, why it was removed, and where you could possibly find it.
### Oracle’s dagger
Oracle’s acquisition of Sun dealt a permanent and serious blow to the Java ecosystem. Oracle’s strict mission to re-brand and restructure the Java ecosystem was mismanaged, and a huge amount of documentation was not redirected initially. Even today, the projects under dev.java.net do not have proper redirects in place. Hudson, the Java continuous integration server, used to live at https://hudson.dev.java.net/; it is linked from over 150 Stack Overflow posts.
## Personal lessons
### The importance of the href title
In the age of URL shorteners and the rickroll, it seems that having URIs convey any sane information about where they will take you is less than encouraged. The reality, though, is that over 3 years probably 5% of the links you have are going to simply stop working. I am sure my blog is plagued with a ton of broken links as well. Fixing broken links is a difficult task; fixing them without context is much harder.
That is one big reason I am now going to think a little bit more about writing sane titles for my hyperlinks. Not only does this improve usability, search engine results and accessibility for the visually impaired, it also helps me fix these links when they eventually break.
### The fragility of the hyperlink
When we use Google we never get a 404; it shields us quite effectively from an ever-crumbling Internet. Testing a large number of links teaches you that the reality is far from peachy. Does this mean I should avoid linking? Heck no, but being aware of this fact can help me think about my writing. I would like to avoid writing articles that lose meaning when a link dies. On Stack Overflow we often see answers to the effect of:
See this blog post over here.
These kinds of answers fall apart when the external resource dies; they neglect to acknowledge the nature of the Internet.
Addendum (10/2013): GitHub now supports repo redirects (Repository redirects are here! - The GitHub Blog), and Kernel.org has fixed most of these links.