There are over 3 million distinct links in the Stack Exchange network. Over time many of these links rot and stop working.

Recently, I spent some time writing tools to determine which links are broken and assist the community in fixing them.

How do we do it?

First things first, we try to be respectful of other people’s websites.

Being a good web citizen

  • Throttle requests per domain

We use this automatically expiring set to ensure we do not hit a domain more than once every ten seconds. We make a handful of exceptions where we feel we need to test links a bit more aggressively:

public class AutoExpireSet<T>
{
    // when each item's current reservation expires
    Dictionary<T, DateTime> items = new Dictionary<T, DateTime>();

    // per-item overrides of the default reservation duration
    Dictionary<T, TimeSpan> expireOverride =
         new Dictionary<T, TimeSpan>();

    int defaultDurationSeconds;

    public AutoExpireSet(int defaultDurationSeconds)
    {
        this.defaultDurationSeconds = defaultDurationSeconds;
    }

    // Returns true if the caller may act on this item now, reserving it
    // for the configured duration; returns false if it is still reserved.
    public bool TryReserve(T t)
    {
        bool reserved = false;
        lock (this)
        {
            DateTime dt;
            if (!items.TryGetValue(t, out dt))
            {
                dt = DateTime.MinValue;
            }

            if (dt < DateTime.UtcNow)
            {
                TimeSpan span;
                if (!expireOverride.TryGetValue(t, out span))
                {
                    span = TimeSpan.FromSeconds(defaultDurationSeconds);
                }
                items[t] = DateTime.UtcNow.Add(span);
                reserved = true;
            }
        }
        return reserved;
    }

    // Use a custom reservation duration for this item instead of the default.
    public void ExpireOverride(T t, TimeSpan span)
    {
        lock (this)
        {
            expireOverride[t] = span;
        }
    }
}
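
To make the intended use concrete, here is a minimal usage sketch; the domain names and the one-second override are purely illustrative:

var throttle = new AutoExpireSet<string>(defaultDurationSeconds: 10);

// a domain we feel comfortable testing a bit more aggressively
throttle.ExpireOverride("stackoverflow.com", TimeSpan.FromSeconds(1));

if (throttle.TryReserve("example.com"))
{
    // safe to issue a request against example.com now
}
else
{
    // the domain was hit within the last ten seconds; re-queue this link
}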
  • A robust validation function:

Our validation function captures many concepts I feel are very important.

public ValidateResult Validate(
      bool useHeadMethod = true, 
      bool enableKeepAlive = false, 
      int timeoutSeconds = 30 )
{
    ValidateResult result = new ValidateResult();

    HttpWebRequest request = WebRequest.Create(Uri) 
                                  as HttpWebRequest;
    if (useHeadMethod)
    {
        request.Method = "HEAD";
    }
    else
    {
        request.Method = "GET";
    }

    // always compress, if you get back a 404 from a HEAD
    //     it can be quite big.
    request.AutomaticDecompression = DecompressionMethods.GZip;
    request.AllowAutoRedirect = false;
    request.UserAgent = UserAgentString;
    request.Timeout = timeoutSeconds * 1000;
    request.KeepAlive = enableKeepAlive;

    HttpWebResponse response = null;
    try
    {
        response = request.GetResponse() as HttpWebResponse;

        result.StatusCode = response.StatusCode;
        if (response.StatusCode == 
                   HttpStatusCode.Redirect ||
            response.StatusCode == 
                   HttpStatusCode.MovedPermanently ||
            response.StatusCode == 
                   HttpStatusCode.SeeOther || 
            response.StatusCode == 
                   HttpStatusCode.TemporaryRedirect)
        {
            try
            {
                Uri targetUri = 
                  new Uri(Uri, response.Headers["Location"]);
                var scheme = targetUri.Scheme.ToLower();
                if (scheme == "http" || scheme == "https")
                {
                    result.RedirectResult = 
                        new ExternalUrl(targetUri);
                }
                else
                {
                    // this little gem was born out of 
                    //   http://tinyurl.com/18r 
                    //   redirecting to about:blank
                    result.StatusCode = 
                           HttpStatusCode.SwitchingProtocols;
                    result.WebExceptionStatus = null;
                }
            }
            catch (UriFormatException)
            {
                // another gem ... people sometimes redirect to
                //    http://nonsense:port/yay 
                result.StatusCode = 
                    HttpStatusCode.SwitchingProtocols;
                result.WebExceptionStatus =
                    WebExceptionStatus.NameResolutionFailure;
            }
                    
        }
    }
    catch (WebException ex)
    {
        result.WebExceptionStatus = ex.Status;
        response = ex.Response as HttpWebResponse;
        if (response != null)
        {
            result.StatusCode = response.StatusCode;
        }
    }
    finally
    {
        try
        {
           request.Abort();
        }
        catch 
        { /* ignore in case already 
           aborted or failure to abort */ 
        }
        
        if (response != null)
        {
            response.Close();
        }
    }

    return result;
}
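
The ValidateResult type is not included in this post; judging from the fields the method assigns, a minimal sketch of it might look like this:

// A sketch of the result type implied by the code above, not the actual
// implementation: it carries the final status code, any WebException
// status, and the next link to test when the response was a redirect.
public class ValidateResult
{
    public HttpStatusCode? StatusCode { get; set; }
    public WebExceptionStatus? WebExceptionStatus { get; set; }
    public ExternalUrl RedirectResult { get; set; }
}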
  • From day 0 set yourself up with a proper User Agent String.

If anything goes wrong, you want people to be able to contact you and let you know. Our link crawler has the user agent string Mozilla/5.0 (compatible; stackexchangebot/1.0; +http://meta.stackoverflow.com/q/130398).

  • Handle 302s, 303s and 307s

Even though the 302 and 303 redirect codes are fairly common, there is also the less common 307 redirect. It was introduced as a hack to work around misbehaving browsers, as explained here.

A prime example of a 307 would be http://www.haskell.org. I strongly disagree with a redirect on a home page; URL rewriting and many other tools can deal with this use case without the extra round trip. Nonetheless, it exists.

When you get a redirect, you need to continue testing. Our link tester will only check up to 5 levels deep. You MUST have some depth limit set; otherwise you can easily find yourself in an infinite loop.

Redirects are odd beasts: web sites can redirect you to about:config or to an invalid URL. It is important to validate the information you get from the redirect.
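
To make the depth limit concrete, here is a minimal sketch of the redirect-following loop, assuming the ExternalUrl class and Validate method shown above (the helper itself is illustrative, not our production code):

// Follow redirects at most five levels deep so a redirect loop can never
// trap the tester.
ValidateResult TestWithRedirects(ExternalUrl url, int maxDepth = 5)
{
    ValidateResult result = null;
    for (int depth = 0; depth < maxDepth; depth++)
    {
        result = url.Validate();
        if (result.RedirectResult == null)
        {
            return result; // terminal answer: 200, 404, timeout, etc.
        }
        url = result.RedirectResult; // keep following the chain
    }
    return result; // give up after maxDepth redirects
}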

  • Always abort your request once you have the information you need.

In the TCP protocol, special status flags can be set on packets as they are acknowledged. If the client sends the server a packet with the FIN flag set, the connection is terminated early. By calling request.Abort you can avoid downloading a large payload from the server in the case of a 404.

When testing links, you often want to avoid HTTP keepalive as well. There is no reason to burden the servers with additional connection maintenance when our tests are far apart.

A functioning abort also diminishes the importance of compression; however, I still recommend enabling compression anyway.

  • Always try HEAD requests first, then fall back to GET requests

Some web servers disallow the HEAD verb. Amazon, for example, bans it outright, returning a 405 on HEAD requests. In ASP.NET MVC, people often explicitly set the verbs the router passes through, and developers frequently overlook adding HttpVerbs.Head when restricting a route to HttpVerbs.Get. The result is that if a HEAD request fails (you get neither a redirect nor a 200), you need to retry the test with the GET verb, as sketched below.
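
Here is a rough sketch of that fallback, again assuming the ExternalUrl.Validate method and the ValidateResult sketch from earlier:

// Try the cheap HEAD request first; if the server refuses it (405) or we
// otherwise get neither a redirect nor a 2xx, retry the link with GET.
ValidateResult TestLink(ExternalUrl url)
{
    var result = url.Validate(useHeadMethod: true);

    bool ok = result.RedirectResult != null ||
              (result.StatusCode.HasValue &&
               (int)result.StatusCode.Value >= 200 &&
               (int)result.StatusCode.Value < 300);

    if (!ok)
    {
        result = url.Validate(useHeadMethod: false);
    }
    return result;
}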

  • Ignore robots.txt

Initially I planned on being a good Netizen and parsing all robots.txt files, respecting exclusions and crawl rates. The reality is that many sites, such as GitHub, Delicious and Facebook, have a whitelist approach to crawling: all crawlers are banned except the ones they explicitly allow (usually Google, Yahoo and Bing). Since a link checker is not spidering a web site, and it is impractical to respect robots.txt, I recommend ignoring it - with the caveat that you should still respect the crawl rate. This was also discussed on Meta Stack Overflow.
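
If you do want to honour the crawl rate, one option is to feed any Crawl-delay you find back into the AutoExpireSet from earlier. The parsing below is deliberately naive and purely illustrative:

// Naive sketch: look for a "Crawl-delay: N" line in robots.txt and use it
// as the per-domain throttle interval. Real robots.txt parsing has more
// edge cases (user-agent sections, comments) than this handles.
void ApplyCrawlDelay(string domain, string robotsTxt, AutoExpireSet<string> throttle)
{
    foreach (var line in robotsTxt.Split('\n'))
    {
        var trimmed = line.Trim();
        if (trimmed.StartsWith("Crawl-delay:", StringComparison.OrdinalIgnoreCase))
        {
            int seconds;
            if (int.TryParse(trimmed.Substring("Crawl-delay:".Length).Trim(), out seconds))
            {
                throttle.ExpireOverride(domain, TimeSpan.FromSeconds(seconds));
            }
            return;
        }
    }
}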

  • Have a sane timeout

When testing links we allow sites 30 seconds to respond; some sites may take longer … much longer. You do not want a malfunctioning site to tie up your link tester, and I would consider a 30 second response time a malfunction.

  • Use lots and lots of threads to test links

I run the link validator from my dev machine in Sydney; clearly, serializing 3 million web requests that each take an undetermined amount of time is not going to progress at any sane rate. When I run my link validator I use 30 threads.

Concurrency also raises a fair technical challenge given the above constraints. You do not want to block a thread because you are waiting for a slot on a domain to free up.

I use my Async class to manage the queue. I prefer it over the Microsoft Task Parallel Library for this use case because the semantics for restricting the number of threads in a pool are trivial and the API is very simple and lean.
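
The Async class itself is not shown here; the sketch below captures the same idea with plain .NET primitives, a shared queue plus the per-domain throttle from earlier. The linksToTest collection and the result bookkeeping are placeholders, not our production code:

// Illustrative worker pool: 30 threads pull links off a shared queue, skip
// (and re-queue) any link whose domain was hit in the last ten seconds,
// and test the rest.
var queue = new ConcurrentQueue<ExternalUrl>(linksToTest);
var throttle = new AutoExpireSet<string>(defaultDurationSeconds: 10);

var workers = Enumerable.Range(0, 30).Select(i => new Thread(() =>
{
    ExternalUrl url;
    while (queue.TryDequeue(out url))
    {
        if (!throttle.TryReserve(url.Uri.Host))
        {
            queue.Enqueue(url); // domain is busy; push it back and move on
            Thread.Sleep(100);  // avoid spinning hot when everything is throttled
            continue;
        }
        var result = url.Validate();
        // record result ...
    }
})).ToList();

workers.ForEach(t => t.Start());
workers.ForEach(t => t.Join());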

  • Broken once does not mean broken forever

I am still adjusting the algorithm that determines if a link is broken or not. One failure can always be a fluke. A couple of failures in a week could be a bad server crash or an unlucky coincidence.

At the moment, two failures a day apart does seem to be correct most of the time - so instead of finding the perfect algorithm we will allow users to tell us when we made a mistake and accept a small margin of error.
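
Expressed as code, the current heuristic is roughly the following; the shape of the failure log is hypothetical:

// Hypothetical sketch: treat a link as broken once it has failed at least
// twice, with at least a day between the first and the last failure.
bool IsProbablyBroken(List<DateTime> failureTimesUtc)
{
    if (failureTimesUtc.Count < 2) return false;
    failureTimesUtc.Sort();
    TimeSpan spread = failureTimesUtc[failureTimesUtc.Count - 1] - failureTimesUtc[0];
    return spread >= TimeSpan.FromDays(1);
}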

In a similar vein, we still need to determine how often we should re-test links after a successful test. I think once every 3 months should suffice.

## Some interesting observations from my link testing

### Kernel.org was hacked

On the 1st of September 2011 Kernel.org was hacked. What does this have to do with testing links, you may ask?

Turns out that they broke a whole bunch of documentation links, and these links remain broken today. For example: http://www.kernel.org/pub/software/scm/git/docs/git-svn.html appeared in 150 or so posts on Stack Overflow, yet now takes you to an unkind 404 page instead of its new home at http://git-scm.com/docs/git-svn. Of all the broken links I came across, the broken git documentation is the worst failure. Overall it affected over 6000 posts on Stack Overflow. Fixing it with an Apache rewrite rule would be trivial.

### Some sites like giving you no information in the URL

The link http://www.microsoft.com/downloads/details.aspx?familyid=e59c3964-672d-4511-bb3e-2d5e1db91038&displaylang=en is broken in 60 or so posts. Imagine if the link had been http://www.microsoft.com/downloads/ie-developer-toolbar-beta-3. Even when Microsoft decided to nuke this link from the Internet, we could still make a sane guess as to where it was meant to take us.

### Make your 404 page special and useful - lessons from GitHub

Of all the 404 pages I came across, the one on GitHub enraged me most.

Why you ask?

GitHub 404

It looks AWESOME; there is even an AMAZING parallax effect. Haters gonna hate.

Well, actually.

https://github.com/dbalatero/typhoeus is linked from 50 or so posts; it has moved to https://github.com/typhoeus. GitHub put no redirect in place and simply takes you to a naked 404.

It would be trivial to do some rudimentary parsing on the URL string (sketched a little further down) to determine where you really wanted to go:

I am sorry, we could not find the page you have linked to. Often users rename their accounts causing links to break. The “typhoeus” repository also exists at:

Typhoeus · GitHub

There you go: no smug message telling me I made a mistake while attempting Jedi mind tricks on me. GitHub should take ownership of their 404 pages and make them useful. What bothers me most about the GitHub 404 is the disproportionate amount of effort invested: instead of giving me pretty graphics, can I have some useful information, please?

You could also take this one step further and properly redirect repositories to their new homes. I understand that account renaming is tricky business; however, it seems to be an incredibly common reason for 404 errors on GitHub.
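
For what it is worth, here is a rudimentary sketch of the kind of URL parsing I have in mind; the suggested search URL is a guess on my part, not an official GitHub endpoint:

// Hypothetical sketch: pull the repository name out of a dead GitHub URL
// and build a search link to suggest on the 404 page.
string SuggestSearchUrl(Uri deadGitHubUrl)
{
    // e.g. https://github.com/dbalatero/typhoeus -> "dbalatero", "typhoeus"
    var segments = deadGitHubUrl.AbsolutePath.Trim('/').Split('/');
    if (segments.Length < 2) return null;

    string repo = segments[1];
    return "https://github.com/search?q=" + Uri.EscapeDataString(repo);
}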

At Stack Overflow we spent a fair amount of time optimising for cases like this. For example, take “What is your favourite programmer joke?”. The community decided this question did not belong. We do our best to explain that it was removed, why it was removed, and where you could possibly find it.

### Oracle’s dagger

Oracle’s acquisition of Sun dealt a permanent and serious blow to the Java ecosystem. Oracle’s strict mission to re-brand and restructure the Java ecosystem was mismanaged: a huge amount of documentation was not redirected initially, and even today the projects under dev.java.net do not have proper redirects in place. Hudson, the Java continuous integration server, used to live at https://hudson.dev.java.net/; it is linked from over 150 Stack Overflow posts.

## Personal lessons

### The importance of the href title

In the age of URL shorteners and the rick roll, it seems that having URIs convey any sane information about where they will take you is less than encouraged. The reality, though, is that over 3 years probably 5% of the links you have are simply going to stop working. I am sure my blog is plagued with a ton of broken links as well. Fixing broken links is a difficult task. Fixing them without context is much harder.

That is one big reason I am now going to think a little bit more about writing sane titles for my hyperlinks. Not only does this improve usability, search engine results and accessibility for the visually impaired; it also helps me fix these links when they eventually break.

### The fragility of the hyperlink

When we use Google we never get a 404; it shields us quite effectively from an ever-crumbling Internet. Testing a large number of links teaches you that the reality is far from peachy. Does this mean I should avoid linking? Heck no, but being aware of this fact can help me think about my writing. I would like to avoid writing articles that lose meaning when a link dies. On Stack Overflow we often see answers to the effect of:

See this blog post over here.

These kinds of answers fall apart when the external resource dies; they neglect to acknowledge the nature of the Internet.


Addendum (10/2013): GitHub now supports repo redirects (Repository redirects are here! - The GitHub Blog), and Kernel.org fixed most of these links.

Comments

Cameron almost 12 years ago

Interesting post. I especially liked the GitHub 404 bit. It is annoying to see a blank 404 page given someone spent some time on it to do the fancy graphics.

A technique I now use is to post the relevant bits of the article I'm linking to in my answer, so if the link dies the answer is still useful.

Sam Saffron almost 12 years ago

Indeed, Jeff used to edit posts quite often to include context, you are doing the right thing there.

Tim_Post almost 12 years ago

It's been rather interesting seeing the comments you made on Twitter while you were working on this, and thanks for taking a moment to expand on them and sum the experience up nicely.

I did similar work, I maintain an actual proprietary crawler for one of my clients (mostly internal stuff, but that breaks just as often). I ended up going with Redis for throttling and counting, which also solved quite a few concurrency issues since I run checks from several different servers.

Sam Saffron almost 12 years ago

cool, thanks, totally sane to use Redis for throttling and logging; it fits perfectly for this kind of job. Having a good log to go back to is so important.

concurrency is really tough especially when dealing with the redirect chains.

Brian_Cardarella almost 12 years ago

Awesome post.

Virgil_Griffith almost 12 years ago

You my good sir are notably competent.

Arpit_Bansal almost 12 years ago

Nice !!

Ben_Powell almost 12 years ago

Excellent post Sam. There are some gems in there. Thanks for sharing the code snippets as well.

Tony_Edgecombe almost 12 years ago

This is a good reminder to us all to avoid breaking url's whenever possible.

Pete_Duncanson almost 12 years ago

I liked this post a lot. I love this sort of coding, data processing jobs where you keep finding quirks that you need to patch. I feel like a detective when doing this sort of work. Really interesting read. Thanks.

Sam Saffron almost 12 years ago

Yes, very much so. So much effort sometimes has to go in to so little code

David_Landgren almost 12 years ago

“In a similar vain” => vein

Sam Saffron almost 12 years ago

thanks fixed.

Philip almost 12 years ago

This is why I try to avoid using the “link” option on Stack Overflow page that results in the link http://stackoverflow.com/a/3623727 rather than the more informative link http://stackoverflow.com/questions/2604727/how-can-i-connect-to-android-with-adb-over-tcp#3623727.

(Although I've just realised the anchor name is the same as the id in the URL so at least I don't have to dig in the page source anymore.)

Time_Lord almost 12 years ago

Try googling “robust hyperlinks”.

Perhaps SO could automatically slurp down the content of each submitted link and add as a fallback a search using five distinctive search terms that are sufficient to locate a linked page.

Sam Saffron almost 12 years ago

yeah that is an interesting approach. at SO scale it would be a fair technical challenge.

Time_Lord almost 12 years ago

Looks like one of the StackFathers thought of this too: http://www.codinghorror.com/blog/2004/08/unbreakable-links-revisited.html

“Robust Hyperlinks Cost Just Five Words Each” is the title of the more readable version of the idea and easier for me to write down than paste in a URL.

Svick almost 12 years ago

I don't understand your point about ServicePointManager.DefaultConnectionLimit. It limits the number of connections per domain, so if you have a limit of one request to a domain per 10 seconds, I think this shouldn't affect you.

And restricting the number of threads using TPL is not difficult, you can use LimitedConcurrencyLevelTaskScheduler from ParallelExtensionsExtras.

Also, it might make sense to consider using await when it's released, it will mean you don't have to use 30 threads to have 30 concurrent requests.

P.S. Thanks for fixing the git manual links.

Sam Saffron almost 12 years ago

I am going to remove that point, I swear that during testing increasing that number made a global difference, in retrospect this was probably a bug.

As to the TPL thing, I remember seeing those extensions, agree they are fine to use, ironing out all the bugs from my Async class has taken literally years.

Also agree about the async semantics being possibly handy at driving up throughput with a reduced thread count.

Jed almost 12 years ago

Threads do make this type of thing easier to understand but they're only needed because HttpWebRequest GetResponse is sync-ing asynchronous web requests.

There is a whole ‘nother layer hiding under this; it is still kind of a black art (timeouts have to be re-implemented manually last I heard) but ideally only a couple threads are busy kicking off a bunch of callbacks.

Sam Saffron almost 12 years ago

yeah asynching this is another approach which can work fine.

Salman almost 12 years ago

So how long did it take to spider the 3 million links?

Sam Saffron almost 12 years ago

I can test about 100k links an hour. I changed so many params during the process that the total time is not indicative.

Gwern almost 12 years ago

One thing SO could do as a company is sign up for the Internet Archive's Archive-it service: on-demand archiving.

Then one could, say, once a day submit all new URLs to one's Archive-it account, and reload all 3M URLs every year or two; with guaranteed copies in the Internet Archive, fixing external links is very easy indeed.

Sam Saffron almost 12 years ago

I like this idea, will mention to the team

Kevin almost 12 years ago

Thanks for the post Sam, I recently did the same analysis on my 5-year-old site and found that out of 2300 links, 230 of them were broken.

At that time I had an idea that SO could rewrite all external links using an internal link shortener, so all links to the git-scm page are going through the same SO link. If the external link broke, someone could update the SO short link and fix the content on all 150 pages. A search engine unique-link-builder could work too.

http://kev.inburke.com/kevin/broken-links/

Erick_T almost 12 years ago

I did a similar project a few years ago and had a lot of the same problems. I changed the system that stored the links so that when a link was added, the system grabbed the HTML (a la Google cache) and also did an image screenshot (to avoid missing resources in cached HTML). It wouldn't help you at this point, but making a change like this would make your life easier later.

Sam Saffron almost 12 years ago

Erick, wow that sound excellent. A very smart system you built there, have you had a chance to blog about it?

Unni almost 12 years ago

Checked this 404? www.foradian.com/404

Konstantin_Ryabitsev almost 12 years ago

Sorry for the broken git doc hyperlinks. We had to completely overhaul how we publish docs from git trees, which is why it took a long time. The old links should work now. See?

http://www.kernel.org/pub/software/scm/git/docs/git-svn.html

Sam Saffron almost 12 years ago

Thanks heaps Konstantin, seems to work fine

Bryan almost 12 years ago

This is why I always cringe when I come across an accepted answer on SO that says something like, “Yeah, I figured out the solution and posted it on my blog.”

Once the blog dies (and brother, that blog always dies quick), the answer is useless.

I don't know why people answer that way and why “answers” like that are accepted. Sigh.

Bryan almost 12 years ago

Okay, it is freaking TWO THOUSAND AND TWELVE. Why does this commenting system not have Preview? (or Edit after the fact)

Why?

If it is too much trouble to do it right, then use Disqus, for crissakes.

Sam Saffron almost 12 years ago

I admit, it is a bit ghetto, will sort it out

Codes_In_Chaos almost 12 years ago

How do you handle redirects to a status 200 page when the page isn't found?

Those are pretty common, and come in at least two flavors:

  1. Redirect to the root directory
  2. Redirect to a dedicated error page

I'd probably check whether the link we want to test and a randomly generated URL get redirected to the same page.


@Bryan I'll take a minimal comment system like this over Disqus any time.

Sam Saffron over 10 years ago

As a founder of Discourse, I thought it would be apt to use Discourse here: Discourse as my blogging platform

