A day in the life of a slow page at Stack Overflow
In this post I would like to walk through our internal process of tuning a particular page on Stack Overflow.
In this example I will be looking at our badge detail page. It is not the most important page on the site; however, it happens to showcase quite a few issues.
Step 1: Do we even have a problem?
As a developer, when I browse through the sites, I can see how long a page takes to render. Furthermore, I can break down this time and figure out what is happening on the page.
To make our lives even easier, I can see a timeline view of all the queries that executed on the page (at the moment rendered as an HTML comment).
This is made possible using our mini profiler and a private implementation of DbConnection, DbCommand and DbDataReader. Our implementations of these raw database access constructs provide us with both tracking and extra protection (if we figure out a way of packaging this, we will be happy to open-source it).
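By way of illustration, here is a minimal sketch of the wrapping trick, assuming a simple decorator over the provider's real command; this is not our actual implementation, and the logging callback is a hypothetical stand-in for whatever feeds the timeline view:

using System;
using System.Data;
using System.Data.Common;
using System.Diagnostics;

// Sketch only: delegate everything to the real DbCommand and time the reads.
public class ProfiledCommand : DbCommand
{
    private readonly DbCommand _inner;
    private readonly Action<string, TimeSpan> _log; // hypothetical sink feeding the timeline

    public ProfiledCommand(DbCommand inner, Action<string, TimeSpan> log)
    {
        _inner = inner;
        _log = log;
    }

    protected override DbDataReader ExecuteDbDataReader(CommandBehavior behavior)
    {
        var watch = Stopwatch.StartNew();
        var reader = _inner.ExecuteReader(behavior);
        watch.Stop();
        _log(_inner.CommandText, watch.Elapsed); // record what ran and how long it took
        return reader;
    }

    // A real implementation would time ExecuteNonQuery and ExecuteScalar the same way;
    // everything below simply passes through to the wrapped command.
    public override string CommandText { get { return _inner.CommandText; } set { _inner.CommandText = value; } }
    public override int CommandTimeout { get { return _inner.CommandTimeout; } set { _inner.CommandTimeout = value; } }
    public override CommandType CommandType { get { return _inner.CommandType; } set { _inner.CommandType = value; } }
    public override bool DesignTimeVisible { get { return _inner.DesignTimeVisible; } set { _inner.DesignTimeVisible = value; } }
    public override UpdateRowSource UpdatedRowSource { get { return _inner.UpdatedRowSource; } set { _inner.UpdatedRowSource = value; } }
    protected override DbConnection DbConnection { get { return _inner.Connection; } set { _inner.Connection = value; } }
    protected override DbParameterCollection DbParameterCollection { get { return _inner.Parameters; } }
    protected override DbTransaction DbTransaction { get { return _inner.Transaction; } set { _inner.Transaction = value; } }
    public override void Cancel() { _inner.Cancel(); }
    public override int ExecuteNonQuery() { return _inner.ExecuteNonQuery(); }
    public override object ExecuteScalar() { return _inner.ExecuteScalar(); }
    public override void Prepare() { _inner.Prepare(); }
    protected override DbParameter CreateDbParameter() { return _inner.CreateParameter(); }
}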
When I looked at this particular page I quickly noticed a few concerning facts.
- It looks like we are running lots of database queries; our trusty old friend, N+1.
- More than half of the time is spent on the web server rather than in the database, which is very concerning.
- We are running some expensive queries, in particular a count that takes 11ms and a select that takes about 50ms.
At this point, I need to decide if it makes sense to spend any more effort here. There may be bigger and more important things I need to work on.
Since we store all of the haproxy logs (and retain them for a month), I am able to see how long any page takes to render on average, as well as how many times it is hit per day.
select COUNT(*), AVG(Tr)
from dbo.Log_2011_05_01
where Uri like '/badges/%'

----------- -----------
26834       532
This particular family of pages is accessed 26k times a day, and it takes us, on average, 532ms to render one. Ouch … pretty painful. Nonetheless, I should probably look at other areas; this is not the most heavily accessed part of the site.
That said, Google takes the speed of your pages into account; if your page render time is high, it will affect your Google ranking.
I wrote this mess. I know it can be done way faster.
Code review
The first thing I do is a code review:
var badges = from u2b in DB.Users2Badges
             join b in DB.Badges on u2b.BadgeId equals b.Id
             from post in DB.Posts.Where(p => p.Id == u2b.ReasonId && b.BadgeReasonType == BadgeReasonType.Post && p.DeletionDate == null).DefaultIfEmpty()
             from tag in DB.Tags.Where(t => t.Id == u2b.ReasonId && b.BadgeReasonType == BadgeReasonType.Tag).DefaultIfEmpty()
             where u2b.BadgeId == badge.Id
             orderby u2b.Date descending
             select new BadgesViewModel.BadgeInfo
             {
                 Users2Badge = u2b,
                 Post = post,
                 Tag = tag,
                 User = u2b.User,
                 Badge = badge
             };

var paged = badges.ToPagedList(page.Value, 60);
A typical LINQ-2-SQL multi join. ToPagedList performs a Count, then offsets the results to the particular page we are interested in using Skip and Take.
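Conceptually, ToPagedList boils down to something like the following hypothetical helper (names and signature are illustrative, not our real implementation): two round trips, a COUNT for the pager and a Skip/Take for the requested page.

using System.Collections.Generic;
using System.Linq;

public static class PagingExtensions
{
    // Hypothetical sketch of a ToPagedList-style helper.
    public static List<T> ToPagedList<T>(this IQueryable<T> source, int page, int pageSize, out int totalCount)
    {
        totalCount = source.Count();          // round trip 1: SELECT COUNT(*)
        return source
            .Skip((page - 1) * pageSize)      // round trip 2: the page itself,
            .Take(pageSize)                   // translated to ROW_NUMBER paging by LINQ-2-SQL
            .ToList();
    }
}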
Nothing horribly wrong here; in fact, this is probably considered a best practice.
However, as soon as we start measuring, I quickly notice that even though the SQL for this only takes 12 or so milliseconds locally, the total time it takes to execute the above code is much higher: profiling the above block shows a 90ms execution time.
The LINQ-2-SQL abstraction is leaking 78ms of performance, OUCH. This is because it needs to generate a SQL statement from our fancy inline LINQ, and it needs to generate deserializers for our 5 objects. 78ms of overhead for 60 rows seems unreasonable to me.
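Measuring this needs nothing fancy; a sketch of the kind of timing harness involved (illustrative only, not the actual profiling code):

var watch = System.Diagnostics.Stopwatch.StartNew();
var paged = badges.ToPagedList(page.Value, 60);   // SQL generation + execution + materialization
watch.Stop();
// ~90ms total locally, versus ~12ms of actual SQL time reported by the profiler
Console.WriteLine("total: {0}ms", watch.ElapsedMilliseconds);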
Not only did performance leak here, but some ungodly SQL was generated … twice.
Dapper to the rescue
The first thing I set out to do is rewrite the same query in raw SQL using dapper. Multi mapping support means it is really easy to do such a conversion.
var sql = @"select ub.*, u.*, p.*, t.* from
(
    select *, row_number() over (order by Date desc) Row
    from Users2Badges
    where BadgeId = @BadgeId
) ub
join Badges b on b.Id = ub.BadgeId
join Users u on u.Id = ub.UserId
left join Posts p on p.Id = ub.ReasonId and b.BadgeReasonTypeId = 2
left join Tags t on t.Id = ub.ReasonId and b.BadgeReasonTypeId = 1
where Row > ((@Page - 1) * @PerPage) and Row <= @Page * @PerPage
order by ub.Date desc
";
var countSql = @"select count(*) from Users2Badges ub join Users u on u.Id = ub.UserId where BadgeId = @Id";
var rows = Current.DB.Query<Users2Badge, User, Post, Tag, BadgesViewModel.BadgeInfo>(sql,
    (ub, u, p, t) => new BadgesViewModel.BadgeInfo
    {
        Users2Badge = ub,
        User = u,
        Post = p,
        Tag = t,
        Badge = badge
    },
    new { Page = page, BadgeId = badge.Id, PerPage = 60 }).ToList();
var count = Current.DB.Query<int>(countSql, new { Id = id.Value }).First();
I measure 0.2ms of performance leak in dapper … sometimes; sometimes it is too small to measure at all. We are making progress. Personally, I also find this more readable and debuggable.
This conversion comes with a multitude of wonderful side effects. I now have SQL I can comfortably work with in SQL Management Studio, and in SQL Management Studio I am superman: I can mold the query into the shape I want, and I can profile it. Also, the count query is now way simpler than the select query.
We are not done quite yet
Removing the ORM overhead is one thing; however, we also had a few other problems. I noticed that in production this query was taking way longer, even though I like to keep my local DB in sync with production so I can experience similar pains locally. This smells of a bad execution plan being cached, which usually happens because indexes are missing. I quickly run the query with an execution plan and see one big, FAT line:
The entire clustered index on the Users2Badges table is being scanned just to pull out the first page. Not good. Plans like that also introduce erratic query performance: what will happen when the query needs to look at a different badge or a different page? This query may be “optimal” given the circumstances for page 1 of this badge; however, it may be far from optimal for a badge that only has 10 rows in the table. An index scan + bookmark lookup may do the trick there.
In production, depending on who hit which page first for which badge, you are stuck with horrible plans that lead to erratic behavior. The fix is NOT to put in OPTION (RECOMPILE); it is to correct the indexing.
So, SQL Server tells me it wants an index on Users2Badges(BadgeId) that includes (Id, UserId, Date, Comment, ReasonId). I will try anything once, so I give it a shot, and notice there is absolutely no performance impact. This index does not help.
Recommendations from the execution plan window are often wrong.
We are trying to pull out badge records filtered by badge id and ordered by date. So the correct index here is:
CREATE NONCLUSTERED INDEX Users2Badges_BadgeId_Date_Includes
ON [dbo].[Users2Badges] ([BadgeId], [Date])
INCLUDE ([Id],[UserId],[Comment],[ReasonId])
This new index cuts local SQL time by a few milliseconds, but more importantly, it cuts the time it takes to pull page 100 by a factor of 10. The same plan is now used regardless of Page or Badge, even from a cold start.
Correcting the N+1 issue
Lastly, I notice 50-60ms spent on an N+1 query on my local copy; fixing this is easy. It looks like we are pulling parent posts for the posts associated with the badges, and the profiler shows me that this work is deferred to the View on a per-row basis. I quickly change the ViewModel to pull in the missing posts and add a left join to the query to pull them in, roughly as sketched below. Easy enough.
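The shape of that fix is roughly the following (a hypothetical sketch: the extra pp join and the ParentId column are assumptions based on the description above, not the exact production query):

var sql = @"select ub.*, u.*, p.*, pp.*, t.* from
(
    select *, row_number() over (order by Date desc) Row
    from Users2Badges
    where BadgeId = @BadgeId
) ub
join Badges b on b.Id = ub.BadgeId
join Users u on u.Id = ub.UserId
left join Posts p on p.Id = ub.ReasonId and b.BadgeReasonTypeId = 2
-- hypothetical: pull each parent post in the same round trip instead of per row in the view
left join Posts pp on pp.Id = p.ParentId
left join Tags t on t.Id = ub.ReasonId and b.BadgeReasonTypeId = 1
where Row > ((@Page - 1) * @PerPage) and Row <= @Page * @PerPage
order by ub.Date desc";

The multi-mapping call then grows one extra Post type parameter, so the parent post lands on the ViewModel instead of being fetched lazily once per row while rendering.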
Conclusion
After deploying these fixes in production, load time on the page went down from 630ms to 40ms; for pages deep in the list it is now over 100x faster.
Tuning SQL is key: the simple act of tuning it reduced the load time of this particular page by almost 50% in production. However, having a page take 300ms just because your ORM is inefficient is not excusable. The page is now running at the totally awesome speed of 20-40ms render time. In production. ORM inefficiency cost us a 10x slowdown.
Yup – tuning SQL is the key. It's one of those things that many developers don't realise, and why I'm very happy to be working in the SQL space. I'm also very proud to think that a query of mine is in the StackOverflow codebase. :)
http://stackoverflow.com/questions/1176011/sql-to-determine-minimum-sequential-days-of-access/1176255#1176255
Rob