Speeding up Rails 4.2
Recently Godfrey Chan got Discourse working on Rails Master. It was a rather long task that involved some changes to Discourse internals and some changes to Rails internals.
Knowing Rails 4.2 was just around the corner, I decided it was the perfect time to see how performance was looking. Seeing Rails 4.2 contains the Adequate Record patches, I was super eager to see how Discourse fared.
###The Discourse benchmark
People usually answer the question of “how fast is Ruby?” or “how fast is Rails?” with the following set of steps:
- Upgrade production to a certain version of Rails
- Look at New Relic reports
- Depending on the results, either complain on Twitter and/or blog, or write a few kind words.
The trouble is that after release is the absolute worst time for performance testing. Code is in production, fixing stuff is tricky, and rolling back is often impractical.
At Discourse I developed the Discourse benchmark. It is a simple script that loads up a few thousand users and topics and then proceeds to measure performance of various hot spots using apache bench.
I have found that the results of the benchmark are strikingly similar to real-world performance we see. If the benchmark is faster, it is very likely that it will also be faster in production. We use this benchmark to test Ruby before major Ruby releases. Koichi used this benchmark to help optimise the current GC algorithms in MRI.
It is often tempting to look at micro benchmarks when working on performance issues. Micro benchmarks are a great tool, but MUST be followed with bigger macro benchmarks to see the real impact. A 1000% speedup for a routine that is called once a day in a background job has significantly less impact than a 0.5% improvement to a routine that is called on every web request.
###How was Rails 4.2 Looking?
For over a year now, Discourse has had the ability to dual boot. This allowed me to quickly run a simple benchmark to see where we were at (arel, rails):
```
% RAILS_MASTER=1 ruby script/bench.rb

name:
  [percentile]: [time]

categories_admin:
  50: 123
  75: 136
  90: 147
  99: 224
home_admin:
  50: 111
  75: 122
  90: 147
  99: 224
topic_admin:
  50: 69
  75: 81
  90: 89
  99: 185
categories:
  50: 81
  75: 126
  90: 138
  99: 211
home:
  50: 53
  75: 63
  90: 100
  99: 187
topic:
  50: 20
  75: 21
  90: 23
  99: 77
timings:
  load_rails: 3052
ruby-version: 2.1.2-p95
rss_kb: 241604
```
vs. the same benchmark on Rails 4.1.8:
```
categories_admin:
  50: 62
  75: 66
  90: 77
  99: 193
home_admin:
  50: 51
  75: 53
  90: 55
  99: 175
topic_admin:
  50: 27
  75: 27
  90: 28
  99: 87
categories:
  50: 53
  75: 55
  90: 98
  99: 173
home:
  50: 35
  75: 37
  90: 55
  99: 154
topic:
  50: 12
  75: 12
  90: 13
  99: 66
timings:
  load_rails: 3047
ruby-version: 2.1.2-p95
rss_kb: 263948
```
Across the board we were running at around half speed: pages that were taking 12ms on Rails 4.1.8 were taking 20ms on Rails 4.2.
Given these results I decided to take some time off my regular work to isolate what went wrong, contribute a few patches, and help the Rails team correct the issues prior to the impending RC release.
Before I continue I think it is worth mentioning that the lion’s share of the performance optimisations we ended up adding were authored by Sean Griffin so a big huge thanks to him.
###Cracking open the black box
I find flame graphs are instrumental in quickly discovering what is going on. Flame graphs in Ruby 2.1 now have better fidelity than they did in 2.0, thanks to feeding directly off stackprof.
My methodology is quite simple.
- Go to a page in Rails 4.1.8 mode, reload a few times, then tack on `?pp=flamegraph` to see a flame graph (a sketch of the setup this relies on follows this list)
- Repeat the process on Rails 4.2
- Zoom in on similar areas and look for performance regressions
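This relies on rack-mini-profiler's flame graph support. As a minimal sketch (assuming the gem names I recall, not Discourse's exact Gemfile), the setup looks something like this:

```ruby
# Gemfile - a sketch of the profiling setup assumed above. rack-mini-profiler
# serves the ?pp=flamegraph page, stackprof does the sampling on Ruby 2.1+,
# and the flamegraph gem renders the result.
gem 'rack-mini-profiler'
gem 'stackprof'
gem 'flamegraph'
```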
For example, here is the before graph on Rails 4.1.8 (home page):
rails_baseline.html (214.9 KB)
Compared to the after graph on Rails master (home page):
rails_master_v1.html (253.6 KB)
The first point of interest is the sample count:
Rails 4.1.8 vs. master (sample count screenshots)
Flamegraph attempts to grab a sample every 0.5ms of wall time; in reality, due to depth of stack, timers and so on, you end up getting a full sample every 0.5ms - 2ms. That said, we can use the sample counts as a good rough estimate of execution time.
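As a rough point of reference, here is a minimal sketch of that kind of wall-time sampling using stackprof directly; the 500µs interval mirrors the 0.5ms described above, and the block contents are a placeholder:

```ruby
require 'stackprof'

# Sample the call stack every 500 microseconds of wall time while the block
# runs, then print how many samples were collected.
profile = StackProf.run(mode: :wall, interval: 500, raw: true) do
  # code under test, e.g. render the home page
end
puts profile[:samples]
```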
It looks like Rails master is taking up 108 samples where current Rails was taking 74, a 46% increase in the above example.
We can see a huge amount of the additional overhead is spent in arel.
Next we zoom in to particular similar areas and compare:
Rails 4.1.8 vs. master (zoomed flame graph screenshots)
We can see expensive calls to `reorder_bind_params` that we were not seeing in Rails 4.1.8.
Clicking on the frame reveals that 14.5% of our cost is in this method.
Additionally, we see other new areas of overhead, like `method_missing` being invoked when casting DB values.
We repeat the process for various pages in our app, and discover that 3.2% of the time on the categories page is spent in `ActiveRecord::AttributeMethods::PrimaryKey::ClassMethods#primary_key`, and a pile of other little cuts.
With this information we can start working on various fixes: a lot of little cuts and some big cuts. The list goes on, and slowly but surely we regain performance.
###Memory profiling
Another very important tool in our arsenal is the memory_profiler gem. If we include it in our Gemfile, rack-mini-profiler will wire it up for us; appending `?pp=profile-gc-ruby-head` to any page then gives us a memory report (note this requires Ruby 2.1, as it provides the new APIs needed):
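A similar report can also be produced by calling the memory_profiler gem directly; a minimal sketch (the block contents are a placeholder):

```ruby
require 'memory_profiler'

# Collect allocation statistics for everything allocated inside the block,
# then print the report (allocations by gem, file, location and string).
report = MemoryProfiler.report do
  # code under test, e.g. render the front page
end
report.pretty_print
```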
The report reveals a huge allocation increase in Rails master, which was allocating 131314 objects on the front page vs 17651 objects in Rails 4.1.8.
We can dig down and discover a huge addition of allocations in arel and activerecord:

```
arel/lib x 107607
activerecord/lib x 12778
```

We can see the vast majority is being allocated at:

```
arel/lib/arel/visitors/visitor.rb x 106435
```
We can track down the 3 particular lines:

```
arel/lib/arel/visitors/visitor.rb:11 x 74432
lib/arel/visitors/visitor.rb:13 x 22880
lib/arel/visitors/visitor.rb:12 x 7040
```
This kind of analysis helps us track down and fix a number of allocation issues.
It is very important to be able to prove you have a real issue before applying optimisations blindly. Memory profiler allows us to pinpoint specifics and demonstrate the actual gain.
It is fascinating to see which particular strings are allocated where. For example, the empty string is allocated 702 times during a front page load:

```
"" x 702
activerecord/lib/active_record/connection_adapters/postgresql_adapter.rb:433 x 263
activerecord/lib/active_record/type/boolean.rb:11 x 153
omniauth-1.2.2/lib/omniauth/strategy.rb:396 x 84
```
We can work through this information and, paper cut by paper cut, eliminate a large number of allocations, resulting in better performance.
###Where are we with Rails 4.2 RC1?
```
categories_admin:
  50: 69
  75: 74
  90: 86
  99: 165
home_admin:
  50: 55
  75: 57
  90: 66
  99: 133
topic_admin:
  50: 27
  75: 27
  90: 28
  99: 108
categories:
  50: 56
  75: 58
  90: 60
  99: 192
home:
  50: 39
  75: 40
  90: 42
  99: 165
topic:
  50: 13
  75: 13
  90: 14
  99: 75
```
So it looks like we are more or less 10% slower than 4.1, which is a huge improvement over where we were a couple of weeks ago (considering we were almost 100% slower).
There is still plenty of work left to do, but we are on the right path.
###A taste of things to come
Lars Kanis, a maintainer of the pg gem, has been working for a while now on a system for natively casting values in the pg gem. Up till now the pg gem has followed libpq quite closely: when you query the DB using libpq you are always provided with strings, and historically the pg gem just echoed them back to the Ruby program.
So, for example:

```ruby
ActiveRecord::Base.connection.raw_connection.exec("select 1").values[0][0]
=> "1"
```
The new pg release allows us to get back a number here, as opposed to a string. This helps us avoid allocating, converting and garbage collecting a string, just to get a number from our database. The cost of these conversions adds up quite a lot. It is much cheaper to do conversions in C code and avoid the allocation of objects that the Ruby GC needs to track.
```ruby
cnn = ActiveRecord::Base.connection.raw_connection; puts
cnn.type_map_for_results = PG::BasicTypeMapForResults.new cnn; puts
cnn.exec('select 1').values[0][0]
=> 1
```
The tricky thing though is that people heavily over-query when using Active Record.
```ruby
# no need to perform conversions for created_at and updated_at
car_color = Car.first.color

# above is more efficiently written as
car_color = Car.limit(1).pluck(:color).first
```
If you perform conversions upfront for columns people never consume you can end up with a severe performance hit. In particular converting times is a very expensive operation which involves string parsing.
In my benchmarking it is showing a 5-10% improvement for the Discourse bench, which in almost all cases means we are faster than Rails 4.1.
I think it is fairly likely that we can be 10-20% faster than Rails 4.1 by the time Rails 5 ships provided we watch the benchmarks closely and keep optimising.
###Side Effects
A very interesting and rewarding side effect of this work was a series of optimisations I was able to apply to Discourse while looking at all the data.
In particular I built us a distributed cache powered by the message_bus gem: a Hash-like structure that is usable from any process running Discourse (Discourse instances can span multiple machines). Using it I was able to apply caching in three places where we sorely needed it and avoid expensive database work.
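To illustrate the idea (a simplified sketch, not Discourse's actual DistributedCache; the class and channel names here are made up), a Hash-like cache kept in sync across processes via message_bus might look something like this:

```ruby
require 'message_bus'

# A toy Hash-like cache kept in sync across processes via the message_bus gem.
class TinyDistributedCache
  CHANNEL = "/tiny_distributed_cache"

  def initialize
    @hash = {}
    # Every process subscribes to the channel and applies changes published
    # by other processes to its local copy.
    MessageBus.subscribe(CHANNEL) do |msg|
      key, value = msg.data
      @hash[key] = value
    end
  end

  def [](key)
    @hash[key]
  end

  def []=(key, value)
    @hash[key] = value
    # Broadcast the change so other processes update their local copies.
    MessageBus.publish(CHANNEL, [key, value])
  end
end
```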
###Conclusion
The recent performance work strongly highlights the importance of a long running benchmark. There is no single magical method we can fix that will recover a year of various performance regressions. The earlier we pick up on regressions the easier they are to fix.
It also highlights the “paper cut” nature of this kind of work. It is easy to be flippant and say that 1% here and 0.5% there does not matter. However, repeat this process enough and you have a product that is 100% slower. We need to look at trends to see where we are at, slow down development, and work on performance when it regresses.
The Rails team did an admirable job at addressing a lot of the perf regressions we found in very short order. Sean did some truly impressive work. Godfrey did a spectacular job getting Discourse running on Rails master.
However, I am very sad that we do not have a long running benchmark for either Ruby or Rails, and worry that if we don’t, we are bound to repeat this story again.
Thanks for this post, super useful - a bunch of tricks I didn’t know.

`.limit(1)` isn’t necessary with `.first`, as `first` does `ORDER BY "cars"."id" ASC LIMIT 1`. Similarly you can give `.first` or `.last` a parameter to limit by some amount, like `.first(3)` to get the first 3 - so `.limit` is basically entirely unnecessary.