Speeding up Rails 4.2

over 9 years ago

Recently Godfrey Chan got Discourse working on Rails Master. It was a rather long task that involved some changes to Discourse internals and some changes to Rails internals.

Knowing Rails 4.2 was just around the corner, it seemed like the perfect time to check on performance. Since Rails 4.2 contains the Adequate Record patches, I was super eager to see how Discourse fared.

### The Discourse benchmark

People usually answer the question of “how fast is Ruby?” or “how fast is Rails?” with the following set of steps.

  1. Upgrade production to a certain version of Rails
  2. Look at New Relic reports
  3. Depending on the results, either complain on Twitter and/or blog about it, or write a few kind words.

The trouble is that after release is the absolute worst time for performance testing. Code is already in production, fixing things is tricky, and rolling back is often impractical.

At Discourse I developed the Discourse benchmark. It is a simple script that loads up a few thousand users and topics and then proceeds to measure performance of various hot spots using apache bench.

I have found that the results of the benchmark are strikingly similar to real-world performance we see. If the benchmark is faster, it is very likely that it will also be faster in production. We use this benchmark to test Ruby before major Ruby releases. Koichi used this benchmark to help optimise the current GC algorithms in MRI.
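The benchmark's output (shown later in this post) is a set of response-time percentiles per page. As a quick illustration of how such numbers are derived, here is a minimal nearest-rank percentile sketch; the `percentile` helper is made up for illustration, the real script gets these figures from apache bench:

```ruby
# Given a list of response times in ms, report the same percentiles
# the Discourse benchmark prints (50, 75, 90, 99).
def percentile(times, pct)
  sorted = times.sort
  # index of the nearest-rank percentile
  rank = ((pct / 100.0) * sorted.length).ceil - 1
  sorted[[rank, 0].max]
end

timings = [12, 14, 13, 20, 15, 13, 77, 14, 13, 12]
[50, 75, 90, 99].each do |pct|
  puts "#{pct}: #{percentile(timings, pct)}"
end
```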

It is often tempting to look at micro benchmarks when working on performance issues. Micro benchmarks are a great tool, but they MUST be followed with bigger macro benchmarks to see the real impact. A 1000% speedup for a routine that is called once a day in a background job has significantly less impact than a 0.5% improvement to a routine that is called on every web request.
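To put rough numbers on that trade-off, here is a back-of-the-envelope comparison with hypothetical figures (a million requests a day, a 50ms hot routine, a 100ms daily job sped up 10x):

```ruby
# Compare total time saved per day: a tiny win on the hot path
# dwarfs a massive win on a rarely-run job.
requests_per_day = 1_000_000
hot_route_ms     = 50.0   # per-request time in the hot routine
background_ms    = 100.0  # the once-a-day background job

saved_hot        = requests_per_day * hot_route_ms * 0.005  # 0.5% faster
saved_background = background_ms * (1 - 1 / 10.0)           # 10x faster

puts "hot path saves   #{saved_hot / 1000.0} seconds/day"   # 250.0
puts "background saves #{saved_background / 1000.0} seconds/day"
```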

### How was Rails 4.2 Looking?

For over a year now, Discourse has had the ability to dual boot, which allowed me to quickly run a simple benchmark to see where we were at (arel, rails):

% RAILS_MASTER=1 ruby script/bench.rb

  [percentile]: [time]

Rails master:

  50: 123
  75: 136
  90: 147
  99: 224
  50: 111
  75: 122
  90: 147
  99: 224
  50: 69
  75: 81
  90: 89
  99: 185
  50: 81
  75: 126
  90: 138
  99: 211
  50: 53
  75: 63
  90: 100
  99: 187
  50: 20
  75: 21
  90: 23
  99: 77
  load_rails: 3052
ruby-version: 2.1.2-p95
rss_kb: 241604


Rails 4.1.8:
  50: 62
  75: 66
  90: 77
  99: 193
  50: 51
  75: 53
  90: 55
  99: 175
  50: 27
  75: 27
  90: 28
  99: 87
  50: 53
  75: 55
  90: 98
  99: 173
  50: 35
  75: 37
  90: 55
  99: 154
  50: 12
  75: 12
  90: 13
  99: 66
  load_rails: 3047
ruby-version: 2.1.2-p95
rss_kb: 263948

Across the board we were running at around half speed. Pages that were taking 12ms today were taking 20ms on Rails 4.2.

Given these results I decided to take some time off my regular work and try to isolate what went wrong, contribute a few patches, and help the Rails team correct the issue prior to the impending RC release.

Before I continue I think it is worth mentioning that the lion’s share of the performance optimisations we ended up adding were authored by Sean Griffin so a big huge thanks to him.

### Cracking open the black box

I find flame graphs instrumental in quickly discovering what is going on. Flame graphs in Ruby 2.1 have better fidelity than they did in 2.0 thanks to feeding directly off stackprof.

My methodology is quite simple.

  1. Go to a page in Rails 4.1.8 mode, reload a few times, then tack on ?pp=flamegraph to see a flame graph
  2. Repeat the process on Rails 4.2
  3. Zoom in on similar areas and look for performance regressions.
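For intuition about what the graphs below contain: a flame graph is just an aggregation of stacks sampled on a timer. The toy wall-clock sampler below only illustrates that idea; the real graphs are produced from stackprof data by rack-mini-profiler:

```ruby
# Capture the main thread's backtrace on a timer; a flame graph
# visualizer aggregates stacks like these into frames.
def sample_stacks(interval: 0.0005)
  target  = Thread.current
  samples = []
  sampler = Thread.new do
    loop do
      samples << target.backtrace
      sleep interval
    end
  end
  yield
  sampler.kill
  samples
end

def busy_work
  50_000.times.reduce(:+)
end

stacks = sample_stacks { sleep 0.01; busy_work }
puts "captured #{stacks.size} samples"
```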

For example here is the before graph on Rails 4.1.8 (home page)

rails_baseline.html (214.9 KB)

Compared to the after graph on Rails Master (home page)

rails_master_v1.html (253.6 KB)

The first point of interest is the sample count:

[screenshots: sample counts for Rails 4.1.8 vs Rails master]
Flamegraph attempts to grab a sample every 0.5ms of wall time; in reality, due to stack depth, timers and so on, you end up getting a full sample every 0.5ms to 2ms. That said, we can use the sample counts as a good rough estimate of execution time.

It looks like Rails master is taking up 108 samples where current Rails was taking 74, a 46% increase in the above example.

We can see a huge amount of the additional overhead is spent in arel.

Next we zoom in on similar areas and compare:

[screenshots: zoomed flame graphs for Rails 4.1.8 vs Rails master]
We can see expensive calls to reorder_bind_params that we were not seeing in Rails 4.1.8

Clicking on the frame reveals that 14.5% of our cost is in this method.

Additionally we see other new areas of overhead, like method_missing being invoked when casting DB values.

We repeat the process for various pages in our app, and discover that 3.2% of the time on the categories page is spent in ActiveRecord::AttributeMethods::PrimaryKey::ClassMethods#primary_key and a pile of other little cuts.

With this information we can start working on various fixes. There are a lot of little cuts and some big cuts; here are some examples:

The list goes on; slowly but surely we regain performance.

### Memory profiling

Another very important tool in our arsenal is memory_profiler. If we include it in our Gemfile, rack-mini-profiler will wire it up for us; appending ?pp=profile-gc-ruby-head to any page then gives us a memory report (note this requires Ruby 2.1, as it provides the new APIs needed):

It reveals a huge allocation increase: Rails master was allocating 131314 objects on the front page vs 17651 objects in Rails 4.1.8

We can dig down and discover a huge addition of allocations in arel and activerecord

arel/lib x 107607
activerecord/lib x 12778

We can see the vast majority is being allocated at:

arel/lib/arel/visitors/visitor.rb x 106435

We can track it down to 3 particular lines:

arel/lib/arel/visitors/visitor.rb:11 x 74432
arel/lib/arel/visitors/visitor.rb:13 x 22880
arel/lib/arel/visitors/visitor.rb:12 x 7040
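All of this per-line data rests on Ruby 2.1's new ObjectSpace allocation-tracing APIs, which memory_profiler builds on. A minimal sketch of pulling the same kind of report directly; the `allocations_by_line` helper is a made-up reduction of what the gem does:

```ruby
require 'objspace'

# Count objects allocated inside the block, grouped by file:line,
# roughly how memory_profiler attributes allocations.
def allocations_by_line
  GC.disable
  ObjectSpace.trace_object_allocations { yield }
  counts = Hash.new(0)
  ObjectSpace.each_object do |obj|
    file = ObjectSpace.allocation_sourcefile(obj)
    next unless file  # skip objects allocated outside the traced block
    counts["#{file}:#{ObjectSpace.allocation_sourceline(obj)}"] += 1
  end
  counts
ensure
  GC.enable
end

counts = allocations_by_line do
  10.times.map { "hello " + "world" }
end
counts.sort_by { |_, n| -n }.first(3).each { |loc, n| puts "#{loc} x #{n}" }
```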

This kind of analysis helps us fix issues like:


It is very important to be able to prove you have a real issue before applying optimisations blindly. Memory profiler allows us to pinpoint specifics and demonstrate the actual gain.

It is fascinating to see which particular strings are allocated where. For example the empty string is allocated 702 times during a front page load:

"" x 702
    activerecord/lib/active_record/connection_adapters/postgresql_adapter.rb:433 x 263
    activerecord/lib/active_record/type/boolean.rb:11 x 153
    omniauth-1.2.2/lib/omniauth/strategy.rb:396 x 84

We can work through this information and, paper cut by paper cut, eliminate a large number of allocations, resulting in better performance.
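A cheap way to prove an individual paper cut is really fixed is to diff GC.stat's allocation counter around the suspect code. A sketch, using a frozen constant as the typical fix for a hot-path empty string (the names here are illustrative):

```ruby
# Diff total_allocated_objects around a block to count allocations.
def allocated_objects
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

EMPTY = "".freeze

naive  = allocated_objects { 1_000.times.map { "" } }     # fresh string each time
frozen = allocated_objects { 1_000.times.map { EMPTY } }  # shared frozen string

puts "naive:  #{naive}"
puts "frozen: #{frozen}"
```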

### Where are we with Rails 4.2 RC1?

  50: 69
  75: 74
  90: 86
  99: 165
  50: 55
  75: 57
  90: 66
  99: 133
  50: 27
  75: 27
  90: 28
  99: 108
  50: 56
  75: 58
  90: 60
  99: 192
  50: 39
  75: 40
  90: 42
  99: 165
  50: 13
  75: 13
  90: 14
  99: 75

So it looks like we are more or less 10% slower than 4.1, which is a huge improvement over where we were a couple of weeks ago (considering we were almost 100% slower).

There is still plenty of work left to do, but we are on the right path.

### A taste of things to come

Lars Kanis, a maintainer of the pg gem, has been working for a while now on a system for natively casting values in the pg gem. Up till now the pg gem has followed libpq quite closely: when you query the DB using libpq you are always handed back strings, and historically the pg gem just echoed them back to the Ruby program.

So for example:

 ActiveRecord::Base.connection.raw_connection.exec("select 1").values[0][0]
=> "1"

The new pg release allows us to get back a number here, as opposed to a string. This helps us avoid allocating, converting and garbage collecting a string, just to get a number from our database. The cost of these conversions adds up quite a lot. It is much cheaper to do conversions in C code and avoid the allocation of objects that the Ruby GC needs to track.

cnn = ActiveRecord::Base.connection.raw_connection; puts
cnn.type_map_for_results = PG::BasicTypeMapForResults.new cnn; puts
cnn.exec('select 1').values[0][0]
=> 1

The tricky thing though is that people heavily over-query when using Active Record.

# no need to perform conversions for created_at and updated_at
car_color = Car.first.color

# above is more efficiently written as
car_color = Car.limit(1).pluck(:color).first

If you perform conversions upfront for columns people never consume, you can end up with a severe performance hit. In particular, converting times is a very expensive operation that involves string parsing.
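One way around that tension is to keep the raw strings from libpq and only cast when a column is actually read, memoizing the result. The toy class below sketches that idea; the class, its columns and casts are all made up for illustration, not how pg's type maps actually work:

```ruby
require 'time'

# Raw DB values stay as strings; a cast runs (once) only when the
# attribute is actually read, so untouched columns cost nothing.
class LazyRow
  CASTS = {
    "id"         => ->(v) { Integer(v) },
    "created_at" => ->(v) { Time.parse(v) },  # the expensive one
    "color"      => ->(v) { v },
  }

  def initialize(raw)
    @raw  = raw
    @cast = {}
  end

  def [](column)
    @cast[column] ||= CASTS.fetch(column).call(@raw.fetch(column))
  end
end

row = LazyRow.new("id" => "42", "created_at" => "2014-12-01 10:00:00", "color" => "red")
puts row["color"]  # prints "red"; Time.parse for created_at never runs
```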

In my benchmarking it shows a 5-10% improvement on the Discourse bench, which in almost all cases means we are faster than Rails 4.1.

I think it is fairly likely that we can be 10-20% faster than Rails 4.1 by the time Rails 5 ships provided we watch the benchmarks closely and keep optimising.

### Side Effects

A very interesting and rewarding side effect of this work was a series of optimisations I was able to apply to Discourse while looking at all the data.

In particular I built us a distributed cache powered by the message_bus gem: a Hash-like structure that is usable from any process running Discourse (Discourse instances can span multiple machines). Using it I was able to apply caching in 3 places that sorely needed it and avoid expensive database work.
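The shape of that cache is easy to sketch: a plain Hash per process, kept in sync by replaying writes published on a bus. Below is a stripped-down toy with an in-process stand-in for message_bus; the real DistributedCache also deals with races, versioning and cross-machine delivery:

```ruby
# A trivial in-process pub/sub bus standing in for message_bus.
class TinyBus
  def initialize
    @subs = []
  end

  def subscribe(&blk)
    @subs << blk
  end

  def publish(msg)
    @subs.each { |s| s.call(msg) }
  end
end

# Hash-like cache: every write goes through the bus, so every
# subscribed copy (one per process) applies the same change.
class DistributedCache
  def initialize(bus)
    @hash = {}
    bus.subscribe { |(key, value)| @hash[key] = value }
    @bus = bus
  end

  def [](key)
    @hash[key]
  end

  def []=(key, value)
    @bus.publish([key, value])  # the bus echoes the write back to us too
  end
end

bus = TinyBus.new
a = DistributedCache.new(bus)
b = DistributedCache.new(bus)
a[:site_title] = "Discourse"
puts b[:site_title]  # prints "Discourse": the write propagated via the bus
```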


The recent performance work strongly highlights the importance of a long running benchmark. There is no single magical method we can fix that will recover a year of various performance regressions. The earlier we pick up on regressions the easier they are to fix.

It also highlights the “paper cut” nature of this kind of work. It is easy to be flippant and say that 1% here and 0.5% there does not matter. However, repeat this process enough times and you have a product that is 100% slower. We need to look at trends to see where we are at, and slow down dev to work on performance when performance regresses.
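The compounding is easy to check with hypothetical figures: seventy separate 1% regressions are enough to roughly double response times.

```ruby
# Seventy 1% regressions compound multiplicatively to about 2x.
slowdown = 1.0
70.times { slowdown *= 1.01 }
puts format("%.2fx", slowdown)  # ~2.01x
```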

The Rails team did an admirable job at addressing a lot of the perf regressions we found in very short order. Sean did some truly impressive work. Godfrey did a spectacular job getting Discourse running on Rails master.

However, I am very sad that we do not have a long-running benchmark for either Ruby or Rails, and I worry that if we don't get one we are bound to repeat this story.


Michael Cohen over 9 years ago

Thanks for this post, super useful - a bunch of tricks I didn’t know.

.limit(1) isn’t necessary with .first, as first does ORDER BY "cars"."id" ASC LIMIT 1. similarly you can give .first or .last a parameter to limit by some amount, like .first(3) to get the first 3 - so .limit is basically entirely unnecessary.

Sam Saffron over 9 years ago

I was being pretty careful about my example here:

require 'benchmark/ips'
require File.expand_path("../../config/environment", __FILE__)

conn = ActiveRecord::Base.connection.raw_connection

Benchmark.ips do |b|
  # NOTE: the report bodies below are reconstructed from the labels
  # and the results; the original post elided them
  b.report("simple") do
    User.first.name
  end

  b.report("simple with select") do
    User.select(:name).first.name
  end

  b.report("pluck with first") do
    User.pluck(:name).first
  end

  b.report("pluck with limit") do
    User.limit(1).pluck(:name).first
  end

  b.report("raw") do
    conn.exec("SELECT name FROM users LIMIT 1").getvalue(0,0)
  end
end

              simple      1.670k (± 8.8%) i/s -      8.379k
  simple with select      2.471k (± 7.4%) i/s -     12.349k
    pluck with first     27.217  (±29.4%) i/s -    120.000 
    pluck with limit      2.542k (±11.1%) i/s -     12.546k
                 raw      9.302k (± 4.4%) i/s -     46.854k


 conn.exec("SELECT name FROM users LIMIT 1").getvalue(0,0)

is still 4 times faster than doing it the most efficient way in AR, but pluck with limit is the fastest AR has to offer for this problem.

Sean Griffin over 9 years ago

Challenge accepted!!

Michael Cohen over 9 years ago

Ah whoops, I didn’t notice that in your case your .first is on the array returned by .pluck, rather than one from ActiveRecord. Sorry!

Sam Saffron over 9 years ago

No problem at all, let me know if you have any other questions / clarifications needed! :slight_smile:

Paul Kmiec over 9 years ago

Wow. This is surprising given that 4.2 was hyped as faster due to AdequateRecord. I’m assuming these performance numbers would be even worse without AdequateRecord?

On that note, could you tell how much AdequateRecord is helping real applications by running numbers with AdequateRecord on and with its cache off (perhaps that requires a monkey patch)?

Thanks for your work.

Sam Saffron over 9 years ago

AdequateRecord seems to be doing a pretty good job with trivial queries, however once they get complicated the gain is a lot less.

Additionally in certain cases you may want to bypass it altogether, see: PostgreSQL: Re: [GENERAL] Performance issue with libpq prepared queries on 9.3 and 9.4

I am not sure how we would go about disabling and benching, will ask Aaron.

An example of what I see is:

require 'active_support'
require 'active_record'
require 'benchmark/ips'

p ActiveRecord::VERSION::STRING

ActiveRecord::Base.establish_connection adapter: 'postgresql', database: 'discourse_development'

ActiveRecord::Base.connection.raw_connection.exec('drop table people') rescue nil

ActiveRecord::Base.connection.instance_eval do
  create_table(:people) { |t| t.string :name; t.string :email; t.integer :age; t.timestamps }
end

class Person < ActiveRecord::Base; end

person = Person.create! name: 'Aaron', email: 'something@something.com', age: 18

id   = person.id
name = person.name
email = person.email

Benchmark.ips do |x|
  x.report('find')         { Person.find id }
  x.report('find_by_name') { Person.find_by_name name }
  x.report('find by name') { Person.find_by(name: name) }
  x.report('find where name limit 1') { Person.where(name: name).limit(1).to_a }
  x.report('find where name email limit 1 offset') { Person.where(name: name).where(email: email).limit(1).offset(0).to_a }
  x.report('first person') { Person.first }
end

Rails 4.1.8

                find      3.239k (± 2.0%) i/s -     16.380k
        find_by_name      2.645k (± 2.7%) i/s -     13.409k
        find by name      2.628k (± 2.2%) i/s -     13.356k
find where name limit 1
                          2.628k (± 2.7%) i/s -     13.356k
find where name email limit 1 offset
                          2.257k (± 4.4%) i/s -     11.440k
        first person      2.897k (± 2.1%) i/s -     14.575k

Rails 4.2.0-rc1

                find      4.633k (± 1.7%) i/s -     23.490k
        find_by_name      4.567k (± 1.5%) i/s -     22.843k
        find by name      4.515k (± 2.0%) i/s -     22.843k
find where name limit 1
                          2.980k (± 6.3%) i/s -     14.841k
find where name email limit 1 offset
                          2.653k (± 2.8%) i/s -     13.260k
        first person      2.768k (± 2.2%) i/s -     14.092k

As soon as queries get a bit involved the perf boost goes way down; in Discourse we usually run fairly complex queries involving lots of data, so we are seeing less gain here.

So for example find_by(name:) on its own is say 80% faster, but once you start to involve lots of conditions (limit, offset, where) the gain becomes smaller.

Sean Griffin over 9 years ago

Only find and find_by are cached. We don't cache Relation instances, so code using where or default_scope is not going to see the performance bump from adequate record caching.

Sam Saffron over 9 years ago

Yeah, Discourse has a default scope on almost all models:

Any model (which is pretty much all models) that shares this concern gets no caching at the moment :frowning:

Lephyrius over 9 years ago

Has anyone tried the benchmark on something like Sequel?
Is Sequel faster than ActiveRecord?

I'm pretty concerned about the performance of my Rails app now.

Brian Hempel over 9 years ago

Does http://www.isrubyfastyet.com/ not count as a long-running Ruby benchmark for your purposes because it’s not fine-grained enough?

Admittedly, IRFY has a bad habit of falling behind. Right now, it’s stuck because after upgrading to OS X 10.9, Ruby performance dropped inexplicably. And this morning, the display on the benchmarking machine died; I may or may not be able to press it in the right place to cure the short… :slight_smile:

Sam Saffron over 9 years ago

First of all, regardless of all the talking etc, you have something real out there which definitely counts for a lot.

However, I envision a much more targeted long running benchmark.

The vast majority of the work we find ourselves doing server side at Discourse is ActiveRecord. The actual test running there is not really simulating a real app, because there are very few apps out there that can do more than say 20 reqs/s in production.

Additionally, you are not tracking how Rails is progressing: is Rails getting faster or slower?

At the moment there are 2 tests I would love to see:

  1. Discourse bench over time on Ruby head
  2. Discourse bench over time on Rails head

Additionally to help nail down particular issues in Ruby I would like to see

  1. Ruby bench ( https://github.com/ruby/ruby/tree/trunk/benchmark ) over time on Ruby head.

Comparing implementations is something that I think is less critical at the moment.

Brian Hempel over 9 years ago

Is the Discourse benchmark stable?

I’m actually fairly open to switching up the benchmarking right now. The historical data is nice, but I’m not having a lot of success figuring out what happened between Mac OS X 10.6 and 10.9 that made the benchmark run slower. (It may be something in the TCP stack, Ruby is using something like 95% CPU while “discoveryd” is using 5%). If IRFY has to break with historical data, might as well switch to a more standardized benchmark.

I haven’t spent much time with the Ruby benchmark suite. I’m guessing that, for that to be useful, you’d want results for each individual benchmark. Maybe that will be a good Christmas-time project…

Sam Saffron over 9 years ago

The Discourse benchmark is pretty stable at the moment.

Be sure to pop in on this topic

A few others are experimenting with this at the moment and it would be better if there was one central effort.

Claudiu over 9 years ago

Is it better to update to 4.2, or to wait for minor releases and speed improvements?

For the moment we are trying to clean up queries by bypassing some AR relations and the millions of objects that get created. I still believe AR has become too big and fat for what it offers.

merry christmas! :smile:

Sam Saffron over 9 years ago

My recommendation would be to upgrade to 4.2 if there are particular features and fixes in 4.2 you are eager to use.
