One of the very noble goals of the Ruby community, spearheaded by Matz, is the Ruby 3x3 plan. The idea is that by applying a large amount of modern optimization work we can make the Ruby interpreter 3 times faster. It is an ambitious goal, which is notable and inspiring. This “movement” has triggered quite a lot of interesting experiments in Ruby core, including a just-in-time compiler and work on reducing memory bloat out-of-the-box. If Ruby gets faster and uses less memory, then everyone gets free performance, which is exactly what we all want.

A big problem though is that there is only so much magic a faster Ruby can achieve. A faster Ruby is not going to magically fix a “bubble sort” hiding deep in your code. Active Record has tons of internal waste that ought to be addressed, and addressing it could make the vast majority of Ruby applications in the wild a lot faster. Rails is the largest consumer of Ruby after all, and Rails is underpinned by Active Record.

Sadly, Active Record performance has not gotten much better since the days of Rails 2; in fact, in quite a few cases it got slower or a lot slower.

Active Record is very wasteful

I would like to start off with a tiny example:

Say I have a typical 30 column table containing Topics.

If I run the following, how much will Active Record allocate?

a = []
Topic.limit(1000).each do |u|
   a << u.id
end
Total allocated: 3835288 bytes (26259 objects)

Compare this to an equally inefficient “raw version”.

sql = -"select * from topics limit 1000"
ActiveRecord::Base.connection.raw_connection.async_exec(sql).column_values(0)
Total allocated: 8200 bytes (4 objects)
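
For anyone wanting to reproduce numbers like these: they come from the memory_profiler gem. A minimal sketch of such a measurement, assuming the same Topic model, looks roughly like this (the full script is linked at the end of the post):

require 'memory_profiler'

report = MemoryProfiler.report do
  a = []
  Topic.limit(1000).each do |u|
    a << u.id
  end
end

report.pretty_print # prints "Total allocated: ... bytes (... objects)" among other stats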

This amount of waste is staggering; it translates to a deadly combo:

  • Extreme levels of memory usage

and

  • Slower performance

But … that is really bad Active Record!

An immediate gut reaction here is that I am “cheating” by writing “slow” Active Record code and comparing it to mega-optimized raw code.

One could argue that I should write:

a = []
Topic.select(:id).limit(1000).each do |u|
  a << u.id
end

In which case you would get:

Total allocated: 1109357 bytes (11097 objects)

Or better still:

Topic.limit(1000).pluck(:id) 

In which case I would get:

Total allocated: 221493 bytes (5098 objects)

Time for a quick recap.

  • The “raw” version allocated 4 objects; it was able to return 1000 Integers directly, which are not allocated individually in the Ruby heap and do not occupy garbage-collected object slots.

  • The “naive” Active Record version allocates 26259 objects

  • The “slightly optimised” Active Record version allocates 11097 objects

  • The “very optimised” Active Record version allocates 5098 objects

All of those numbers are orders of magnitude larger than 4.

How many objects does a “naive/lazy” implementation need to allocate?

One feature that Active Record touts as a huge advantage over Sequel is the “built-in” laziness.

ActiveRecord will not bother “casting” a column to a date until you try to use it, so if for any reason you over-select, ActiveRecord has your back. This deficiency in Sequel is acknowledged and deliberate:

This particular niggle makes it incredibly hard to move to Sequel from ActiveRecord without extremely careful review, despite Sequel being so incredibly fast and efficient.
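
To make that laziness concrete, here is a rough illustration of the observable behaviour (not exact internals, just a sketch):

topic = Topic.limit(1000).to_a.first # over-selects all 30 columns
topic.id                             # the id attribute is only type cast when it is first read
# topic.created_at is never cast to a Time unless something actually calls it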

We have no “fastest” example out there of an efficient lazy selector. In our case we are consuming 1000 ids, so we would expect a mega-efficient implementation to allocate 1020 or so objects, because we cannot get away without allocating a Topic object per row. We do not expect 26 thousand.

Here is a quick attempt at such an implementation (note this is just a proof of concept of the idea, not a production-level system):

$conn = ActiveRecord::Base.connection.raw_connection

class FastBase

  class Relation
    include Enumerable

    def initialize(table)
      @table = table
    end

    def limit(limit)
      @limit = limit
      self
    end

    def to_sql
      sql = +"SELECT #{@table.columns.join(',')} from #{@table.get_table_name}"
      if @limit
        sql << -" LIMIT #{@limit}"
      end
      sql
    end

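    # Run the query once, then yield lightweight row objects that only remember
    # the shared PG::Result and their row number (nothing is copied or cast here).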
    def each
      @results = $conn.async_exec(to_sql)
      i = 0
      while i < @results.cmd_tuples
        row = @table.new
        row.attach(@results, i)
        yield row
        i += 1
      end
    end

  end

  def self.columns
    @columns
  end

  def attach(recordset, row_number)
    @recordset = recordset
    @row_number = row_number
  end

  def self.get_table_name
    @table_name
  end

  def self.table_name(val)
    @table_name = val
    load_columns
  end

  def self.load_columns
    @columns = $conn.async_exec(<<~SQL).column_values(0)
      SELECT COLUMN_NAME FROM information_schema.columns
      WHERE table_schema = 'public' AND
        table_name = '#{@table_name}'
    SQL

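    # Define a lazy reader and a plain writer for each column; a value is only
    # pulled out of the PG::Result the first time its reader is called.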
    @columns.each_with_index do |name, idx|
      class_eval <<~RUBY
        def #{name}
          if @recordset && !@loaded_#{name}
            @loaded_#{name} = true
            @#{name} = @recordset.getvalue(@row_number, #{idx})
          end
          @#{name}
        end

        def #{name}=(val)
          @loaded_#{name} = true
          @#{name} = val
        end
      RUBY
    end
  end

  def self.limit(number)
    Relation.new(self).limit(number)
  end
end

class Topic2 < FastBase
  table_name :topics
end

Then we can measure:

a = []
Topic2.limit(1000).each do |t|
   a << t.id
end
a
Total allocated: 84320 bytes (1012 objects)

So … we can manage a similar API with 1012 object allocations as opposed to 26 thousand objects.

Does this matter?

A quick benchmark shows us:

Calculating -------------------------------------
               magic    256.149  (± 2.3%) i/s -      1.300k in   5.078356s
                  ar     75.219  (± 2.7%) i/s -    378.000  in   5.030557s
           ar_select    196.601  (± 3.1%) i/s -    988.000  in   5.030515s
            ar_pluck      1.407k (± 4.5%) i/s -      7.050k in   5.020227s
                 raw      3.275k (± 6.2%) i/s -     16.450k in   5.043383s
             raw_all    284.419  (± 3.5%) i/s -      1.421k in   5.002106s
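
For reference, the labels above come from a benchmark-ips run over small wrapper methods; a minimal sketch of such a harness (assuming methods named after each label, as defined in the full script linked at the end) looks like this:

require 'benchmark/ips'

Benchmark.ips do |x|
  x.report("magic")     { magic }
  x.report("ar")        { ar }
  x.report("ar_select") { ar_select }
  x.report("ar_pluck")  { ar_pluck }
  x.report("raw")       { raw }
  x.report("raw_all")   { raw_all }
end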

Our new implementation (that I call magic) does 256 iterations a second compared to Rails’ 75. It is a considerable improvement over the Rails implementation on multiple counts: it is much faster and it allocates significantly less memory, leading to reduced process memory usage. This is despite following the non-ideal practice of over-selection. In fact, our implementation is so fast it even beats Rails when Rails is careful to select only 1 column!

This is the Rails 3x3 we could have today with no changes to Ruby! :confetti_ball:

Another interesting data point is how much slower pluck, the turbo-boosted option Rails has to offer, is than raw SQL. In fact, at Discourse, we monkey patch pluck exactly for this reason. (I also have a Rails 5.2 version.)

Why is this bloat happening?

Looking at memory profiles I can see multiple reasons all this bloat happens:

  1. Rails is only sort-of lazy… I can see thousands of string allocations for columns we never look at. It is not “lazy-allocating”, it is partial “lazy-casting”.

  2. Every row allocates 3 additional objects for bookkeeping and magic: ActiveModel::Attribute::FromDatabase, ActiveModel::AttributeSet and ActiveModel::LazyAttributeHash. None of this is required; instead a single array could be passed around that holds indexes to the columns in the result set.

  3. Rails insists on dispatching casts to helper objects even if the data retrieved is already in “the right format” (e.g. a number); this work generates extra bookkeeping.

  4. Every column name we have is allocated twice per query; this could easily be cached and reused (if the query builder is aware of the column names it selected, it does not need to ask the result set for them).

What should be done?

I feel that we need to carefully review Active Record internals and consider an implementation that allocates significantly fewer objects per row. We also should start leveraging the PG gem’s native type casting to avoid pulling strings out of the database only to convert them back to numbers.
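
As a hint of what leaning on the PG gem could look like, here is a minimal sketch using the basic type map that ships with the pg gem (to be clear, this is not what Active Record does today):

conn = ActiveRecord::Base.connection.raw_connection
conn.type_map_for_results = PG::BasicTypeMapForResults.new(conn)

result = conn.async_exec("select id, created_at from topics limit 3")
result.column_values(0) # ids come back as Integers (and timestamps as Time objects), no Ruby-side casting layer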

You can see the script I used for this evaluation over here:

Comments

Janko Marohnić almost 6 years ago

Thanks for a very informative blog post!

I’ve extended the benchmark with Sequel:

Sequel additions
diff --git a/memory.rb b/memory.rb
index 2d9d5ee..4e32f6d 100644
--- a/memory.rb
+++ b/memory.rb
@@ -4,19 +4,32 @@ gemfile do
   source 'https://rubygems.org'
   gem 'pg'
   gem 'activerecord', '5.2.0'
+  gem 'sequel', '5.8.0'
   gem 'memory_profiler'
   gem 'benchmark-ips'
 end
 
 require 'active_record'
+require 'sequel'
 require 'memory_profiler'
 require 'benchmark/ips'
 
+system "createdb test_db"
+
 ActiveRecord::Base.establish_connection(
   :adapter => "postgresql",
   :database => "test_db"
 )
 
+DB = Sequel.postgres("test_db")
+
+at_exit do
+  ActiveRecord::Base.remove_connection
+  DB.disconnect
+  system "dropdb test_db"
+end
+
 pg = ActiveRecord::Base.connection.raw_connection
 
 pg.async_exec <<SQL
@@ -166,7 +179,6 @@ class FastBase
           end
           @#{name}
         end
-
         def #{name}=(val)
           @loaded_#{name} = true
           @#{name} = val
@@ -184,6 +196,9 @@ class Topic2 < FastBase
   table_name :topics
 end
 
+class TopicSequel < Sequel::Model(:topics)
+end
+
 def magic
   a = []
   Topic2.limit(1000).each do |t|
@@ -200,6 +215,14 @@ def ar
   a
 end
 
+def sequel
+  a = []
+  TopicSequel.limit(1000).each do |u|
+    a << u.id
+  end
+  a
+end
+
 def ar_select
   a = []
   Topic.select(:id).limit(1000).each do |u|
@@ -208,10 +231,22 @@ def ar_select
   a
 end
 
+def sequel_select
+  a = []
+  TopicSequel.select(:id).limit(1000).each do |u|
+    a << u.id
+  end
+  a
+end
+
 def ar_pluck
   Topic.limit(1000).pluck(:id)
 end
 
+def sequel_pluck
+  TopicSequel.limit(1000).select_map(:id)
+end
+
 def raw_all
   sql = -"select * from topics limit 1000"
   ActiveRecord::Base.connection.raw_connection.async_exec(sql).column_values(0)
@@ -238,8 +273,11 @@ end
 tests = %i{
   magic
   ar
+  sequel
   ar_select
+  sequel_select
   ar_pluck
+  sequel_pluck
   raw
   raw_all
 }

And these were the results:

                 Memory     Speed
magic           0.08 MB   118 i/s
ar              3.66 MB    25 i/s
sequel         22.06 MB     2 i/s
ar_select       1.06 MB    62 i/s
sequel_select   0.27 MB   370 i/s
ar_pluck        0.21 MB   690 i/s
sequel_pluck    0.24 MB   405 i/s
raw             0.01 MB  1817 i/s
raw_all         0.01 MB   137 i/s

So, what Jeremy Evans said in this blog post – “Active Record optimizes for inefficient queries and Sequel optimizes for efficient” – is really true, in both directions. Without a SELECT, Active Record is much faster and allocates much less memory than Sequel. But if we SELECT only the id, then Sequel allocates much less memory and performs much faster than Active Record.

If we also add the sequel_pg gem, that bumps up Sequel performance significantly:

                 Memory     Speed
magic           0.08 MB   118 i/s
ar              3.66 MB    25 i/s
sequel          2.61 MB     9 i/s
ar_select       1.06 MB    62 i/s
sequel_select   0.24 MB   581 i/s
ar_pluck        0.21 MB   690 i/s
sequel_pluck    0.02 MB  1241 i/s
raw             0.01 MB  1817 i/s
raw_all         0.01 MB   137 i/s

In this case Sequel allocates less memory than Active Record even when all columns are selected, despite Active Record’s lazy loading. So memory-wise maybe it’s not so dangerous to move to Sequel after all :slightly_smiling_face:

Steve almost 6 years ago

GitHub - CloudHealth/ar-ondemand (“Fast access to database results without the memory overhead of ActiveRecord objects”) contains a set of functions that help with memory bloat as well when you just need to read A LOT of data.

Note the for_reading, for_streaming, for_enumeration_reading, for_enumeration_streaming functions. These make it possible to read or stream millions of records with ActiveRecord (assuming you don’t need to eager_load relations) and prevent an OutOfMemory exception. We use these in MRI and JRuby.

Sean Griffin almost 6 years ago

Given that you know I’ve been working for over a year on fast_attributes to fix this, and you have access to that repository, I’m extremely disappointed that you did not include that in your comparison. In fact, it seems like you’re actively trying to imply that no work has been done on this on the AR side. I’m not entirely sure what the goal of this article was.

Sam Saffron almost 6 years ago

@Sean_Griffin sadly rails_fast_attributes does not fix this issue:

Original example with rails_fast_attributes:

Total allocated: 3476274 bytes (23259 objects)

If anything it gets a bit slower :frowning: It does allocate 3000 or so fewer objects, but the lion’s share is still there.

               magic    252.326  (± 2.0%) i/s -      1.274k in   5.051335s
                  ar     53.296  (± 9.4%) i/s -    265.000  in   5.033042s
           ar_select    152.251  (±11.8%) i/s -    756.000  in   5.084951s
            ar_pluck      1.427k (± 3.2%) i/s -      7.140k in   5.008473s
                 raw      3.393k (± 3.2%) i/s -     16.960k in   5.004625s
             raw_all    240.900  (±31.1%) i/s -      1.044k in   5.104016s

@Janko_Marohnic 9 operations a second for the sad path is sadly not going to cut it here when AR does 25 in the sad path. What Sequel does with pluck though is pretty damn amazing, and how Sequel performs in the “happy path” is also amazing.

Sean Griffin almost 6 years ago

Ah right, you’re only using one column and it is slightly slower in that case (since pluck should be used instead). The performance difference for that gem is linear with the number of fields you select: roughly equal at 2 columns, twice as fast around 10 columns.

I’m surprised to see that many objects still allocated. Do you have stats on what objects they are?

Sam Saffron almost 6 years ago

Absolutely, I would love to help here.

My reports in the blog post are all purposely abridged; I did not include full memory profiler dumps.

I highly recommend everyone reading this to check out GitHub - SamSaffron/memory_profiler: memory_profiler for ruby and look at some of the options there / run reports.

Looking at a detailed analysis at say: report.txt · GitHub

We can see:

16000 of the allocations happen in:

        def exec_query(sql, name = "SQL", binds = [], prepare: false)
          execute_and_clear(sql, name, binds, prepare: prepare) do |result|
            types = {}
            fields = result.fields
            fields.each_with_index do |fname, i|
              ftype = result.ftype i
              fmod  = result.fmod i
              types[fname] = get_oid_type(ftype, fmod, fname)
            end
            ActiveRecord::Result.new(fields, result.values, types)
          end
        end

In particular this is calling .values, which returns a giant array containing all the data. Hence my point that we defer the cast but do not defer the retrieve.

AR::Result, ActiveModel::LazyAttributeHash and ActiveModel::AttributeSet try to be lazy, but pulling the RVALUEs has already happened at that point.

AR::Result then shuffles the data around:

      def hash_rows
        @hash_rows ||=
          begin
            # We freeze the strings to prevent them getting duped when
            # used as keys in ActiveRecord::Base's @attributes hash
            columns = @columns.map { |c| c.dup.freeze }
            @rows.map { |row|
              # In the past we used Hash[columns.zip(row)]
              #  though elegant, the verbose way is much more efficient
              #  both time and memory wise cause it avoids a big array allocation
              #  this method is called a lot and needs to be micro optimised
              hash = {}

              index = 0
              length = columns.length

              while index < length
                hash[columns[index]] = row[index]
                index += 1
              end

              hash
            }
          end
      end

So, there is a layer of translation here between the native support already provided by PG::Result and AR. Instead, an `ActiveRecord::PGResult` type object that defers work and uses native interfaces would be far more efficient.
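
To make the contrast concrete, a tiny hypothetical example using the existing PG::Result API:

result = ActiveRecord::Base.connection.raw_connection.async_exec("select * from topics limit 1000")

all_values = result.values         # what exec_query effectively does: copy every cell into Ruby arrays up front
one_value  = result.getvalue(0, 0) # a deferring wrapper would only pull cells out when they are accessed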

Additionally there are lots of bookkeeping objects, some of which can be “collapsed” into a single object. In particular we have:

      1000  ActiveModel::Attribute::FromDatabase (per col)
      1000  ActiveModel::AttributeSet (per row)
      1000  ActiveModel::LazyAttributeHash (per row)

And the question I have is: do we need all of these objects? Can a proposed ActiveRecord::PGResult just learn how to talk “AttributeSet” and then erase all of this?

Keep in mind PG already has an extensive type mapper that covers almost all general needs. Recently Lars added Time / Date support (via a C extension), it has Integer, and so on. The type mapper is extensible, so ActiveRecord::PGResult could simply lean on that.

I know this makes AR internals less generic, but sadly a generic solution is too wasteful here, so leaning on well-defined interfaces that allow DB providers to “turbo charge” performance seems like a safer bet.

Sam Saffron almost 6 years ago

Keep in mind we had no example for a “magic select”, which would be painfully close to “raw” considering the diff between “magic” and “raw_all” is so small. It would still beat sequel_pluck. Interestingly, “fast_pluck” (an Active Record patch we have) would get us a lot closer to “sequel_pluck”.

One interesting reason the “magic” pattern is not adoptable by Rails is that it keeps the whole result set around. I am investigating adding to the PG gem a way of yanking out a single object that defers casts and just stores the data temporarily in an efficient cstring; this would be the ideal building block for a specific AR PGResult type object.

There is lots of interesting work here. The nice thing about having a problem like this is that we can do a lot of things to shrink it; when you are fighting for the last 1% it is super hard in comparison.

Ohad Dahan over 5 years ago

@sam
When taking the code from the gist and changing the column to created_at the object count jumps to 2K, which is expected.

As you suggested, I tried using the PG native C encoders.

ActiveRecord::Base.connection.raw_connection.type_map_for_results = PG::BasicTypeMapForResults.new ActiveRecord::Base.connection.raw_connection
ActiveRecord::Base.connection.raw_connection.type_map_for_queries = PG::BasicTypeMapForQueries.new ActiveRecord::Base.connection.raw_connection


def my_pluck2(table:, columns:, limit:)
  results = ActiveRecord::Base.connection.raw_connection.async_exec("select #{columns.join(',')} from #{table} limit #{limit}")
  columns.size == 1 ? results.column_values(0) : results.values
end

When plucking one column of type integer or float, the overall number of objects and allocation size is WAY smaller.
Makes sense, since the result is one array which has “primitive” types in it.

For “non-primitive” type columns like strings, times etc., the number is slightly larger than the limit.

When extracting multiple columns, the number of objects is LIMIT * NUMBER_OF_COLUMNS plus a couple of extra objects.

Let me know if you want me to post a gist.

Sam Saffron over 5 years ago

Hi Ohad,

Yes, this makes sense. We are hoping to get this native casting into Rails 6, now that it has properly shipped in PG thanks to Lars.

MiniSql already uses this conditional logic per: mini_sql/connection.rb at main · discourse/mini_sql · GitHub
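
For anyone unfamiliar with it, a minimal mini_sql usage sketch (assuming the gem’s MiniSql::Connection.get and query_single API):

require "mini_sql"

conn = MiniSql::Connection.get(ActiveRecord::Base.connection.raw_connection)
ids  = conn.query_single("select id from topics limit ?", 1000) # 1000 Integers with minimal overhead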

The plan for us is to get a similar pattern into Discourse, deep in the PG adapter for Rails, and help drive it into Rails for version 6.

Ohad Dahan over 5 years ago

Sounds good.

I’m seeing MANY ways to get the data out of PG::Result.

Do you have a benchmark of the various options that you used to select the one in MiniSql? Curious to see how they stack up.

Sam Saffron over 5 years ago

I had a few benchmarks here: https://github.com/discourse/mini_sql/tree/master/bench that I used to test out stuff

Ohad Dahan over 5 years ago

@sam FYI, this blog post made me and a friend write:
Nativepluck / nativepluck · GitLab, which will be released very soon.
The functionality (and code) is rather simple, but the improvements are great.

Nativeson / nativeson · GitLab was written due to a different article we read; the basic idea is similar: use native PG capabilities.

Feedback, good/bad/ugly will be appreciated.

Sam Saffron over 5 years ago

Thanks @ohaddahan looks very interesting.

Regarding nativepluck, looking at:

It feels a bit simplistic; I wonder if it will work when you are plucking out of a complex relation.

Overall our strategy here with the pluck optimisation should be to try and drive it into Rails even behind a flag. DHH and Aaron are completely open to having perf improvements here.

Nativeson is a very interesting experiment, I can see how it can help you a lot with complex serialization.

Ohad Dahan over 5 years ago

Adding nativepluck to ActiveRecord doesn’t cover all possible queries.
But we also have a Nativepluck.nativepluck method that will run to_sql on a query.
Something like:
Original: Model.where.limit.joins.pluck
New: Nativepluck.nativepluck( Model.where.limit.joins.select )

This way, we use ActiveRecord to build the query (we didn’t aim to create a full-blown query builder; that part of ActiveRecord works great), but the actual execution and casting is done by us.
We wanted to make it fully compatible with ActiveRecord but it was too much of a hassle; ActiveRecord has some serious auto-magic in it that we still haven’t figured out :slight_smile:
(but if you have some pointers, we’ll be happy to expand it)

Ohad Dahan over 5 years ago

Regarding nativeson, it doesn’t have to be a complex query, although it does support them.
We just skip the whole query => ActiveRecord => JSON path and go straight from query => JSON.
As you can see by the benchmarks, HUGE differences.

Ohad Dahan over 5 years ago

@sam
We’re about to merge to master a complete plug and play version of nativepluck.

It will support the exact same usage as pluck and we even support a mode to override pluck with nativepluck to reduce code changes needed.

Some of the latest benchmarks in an easy-to-read format:
https://github.com/ohaddahan/csv_depo/tree/master

It all started with the following line intriguing me:
“We also should start leveraging the PG gem’s native type casting to avoid pulling strings out of the database only to convert them back to numbers”

:slight_smile:

Tim Haines 36 days ago

Just came across this blog post, and figured I’d try rerunning it with rails 7.1.3 to see what’s changed.

It looks like my hardware performs about twice as fast as your benchmarks, except for activerecord, which is only half as fast. Has AR got 4x worse at this task in the interim?

Calculating -------------------------------------
               magic    516.739  (± 2.7%) i/s -      2.596k in   5.027783s
                  ar     31.867  (± 6.3%) i/s -    159.000  in   5.002257s
           ar_select    431.722  (± 5.3%) i/s -      2.160k in   5.017038s
            ar_pluck      4.305k (± 3.3%) i/s -     21.726k in   5.053700s
                 raw      7.612k (± 4.5%) i/s -     38.372k in   5.052593s
             raw_all    569.274  (± 2.3%) i/s -      2.856k in   5.019914s

I’m on a MacBook Pro M2, Ruby 3.2.2, Rails 7.1.3

Sam Saffron 36 days ago

Not impossible that AR has regressed here further; I have not been tracking it. We still use this at Discourse:

I wish we did not have to, but I am not sure how to land this in Rails.

Another change that has happened since is that we released:

Which is the piece we use for any performance sensitive work and totally outperforms almost anything you can throw at it.

Maybe run the bench on some earlier versions 7/6/5 and see what happens?

