In this post I would like to cover a severe, extremely hard to debug vector for memory bloat in Ruby which can be triggered by the PG gem.


I have blogged a bit about the Ruby GC previously and covered some basics about malloc triggering GC runs. Over the years much in that blog post has been addressed in Ruby including dynamically growing malloc limits that mean we very rarely would need to amend malloc related GC vars.

As an aside, the only GC var Discourse still overrides is RUBY_GLOBAL_METHOD_CACHE_SIZE for reasons that are specified in the Shopify blog post by Scott Francis.

The GC in Ruby can be triggered by 2 different types of conditions.

  1. We are out of space in our managed heaps.

  2. We detected that data associated with Ruby objects via malloc calls has grown beyond a certain threshold.

In this blog post I am covering (2) and demonstrating what happens when Ruby is not aware of malloc calls.

Why malloc calls can trigger a GC?

When reading through GC.stat we may be a bit surprised to see the amount of malloc related accounting:

malloc_increase_bytes
malloc_increase_bytes_limit
oldmalloc_increase_bytes
oldmalloc_increase_bytes_limit

We keep track of the amount of memory allocated using malloc, if it hits the malloc_increase_bytes_limit we will trigger a minor GC.

When we promote an object to the old generation we also try to estimate how much malloc increased since the last major GC. This way when we promote large objects from a young heap to an old heap we have a chance to GC as soon oldmalloc_increase_bytes_limit is hit.

The oldmalloc_increase_bytes_limit and malloc_increase_bytes_limit dynamically size themselves growing as we hit GCs due to malloc limits.

Seeing this in action

Having this in place allows us to run code like this without bloating memory:

def count_malloc(desc)
  start = GC.stat[:malloc_increase_bytes]
  yield
  delta = GC.stat[:malloc_increase_bytes] - start
  puts "#{desc} allocated #{delta} bytes"
end

def process_rss
  puts 'RSS is: ' + `ps -o rss -p #{$$}`.chomp.split("\n").last
end

def malloc_limits
  s = GC.stat
  puts "malloc limit #{s[:malloc_increase_bytes_limit]}, old object malloc limit #{s[:oldmalloc_increase_bytes_limit]}"
end

puts "start RSS/limits"
process_rss
malloc_limits

count_malloc("100,000 byte string") do
  "x" * 100_000
end

x = []
10_000.times do |i|
  x[i%10]  = "x" * 100_000
end

puts "RSS/limits after allocating 10k 100,000 byte string"
malloc_limits
process_rss

Result is:

start RSS/limits
RSS is: 11692

malloc limit 16777216, old object malloc limit 16777216
100,000 byte string allocated 103296 bytes

RSS/limits after allocating 10k 100,000 byte string
malloc limit 32883343, old object malloc limit 78406160

RSS is: 42316

The key figures to watch here is.

  1. malloc_increase_bytes_limit starts at 16MB and moves up to 32MB

  2. oldmalloc_increase_bytes_limit starts at 16MB and moves up to 78MB

  3. RSS moves up from 11MB to 42MB

To recap this is a fairly well behaved non bloated process, despite allocating pretty gigantic objects (strings that have 100,000 bytes in them) and retaining a handful (10).

This is what we want and it gets a stamp of approval!

image

Where malloc accounting falls over!

Ruby does not “monkey patch” the libc malloc function to figure out how much memory got allocated.

It requires c extension authors to be very careful about how they allocate memory, in particular extension authors are expected to use all sorts of helper macros and functions when allocating and converting memory that will be tied to Ruby objects.

Unfortunately, some gems that package up c libraries do not use the helpers in some cases. This is often nobody’s explicit fault, but a culmination of a very sad series of coincidences.

I have been looking at improving Active Record performance recently and was very surprised to see this pattern everywhere:

Every time we are running a piece of SQL and getting a perfectly good PG::Result back we convert it to an array of arrays that is 100% materialized and manually discard the PG::Result object. Why is this?

Turns out, this is there for a very good reason ™

If we adapt our sample to use the PG gem to allocate the strings we see this:


require 'bundler/inline'

gemfile do
  source 'https://rubygems.org'
  gem 'pg'
end

require 'pg'

conn = PG.connect(dbname: 'test_db')
sql = "select repeat('x', $1)"

# simulate a Rails app by long term retaining 400_000 objects

puts "start RSS/limits"
process_rss
malloc_limits

count_malloc("100,000 bytes PG") do
  conn.exec(sql, [100_000])
end

x = []
10_000.times do |i|
  r = x[i%10] = conn.exec(sql, [100_000])
  r.clear
end

puts "RSS/limits after allocating 10k 100,000 byte strings in libpq (and clearing)"
malloc_limits
process_rss

10_000.times do |i|
  x[i%10] = conn.exec(sql, [100_000])
end

puts "RSS/limits after allocating 10k 100,000 byte strings in libpq (and NOT clearing)"
malloc_limits
process_rss

We get this:

start RSS/limits
RSS is: 27392
malloc limit 16777216, old object malloc limit 16777216
100,000 bytes PG allocated 960 bytes
RSS/limits after allocating 10k 100,000 byte strings in libpq (and clearing)
malloc limit 16777216, old object malloc limit 16777216
RSS is: 27636
RSS/limits after allocating 10k 100,000 byte strings in libpq (and NOT clearing)
malloc limit 16777216, old object malloc limit 16777216
RSS is: 295500

:warning: our RSS just jumped to 295MB when we forgot to run #clear on the results PG gave us!!!

Further more we can make the problem WAY worse if we simulate a Rails App by growing our Ruby heaps first with:

$long_term = []
400_000.times do
  $long_term << +""
end

If we run that code first we reach 1GB of RSS after “forgetting” to clear our PG::Result object!

:fearful: We can see PG allocated 100,000 bytes but Ruby was only aware of 960.

Aaron Patterson has been aware of this issue for many years, in fact he has attempted to patch libpq the library that powers the PG gem so it can handle this exact case gracefully.

See: PostgreSQL: Custom allocators in libpq

So where does this leave us?

At Discourse we notice occasional bloat in our Sidekiq process. This is despite being extremely careful to run a specific version of jemalloc that tames memory quite a bit.

Now that I am aware of this vector I do have my suspicion that some “Raw SQL” helpers we have lurking in Discourse can cause this issue. In particular we have places that return results directly in a PG::Result object. In Sidekiq, under heavy concurrency with a very large heap these objects can sneak into the old generation and be retained for way too long leading to process bloat.

This thorn also makes it very hard for us to tame Active Record memory usage cause we are stuck relying on copying entire result sets so we can stay safe, which is a very high priority for Rails.

That said, I have not given up quite yet and see quite a few paths forward. (none of which conflict):


It would be nice to drive Aaron’s patch home, if libpq provided better hooks for memory allocation we could nip this problem at the bud.

Advantages

  • This would resolve the problem at the source

Disadvantages

  • Even if this is accepted today it will be many years till people can lean on this, requires a new version of libpq many people run 5 year old versions of it.

It would be nice to have an API in libpq that allows us to interrogate how many bytes are allocated to a result it returns.

Advantages

  • This would resolve the problem at the source.
  • A much easier patch to land in libpq.
  • Ruby 2.4 and up have rb_gc_adjust_memory_usage, per #12690, so it is simple to make this change. (Thanks Eric for the tip)

Disadvantages

  • Same as above, will take many years till people can use it.

PG gem can add a Lazy results object.
In this case we simply extend the PG gem API to return a copy of the results provided by libpq that allocates significantly less Ruby objects. Then once we have the copy we can clear the result we get from libpq.

For example:

r = pg.exec('select * from table')
rs = r.result_set
r.clear

# at this point only 2 RVALUEs are allocated. 
# the new ResultSet object has internal c level storage
# pointing at an array of strings, and an API for access where it defer creates
# objects

row = rs[1]

### ResultSetRow is allocated, it also only allocates 1 RVALUE

row["abc"] # allocates a new RVALUE or returns a cached internal instance 
row[1] # same

rs.get(1,100) # same as above

Advantages

  • This drops in to ActiveRecord and other ORMs as the best practice for grabbing data if #clear is not guaranteed

  • Reasonably efficient, only allocates a very minimal number of Ruby objects

  • We can start using this very soon

Disadvantages

  • We are forced to make memory copies of results returned via PG, this has a non zero cost (I suspect it is not too high though compared to 1000s of Ruby objects that need to be garbage collected with #values calls)

Build tooling to detect this problem in production apps! It would be amazing if when we saw a Ruby app that is bloated in memory we could run a simple diagnostic on it to figure out where the bloat is coming from.

  • Is the bloat there due to glibc arenas?

  • Is the bloat there cause Ruby is not aware of a bunch of allocated memory?

  • Is the bloat there due to a simple managed leak, eg: an ever growing array?

It is a hard problem to solve though. jemalloc does provide a lot of internal diagnostics, so we could look at the delta between what jemalloc has allocated and what Ruby knows about!

Advantages

  • Would increase visibility of this problem and the family of related problems and allow us to alert various gem authors if they are impacted by it.

Disadvantages

  • Hard to build and may require a custom startup.

What we are doing?

I have invested many hours investigating these issues. Discourse is actively investing in improving the memory story in Ruby. Together with Shopify and Appfolio we are sponsoring Eric Wong to experiment and improve Ruby for the next few months.

Discourse are also looking to throw more dollars behind a project to heavily improve Active Record for the 6.0 release which I plan to blog about soon. We also plan to extract, improve, formalize and share our built in raw SQL helpers.

I hope you found this helpful and as always, enjoy!

Comments

K Sec 6 days ago
K Sec

I think Active Record 6 were also aiming to have Multi-DB by default. Making scaling easy. If it also gain some speed boost it will be a huge win. I think right now Active Record is what causes most of the performance problems.

Using Fortune Cookies, which is the closest to real world workload, assuming only Full Stack frameworks and Full stack ORM, we can see the bottom and top has a 40x difference. If we only look at Play 2, which I think is closer to Rails, we still have roughly 10x difference. Not saying Active Record will ever achieve that, but there seems to be lots of room for improvement.

I am sure a lot of people, companies will also be willing to throw in some money for a heavily improved Active Record.

Kenn Ejima 5 days ago
Kenn Ejima

Is this something we should be worried about on mysql2 also?

Sam Saffron 5 days ago
Sam Saffron

@Kenn I am not sure I recommend running similar tests to the one’s in the blog post to confirm.

@K_Sec I am not sure about making too many assumptions based on those benchmarks, for example the Rails bench runs via NGINX, is still on Rails 4.2, does not have the same perf optimizations when sorting rows that other implementations have and so on.

From our usage at Discourse despite our pains with AR performance we still have our general bottleneck be the database vs the ORM. Yes we want to reduce memory usage and ORM overhead, but … when you are talking total of 50ms in SQL and 40ms in ActiveRecord extras, sure the cost is high, but even if you move it down to 0ms you still have 50ms in SQL.

I strongly disagree “fortune cookies” is “closest to real world workload”. It is running a single query to a tiny table. This is not indicative of real world workload.

Marat Amerov 5 days ago
Marat Amerov

Maybe fresh libpq should ship with pg gem and optionally don’t use libpq from system package?

Noah Gibbs 4 days ago
Noah Gibbs

Hey, Sam! Just did a quick-and-dirty port of this to MySQL, trying to interpret the numbers. I’m also not seeing an obvious equivalent to .clear for the mysql2 gem, but I’m probably just missing it.

Here’s the code: mysql_bloat_test/mb_test.rb at master · noahgibbs/mysql_bloat_test · GitHub

Here’s what I see on my Mac laptop (without clear):

start RSS/limits
RSS is:  20228
malloc limit 16777216, old object malloc limit 16777216

100,000 bytes MySQL allocated 102816 bytes
RSS/limits after allocating 10k 100,000 byte strings in mysql2 (with no #clear or equivalent)

malloc limit 25288021, old object malloc limit 33438324
RSS is: 124796

That looks like not the full RSS bloat (note: this is without the 400k objects to simulate Rails), but quite a lot of increase in the malloc/oldmalloc limits. The mysql2 code also looks like it registers more of its memory with Ruby (e.g. mysql2/result.c at master · brianmario/mysql2 · GitHub), so maybe that’s helping?

I’ll see if I can figure out the equivalent of .clear and get back to you with more numbers.

Noah Gibbs 4 days ago
Noah Gibbs

Okay, looks like it’s .free for mysql2. Here’s what I get by replacing .clear with .free:

start RSS/limits
RSS is:  20644
malloc limit 16777216, old object malloc limit 16777216

100,000 bytes MySQL allocated 102816 bytes

RSS/limits after allocating 10k 100,000 byte strings in mysql2 (with .free)
malloc limit 25288021, old object malloc limit 33438324
RSS is: 124128

RSS/limits after allocating 10k 100,000 byte strings in mysql2 (with NO .free)
malloc limit 26330719, old object malloc limit 38567847
RSS is: 133072

I’ve pushed the source version that does it that way.

Sam Saffron 4 days ago
Sam Saffron

To me it looks like MySQL is behaving pretty nicely, the objects it creates are heavier than simple String RVALUEs however it is giving plenty of signal to the Ruby GC.

Sam Saffron 4 days ago
Sam Saffron

This is a very very interesting idea, though picking a target may be a bit tricky, afaik certain features are added depending on libpq, but if there is a minimal supported version this could work, for sure. Windows is probably going to be a problem in this department.


comments powered by Discourse