Ruby's external malloc problem
I have blogged a bit about the Ruby GC previously and covered some basics about malloc triggering GC runs. Over the years much in that blog post has been addressed in Ruby, including dynamically growing malloc limits, which means we very rarely need to amend malloc-related GC vars.
As an aside, the only GC var Discourse still overrides is RUBY_GLOBAL_METHOD_CACHE_SIZE for reasons that are specified in the Shopify blog post by Scott Francis.
The GC in Ruby can be triggered by 2 different types of conditions:
1. We are out of space in our managed heaps.
2. We detected that data associated with Ruby objects via malloc calls has grown beyond a certain threshold.
In this blog post I am covering (2) and demonstrating what happens when Ruby is not aware of malloc calls.
Why can malloc calls trigger a GC?
When reading through GC.stat we may be a bit surprised to see the amount of malloc-related accounting:
malloc_increase_bytes
malloc_increase_bytes_limit
oldmalloc_increase_bytes
oldmalloc_increase_bytes_limit
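For example, a quick way to peek at just these counters (a minimal one-liner; the exact keys vary slightly between Ruby versions):

require 'pp'
# print only the malloc-related GC counters
pp GC.stat.select { |k, _| k.to_s.include?("malloc") }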
We keep track of the amount of memory allocated using malloc; if it hits the malloc_increase_bytes_limit we will trigger a minor GC.
When we promote an object to the old generation we also try to estimate how much malloc increased since the last major GC. This way, when we promote large objects from a young heap to an old heap, we have a chance to GC as soon as oldmalloc_increase_bytes_limit is hit.
The oldmalloc_increase_bytes_limit and malloc_increase_bytes_limit dynamically size themselves, growing as we hit GCs due to malloc limits.
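We can ask Ruby directly what triggered the most recent GC via GC.latest_gc_info, which reports :malloc when this accounting fired the collection. A minimal sketch (what you actually see depends on Ruby version, prior GC state and any GC env var overrides):

GC.start # collect now, so the "latest" GC info below reflects what happens next
320.times { "x" * 100_000 } # roughly 32MB of malloc'd string data, well past the default 16MB limit
p GC.latest_gc_info[:gc_by]             # most likely :malloc, i.e. malloc accounting triggered the GC
p GC.stat[:malloc_increase_bytes_limit] # may have grown beyond its 16MB starting point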
Seeing this in action
Having this in place allows us to run code like this without bloating memory:
def count_malloc(desc)
  # report how many malloc'd bytes Ruby accounted for while the block ran
  start = GC.stat[:malloc_increase_bytes]
  yield
  delta = GC.stat[:malloc_increase_bytes] - start
  puts "#{desc} allocated #{delta} bytes"
end

def process_rss
  # resident set size of the current process, in kilobytes
  puts 'RSS is: ' + `ps -o rss -p #{$$}`.chomp.split("\n").last
end

def malloc_limits
  s = GC.stat
  puts "malloc limit #{s[:malloc_increase_bytes_limit]}, old object malloc limit #{s[:oldmalloc_increase_bytes_limit]}"
end

puts "start RSS/limits"
process_rss
malloc_limits

count_malloc("100,000 byte string") do
  "x" * 100_000
end

x = []
10_000.times do |i|
  x[i % 10] = "x" * 100_000
end

puts "RSS/limits after allocating 10k 100,000 byte string"
malloc_limits
process_rss
Result is:
start RSS/limits
RSS is: 11692
malloc limit 16777216, old object malloc limit 16777216
100,000 byte string allocated 103296 bytes
RSS/limits after allocating 10k 100,000 byte string
malloc limit 32883343, old object malloc limit 78406160
RSS is: 42316
The key figures to watch here are:
- malloc_increase_bytes_limit starts at 16MB and moves up to 32MB
- oldmalloc_increase_bytes_limit starts at 16MB and moves up to 78MB
- RSS moves up from 11MB to 42MB
To recap, this is a fairly well-behaved, non-bloated process, despite allocating pretty gigantic objects (strings that have 100,000 bytes in them) and retaining a handful (10) of them.
This is what we want and it gets a stamp of approval!
Where malloc accounting falls over!
Ruby does not “monkey patch” the libc malloc function to figure out how much memory got allocated.
It requires C extension authors to be very careful about how they allocate memory; in particular, extension authors are expected to use all sorts of helper macros and functions when allocating and converting memory that will be tied to Ruby objects.
Unfortunately, some gems that package up C libraries do not use the helpers in some cases. This is often nobody’s explicit fault, but the culmination of a very sad series of coincidences.
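We do not even need a C extension to watch this blind spot in action. Here is a minimal sketch using the Fiddle standard library to grab memory straight from libc malloc; Ruby's accounting barely moves while RSS balloons (process_rss is the helper defined earlier):

require 'fiddle'

before = GC.stat[:malloc_increase_bytes]

size = 100 * 1024 * 1024        # 100MB straight from libc malloc, invisible to Ruby's GC accounting
addr = Fiddle.malloc(size)
ptr = Fiddle::Pointer.new(addr, size)

chunk = "x" * (1024 * 1024)     # touch every page so the memory actually shows up in RSS
(size / chunk.bytesize).times { |i| ptr[i * chunk.bytesize, chunk.bytesize] = chunk }

# Ruby only accounted for the small temporary chunk string, nowhere near 100MB
puts "Ruby accounted for #{GC.stat[:malloc_increase_bytes] - before} bytes"
process_rss                     # RSS is roughly 100MB higher

Fiddle.free(addr)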
I have been looking at improving Active Record performance recently and was very surprised to see this pattern everywhere:
Every time we are running a piece of SQL and getting a perfectly good PG::Result back, we convert it to an array of arrays that is 100% materialized and manually discard the PG::Result object. Why is this?
Turns out, this is there for a very good reason ™
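The shape of that pattern looks roughly like this; a hypothetical helper, not the actual Rails source (select_all_rows is a made-up name):

def select_all_rows(connection, sql, binds = [])
  result = connection.exec_params(sql, binds)
  result.values    # fully materialize an Array of Arrays of Ruby strings
ensure
  result&.clear    # eagerly hand the libpq-allocated memory back
end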
If we adapt our sample to use the PG gem to allocate the strings we see this:
require 'bundler/inline'

gemfile do
  source 'https://rubygems.org'
  gem 'pg'
end

require 'pg'

# (the count_malloc / process_rss / malloc_limits helpers from the first script are assumed)

conn = PG.connect(dbname: 'test_db')
sql = "select repeat('x', $1)"

# simulate a Rails app by long term retaining 400_000 objects (disabled here; see below)

puts "start RSS/limits"
process_rss
malloc_limits

count_malloc("100,000 bytes PG") do
  conn.exec_params(sql, [100_000])
end

x = []
10_000.times do |i|
  r = x[i % 10] = conn.exec_params(sql, [100_000])
  r.clear
end

puts "RSS/limits after allocating 10k 100,000 byte strings in libpq (and clearing)"
malloc_limits
process_rss

10_000.times do |i|
  x[i % 10] = conn.exec_params(sql, [100_000])
end

puts "RSS/limits after allocating 10k 100,000 byte strings in libpq (and NOT clearing)"
malloc_limits
process_rss
We get this:
start RSS/limits
RSS is: 27392
malloc limit 16777216, old object malloc limit 16777216
100,000 bytes PG allocated 960 bytes
RSS/limits after allocating 10k 100,000 byte strings in libpq (and clearing)
malloc limit 16777216, old object malloc limit 16777216
RSS is: 27636
RSS/limits after allocating 10k 100,000 byte strings in libpq (and NOT clearing)
malloc limit 16777216, old object malloc limit 16777216
RSS is: 295500
Our RSS just jumped to 295MB when we forgot to run #clear on the results PG gave us!!!
Furthermore, we can make the problem WAY worse if we simulate a Rails app by growing our Ruby heaps first with:
$long_term = []
400_000.times do
  $long_term << +""
end
If we run that code first we reach 1GB of RSS after “forgetting” to clear our PG::Result object!
We can see PG allocated 100,000 bytes but Ruby was only aware of 960.
Aaron Patterson has been aware of this issue for many years; in fact, he has attempted to patch libpq, the library that powers the PG gem, so it can handle this exact case gracefully.
See: PostgreSQL: Custom allocators in libpq
So where does this leave us?
At Discourse we notice occasional bloat in our Sidekiq process. This is despite being extremely careful to run a specific version of jemalloc that tames memory quite a bit.
Now that I am aware of this vector I do have my suspicion that some “Raw SQL” helpers we have lurking in Discourse can cause this issue. In particular, we have places that return results directly in a PG::Result object. In Sidekiq, under heavy concurrency with a very large heap, these objects can sneak into the old generation and be retained for way too long, leading to process bloat.
This thorn also makes it very hard for us to tame Active Record memory usage, because we are stuck relying on copying entire result sets so we can stay safe, and staying safe is a very high priority for Rails.
That said, I have not given up quite yet and see quite a few paths forward (none of which conflict):
It would be nice to drive Aaron’s patch home; if libpq provided better hooks for memory allocation we could nip this problem in the bud.
Advantages
- This would resolve the problem at the source
Disadvantages
- Even if this is accepted today, it will be many years until people can lean on it; it requires a new version of libpq, and many people run 5-year-old versions of it.
It would be nice to have an API in libpq that allows us to interrogate how many bytes are allocated to a result it returns.
Advantages
- This would resolve the problem at the source.
- A much easier patch to land in libpq.
- Ruby 2.4 and up have rb_gc_adjust_memory_usage (per #12690), so it is simple to make this change. (Thanks Eric for the tip)
Disadvantages
- Same as above: it will take many years until people can use it.
The PG gem can add a Lazy results object.
In this case we simply extend the PG gem API to return a copy of the results provided by libpq that allocates significantly fewer Ruby objects. Then, once we have the copy, we can clear the result we get from libpq.
For example:
r = pg.exec('select * from table')
rs = r.result_set
r.clear

# at this point only 2 RVALUEs are allocated.
# the new ResultSet object has internal c level storage
# pointing at an array of strings, and an API for access where it
# defer creates objects

row = rs[1]    # ResultSetRow is allocated, it also only allocates 1 RVALUE

row["abc"]     # allocates a new RVALUE or returns a cached internal instance
row[1]         # same
rs.get(1, 100) # same as above
Advantages
- This drops into Active Record and other ORMs as the best practice for grabbing data if #clear is not guaranteed
- Reasonably efficient, only allocates a very minimal number of Ruby objects
- We can start using this very soon
Disadvantages
- We are forced to make memory copies of results returned via PG; this has a non-zero cost (I suspect it is not too high though, compared to the 1000s of Ruby objects that need to be garbage collected with #values calls)
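To get a rough feel for that cost, we can reuse the count_malloc helper and the conn connection from the snippets above to see how much Ruby-side allocation a full #values materialization triggers (a sketch; the query is illustrative and the numbers scale with result size):

r = conn.exec_params("select repeat('x', $1) from generate_series(1, 1000)", [100])
count_malloc("materializing 1,000 rows via #values") { r.values }
r.clear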
Build tooling to detect this problem in production apps! It would be amazing if, when we saw a Ruby app that is bloated in memory, we could run a simple diagnostic on it to figure out where the bloat is coming from.
- Is the bloat there due to glibc arenas?
- Is the bloat there because Ruby is not aware of a bunch of allocated memory?
- Is the bloat there due to a simple managed leak, e.g. an ever-growing array?
It is a hard problem to solve though. jemalloc does provide a lot of internal diagnostics, so we could look at the delta between what jemalloc has allocated and what Ruby knows about!
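A very rough first cut of such a diagnostic is possible with nothing but the standard library: compare what Ruby can account for against the RSS the OS reports. A minimal sketch (ObjectSpace.memsize_of_all is approximate and walking the whole heap is slow, so this is strictly a debugging aid):

require 'objspace'

# approximate number of bytes Ruby can account for across all live objects
known_bytes = ObjectSpace.memsize_of_all
rss_kb = `ps -o rss -p #{$$}`.split("\n").last.to_i

puts "RSS: #{rss_kb}KB, Ruby accounts for roughly #{known_bytes / 1024}KB"
puts "unaccounted for (fragmentation, allocator overhead or external malloc): #{rss_kb - known_bytes / 1024}KB"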
Advantages
- Would increase visibility of this problem and the family of related problems and allow us to alert various gem authors if they are impacted by it.
Disadvantages
- Hard to build and may require a custom startup.
What are we doing?
I have invested many hours investigating these issues. Discourse is actively investing in improving the memory story in Ruby. Together with Shopify and Appfolio we are sponsoring Eric Wong to experiment with and improve Ruby for the next few months.
Discourse are also looking to throw more dollars behind a project to heavily improve Active Record for the 6.0 release which I plan to blog about soon. We also plan to extract, improve, formalize and share our built in raw SQL helpers.
I hope you found this helpful and as always, enjoy!
I think Active Record 6 was also aiming to have multi-DB support by default, making scaling easy. If it also gains some speed boost it will be a huge win. I think right now Active Record is what causes most of the performance problems.
Using Fortune Cookies, which is the closest to a real-world workload, and assuming only full-stack frameworks and a full-stack ORM, we can see the bottom and top have a 40x difference. If we only look at Play 2, which I think is closer to Rails, we still have roughly a 10x difference. Not saying Active Record will ever achieve that, but there seems to be lots of room for improvement.
I am sure a lot of people and companies would also be willing to throw in some money for a heavily improved Active Record.