At some point in the life of every Rails developer you are bound to hit a memory leak. It may be tiny amount of constant memory growth, or a spurt of growth that hits you on the job queue when certain jobs run.

Sadly, most Ruby devs out there simply employ monit , inspeqtor or unicorn worker killers. This allows you to move along and do more important work, tidily sweeping the problem snugly under the carpet.

Unfortunately, this approach leads to quite a few bad side effects. Besides performance pains, instability and larger memory requirements it also leaves the community with a general lack of confidence in Ruby. Monitoring and restarting processes is a very important tool for your arsenal, but at best it is a stopgap and safeguard. It is not a solution.

We have some amazing tools out there for attacking and resolving memory leaks, especially the easy ones - managed memory leaks.

Are we leaking ?

The first and most important step for dealing with memory problems is graphing memory over time. At Discourse we use a combo of Graphite , statsd and Grafana to graph application metrics.

A while back I packaged up a Docker image for this work which is pretty close to what we are running today. If rolling your own is not your thing you could look at New Relic, Datadog or any other cloud based metric provider. The key metric you first need to track is RSS for you key Ruby processes. At Discourse we look at max RSS for Unicorn our web server and Sidekiq our job queue.

Discourse is deployed on multiple machines in multiple Docker containers. We use a custom built Docker container to watch all the other Docker containers. This container is launched with access to the Docker socket so it can interrogate Docker about Docker. It uses docker exec to get all sorts of information about the processes running inside the container.
Note: Discourse uses the unicorn master process to launch multiple workers and job queues, it is impossible to achieve the same setup (which shares memory among forks) in a one container per process world.

With this information at hand we can easily graph RSS for any container on any machine and look at trends:

Long term graphs are critical to all memory leak analysis. They allow us to see when an issue started. They allow us to see the rate of memory growth and shape of memory growth. Is it erratic? Is it correlated to a job running?

When dealing with memory leaks in c-extensions, having this information is critical. Isolating c-extension memory leaks often involves valgrind and custom compiled versions of Ruby that support debugging with valgrind. It is tremendously hard work we only want to deal with as a last resort. It is much simpler to isolate that a trend started after upgrading EventMachine to version 1.0.5.

Managed memory leaks

Unlike unmanaged memory leaks, tackling managed leaks is very straight forward. The new tooling in Ruby 2.1 and up makes debugging these leaks a breeze.

Prior to Ruby 2.1 the best we could do was crawl our object space, grab a snapshot, wait a bit, grab a second snapshot and compare. I have a basic implementation of this shipped with Discourse in MemoryDiagnostics, it is rather tricky to get this to work right. You have to fork your process when gathering the snapshots so you do not interfere with your process and the information you can glean is fairly basic. We can tell certain objects leaked, but can not tell where they were allocated:

3377 objects have leaked
Summary:
String: 1703
Bignum: 1674

Sample Items:
Bignum: 40 bytes
Bignum: 40 bytes
String: 
Bignum: 40 bytes
String: 
Bignum: 40 bytes
String:

If we were lucky we would have leaked a number or string that is revealing and that would be enough to nail it down.

Additionally we have GC.stat that could tell us how many live objects we have and so on.

The information was very limited. We can tell we have a leak quite clearly, but finding the reason can be very difficult.

Note: a very interesting metric to graph is GC.stat[:heap_live_slots] with that information at hand we can easily tell if we have a managed object leak.

Managed heap dumping

Ruby 2.1 introduced heap dumping, if you also enable allocation tracing you have some tremendously interesting information.

The process for collecting a useful heap dump is quite simple:

Turn on allocation tracing

require 'objspace'
ObjectSpace.trace_object_allocations_start

This will slow down your process significantly and cause your process to consume more memory. However, it is key to collecting good information and can be turned off later. For my analysis I will usually run it directly after boot.

When debugging my latest round memory issues with SideKiq at Discourse I deployed an extra docker image to a spare server. This allowed me extreme freedom without impacting SLA.

Next, play the waiting game

After memory has clearly leaked, (you can look at GC.stat or simply watch RSS to determine this happened) run:

io=File.open("/tmp/my_dump", "w")
ObjectSpace.dump_all(output: io); 
io.close

Running Ruby in an already running process

To make this process work we need to run Ruby inside a process that has already started.

Luckily, the rbtrace gem allows us to do that (and much more). Moreover it is safe to run in production.

We can easily force SideKiq to dump a its heap with:

bundle exec rbtrace -p $SIDEKIQ_PID -e 'Thread.new{GC.start;require "objspace";io=File.open("/tmp/ruby-heap.dump", "w"); ObjectSpace.dump_all(output: io); io.close}'

rbtrace runs in a restricted context, a nifty trick is breaking out of the trap context with Thread.new

We can also crack information out of the box with rbtrace, for example:

bundle exec rbtrace -p 6744 -e 'GC.stat'
/usr/local/bin/ruby: warning: RUBY_HEAP_MIN_SLOTS is obsolete. Use RUBY_GC_HEAP_INIT_SLOTS instead.
*** attached to process 6744
>> GC.stat
=> {:count=>49, :heap_allocated_pages=>1960, :heap_sorted_length=>1960, :heap_allocatable_pages=>12, :heap_available_slots=>798894, :heap_live_slots=>591531, :heap_free_slots=>207363, :heap_final_slots=>0, :heap_marked_slots=>335775, :heap_swept_slots=>463124, :heap_eden_pages=>1948, :heap_tomb_pages=>12, :total_allocated_pages=>1960, :total_freed_pages=>0, :total_allocated_objects=>13783631, :total_freed_objects=>13192100, :malloc_increase_bytes=>32568600, :malloc_increase_bytes_limit=>33554432, :minor_gc_count=>41, :major_gc_count=>8, :remembered_wb_unprotected_objects=>12175, :remembered_wb_unprotected_objects_limit=>23418, :old_objects=>309750, :old_objects_limit=>618416, :oldmalloc_increase_bytes=>32783288, :oldmalloc_increase_bytes_limit=>44484250}
*** detached from process 6744

Analyzing the heap dump

With a rich heap dump at hand we can start analysis, a first report to run is looking at the count of object per GC generation.

When trace object allocation is enabled the runtime will attach rich information next to all allocations. For each object that is allocated while tracing is on we have:

  1. The GC generation it was allocated in
  2. The filename and line number it was allocated in
  3. A truncated value
  4. bytesize
  5. ... and much more

The file is in json format and can easily be parsed line by line. EG:

{"address":"0x7ffc567fbf98", "type":"STRING", "class":"0x7ffc565c4ea0", "frozen":true, "embedded":true, "fstring":true, "bytesize":18, "value":"ensure in dispatch", "file":"/var/www/discourse/vendor/bundle/ruby/2.2.0/gems/activesupport-4.1.9/lib/active_support/dependencies.rb", "line":247, "method":"require", "generation":7, "memsize":40, "flags":{"wb_protected":true, "old":true, "long_lived":true, "marked":true}}

A simple report that shows how many objects are left around from which GC generation is a great start for looking at memory leaks. Its a timeline of object leakage.

require 'json'
class Analyzer
  def initialize(filename)
    @filename = filename
  end

  def analyze
    data = []
    File.open(@filename) do |f|
        f.each_line do |line|
          data << (parsed=JSON.parse(line))
        end
    end

    data.group_by{|row| row["generation"]}
        .sort{|a,b| a[0].to_i <=> b[0].to_i}
        .each do |k,v|
          puts "generation #{k} objects #{v.count}"
        end
  end
end

Analyzer.new(ARGV[0]).analyze

For example this is what I started with:

generation  objects 334181
generation 7 objects 6629
generation 8 objects 38383
generation 9 objects 2220
generation 10 objects 208
generation 11 objects 110
generation 12 objects 489
generation 13 objects 505
generation 14 objects 1297
generation 15 objects 638
generation 16 objects 748
generation 17 objects 1023
generation 18 objects 805
generation 19 objects 407
generation 20 objects 126
generation 21 objects 1708
generation 22 objects 369
...

We expect a large number of objects to be retained after boot and sporadically when requiring new dependencies. However we do not expect a consistent amount of objects to be allocated and never cleaned up. So let's zoom into a particular generation:

require 'json'
class Analyzer
  def initialize(filename)
    @filename = filename
  end

  def analyze
    data = []
    File.open(@filename) do |f|
        f.each_line do |line|
          parsed=JSON.parse(line)
          data << parsed if parsed["generation"] == 18
        end
    end
    data.group_by{|row| "#{row["file"]}:#{row["line"]}"}
        .sort{|a,b| b[1].count <=> a[1].count}
        .each do |k,v|
          puts "#{k} * #{v.count}"
        end
  end
end

Analyzer.new(ARGV[0]).analyze

generation 19 objects 407
/usr/local/lib/ruby/2.2.0/weakref.rb:87 * 144
/var/www/discourse/vendor/bundle/ruby/2.2.0/gems/therubyracer-0.12.1/lib/v8/weak.rb:21 * 72
/var/www/discourse/vendor/bundle/ruby/2.2.0/gems/therubyracer-0.12.1/lib/v8/weak.rb:42 * 72
/var/www/discourse/lib/freedom_patches/translate_accelerator.rb:65 * 15
/var/www/discourse/vendor/bundle/ruby/2.2.0/gems/i18n-0.7.0/lib/i18n/interpolate/ruby.rb:21 * 15
/var/www/discourse/lib/email/message_builder.rb:85 * 9
/var/www/discourse/vendor/bundle/ruby/2.2.0/gems/actionview-4.1.9/lib/action_view/template.rb:297 * 6
/var/www/discourse/lib/email/message_builder.rb:36 * 6
/var/www/discourse/lib/email/message_builder.rb:89 * 6
/var/www/discourse/lib/email/message_builder.rb:46 * 6
/var/www/discourse/lib/email/message_builder.rb:66 * 6
/var/www/discourse/vendor/bundle/ruby/2.2.0/gems/activerecord-4.1.9/lib/active_record/connection_adapters/postgresql_adapter.rb:515 * 5

Further more, we can chase down the reference path to see who is holding the references to the various objects and rebuild object graphs.

The first thing I attacked in this particular case was code I wrote, which is a monkey patch to Rails localization.

Why we monkey patch rails localization?

At Discourse we patch the Rails localization subsystem for 2 reasons:

  1. Early on we discovered it was slow and needed better performance
  2. Recently we started accumulating an enormous amount of translations and needed to ensure we only load translations on demand to keep memory usage lower. (this saves us 20MB of RSS)

Consider the following bench:

ENV['RAILS_ENV'] = 'production'
require 'benchmark/ips'

require File.expand_path("../../config/environment", __FILE__)

Benchmark.ips do |b|
  b.report do |times|
    i = -1
    I18n.t('posts') while (i+=1) < times
  end
end

Before monkey patch:

sam@ubuntu discourse % ruby bench.rb
Calculating -------------------------------------
                         4.518k i/100ms
-------------------------------------------------
                        121.230k (±11.0%) i/s -    600.894k

After monkey patch:

sam@ubuntu discourse % ruby bench.rb
Calculating -------------------------------------
                        22.509k i/100ms
-------------------------------------------------
                        464.295k (±10.4%) i/s -      2.296M

So our localization system is running almost 4 times faster, but ... it is leaking memory.

Reviewing the code I could see that the offending line is:

Turns out, we were sending a hash in from the email message builder that includes ActiveRecord objects, this later was used as a key for the cache, the cache was allowing for 2000 entries. Considering that each entry could involve a large number of AR objects memory leakage was very high.

To mitigate, I changed the keying strategy, shrunk the cache and completely bypassed it for complex localizations:

One day later when looking at memory graphs we can easily see the impact of this change:

This clearly did not stop the leak but it definitely slowed it down.

therubyracer is leaking

At the top of our list we can see our JavaScript engine therubyracer is leaking lots of objects, in particular we can see the weak references it uses to maintain Ruby to JavaScript mappings are being kept around for way too long.

To maintain performance at Discourse we keep a JavaScript engine context around for turning Markdown into HTML. This engine is rather expensive to boot up, so we keep it in memory plugging in new variables as we bake posts.

Since our code is fairly isolated a repro is trivial, first let's see how many objects we are leaking using the memory_profiler gem:

ENV['RAILS_ENV'] = 'production'
require 'memory_profiler'
require File.expand_path("../../config/environment", __FILE__)

# warmup
3.times{PrettyText.cook("hello world")}

MemoryProfiler.report do
  50.times{PrettyText.cook("hello world")}
end.pretty_print

At the top of our report we see:

retained objects by location
-----------------------------------
       901  /home/sam/.rbenv/versions/2.1.2.discourse/lib/ruby/2.1.0/weakref.rb:87
       901  /home/sam/.rbenv/versions/2.1.2.discourse/lib/ruby/gems/2.1.0/gems/therubyracer-0.12.1/lib/v8/weak.rb:21
       600  /home/sam/.rbenv/versions/2.1.2.discourse/lib/ruby/gems/2.1.0/gems/therubyracer-0.12.1/lib/v8/weak.rb:42
       250  /home/sam/.rbenv/versions/2.1.2.discourse/lib/ruby/gems/2.1.0/gems/therubyracer-0.12.1/lib/v8/context.rb:97
        50  /home/sam/.rbenv/versions/2.1.2.discourse/lib/ruby/gems/2.1.0/gems/therubyracer-0.12.1/lib/v8/object.rb:8

So we are leaking 54 or so objects each time we cook a post, this adds up really fast. We may also be leaking unmanaged memory here which could compound the problem.

Since we have a line number it is easy to track down the source of the leak

require 'weakref'
class Ref
  def initialize(object)
    @ref = ::WeakRef.new(object)
  end
  def object
    @ref.__getobj__
  rescue ::WeakRef::RefError
    nil
  end
end

class WeakValueMap
   def initialize
      @values = {}
   end

   def [](key)
      if ref = @values[key]
        ref.object
      end
   end

   def []=(key, value)
     @values[key] = V8::Weak::Ref.new(value)
   end
end

This WeakValueMap object keeps growing forever and nothing seems to be clearing values from it. The intention with the WeakRef usage was to ensure we allow these objects to be released when no-one holds references. Trouble is that the reference to the wrapper is now kept around for the entire lifetime of the JavaScript context.

A fix is pretty straight forward:

class WeakValueMap
  def initialize
    @values = {}
  end

  def [](key)
    if ref = @values[key]
      ref.object
    end
  end

  def []=(key, value)
    ref = V8::Weak::Ref.new(value)
    ObjectSpace.define_finalizer(value, self.class.ensure_cleanup(@values, key, ref))

    @values[key] = ref
  end

  def self.ensure_cleanup(values,key,ref)
    proc {
      values.delete(key) if values[key] == ref
    }
  end
end

We define a finalizer on the wrapped object that ensures we clean up the all these wrapping objects and keep on WeakValueMap small.

The effect is staggering:

ENV['RAILS_ENV'] = 'production'
require 'objspace'
require 'memory_profiler'
require File.expand_path("../../config/environment", __FILE__)

def rss
 `ps -eo pid,rss | grep #{Process.pid} | awk '{print $2}'`.to_i
end

PrettyText.cook("hello world")

# MemoryProfiler has a helper that runs the GC multiple times to make
#  sure all objects that can be freed are freed.
MemoryProfiler::Helpers.full_gc
puts "rss: #{rss} live objects #{GC.stat[:heap_live_slots]}"

20.times do

  5000.times { |i|
    PrettyText.cook("hello world")
  }
  MemoryProfiler::Helpers.full_gc
  puts "rss: #{rss} live objects #{GC.stat[:heap_live_slots]}"

end

Before

rss: 137660 live objects 306775
rss: 259888 live objects 570055
rss: 301944 live objects 798467
rss: 332612 live objects 1052167
rss: 349328 live objects 1268447
rss: 411184 live objects 1494003
rss: 454588 live objects 1734071
rss: 451648 live objects 1976027
rss: 467364 live objects 2197295
rss: 536948 live objects 2448667
rss: 600696 live objects 2677959
rss: 613720 live objects 2891671
rss: 622716 live objects 3140339
rss: 624032 live objects 3368979
rss: 640780 live objects 3596883
rss: 641928 live objects 3820451
rss: 640112 live objects 4035747
rss: 722712 live objects 4278779
/home/sam/Source/discourse/lib/pretty_text.rb:185:in `block in markdown': Script Timed Out (PrettyText::JavaScriptError)
	from /home/sam/Source/discourse/lib/pretty_text.rb:350:in `block in protect'
	from /home/sam/Source/discourse/lib/pretty_text.rb:348:in `synchronize'
	from /home/sam/Source/discourse/lib/pretty_text.rb:348:in `protect'
	from /home/sam/Source/discourse/lib/pretty_text.rb:161:in `markdown'
	from /home/sam/Source/discourse/lib/pretty_text.rb:218:in `cook'
	from tmp/mem_leak.rb:30:in `block (2 levels) in <main>'
	from tmp/mem_leak.rb:29:in `times'
	from tmp/mem_leak.rb:29:in `block in <main>'
	from tmp/mem_leak.rb:27:in `times'
	from tmp/mem_leak.rb:27:in `<main>'

After

rss: 137556 live objects 306646
rss: 259576 live objects 314866
rss: 261052 live objects 336258
rss: 268052 live objects 333226
rss: 269516 live objects 327710
rss: 270436 live objects 338718
rss: 269828 live objects 329114
rss: 269064 live objects 325514
rss: 271112 live objects 337218
rss: 271224 live objects 327934
rss: 273624 live objects 343234
rss: 271752 live objects 333038
rss: 270212 live objects 329618
rss: 272004 live objects 340978
rss: 270160 live objects 333350
rss: 271084 live objects 319266
rss: 272012 live objects 339874
rss: 271564 live objects 331226
rss: 270544 live objects 322366
rss: 268480 live objects 333990
rss: 271676 live objects 330654

Looks like memory stabilizes and live object count stabilizes after the fix.

Summary

The current tooling offered in Ruby offer spectacular visibility into the Ruby runtime. The tooling around this new instrumentation is improving but still rather rough at spots.

As a .NET developer in a previous lifetime I really missed the excellent memory profiler tool, luckily we now have all the information needed to be able to build tools like this going forward.

Good luck hunting memory leaks, I hope this information helps you and please think twice next time you deploy a "unicorn OOM killer".

A huge thank you goes to Koichi Sasada and Aman Gupta for building us all this new memory profiling infrastructure.

PS: another excellent resource worth reading is Oleg Dashevskii's "How I spent two weeks hunting a memory leak in Ruby"

Comments

Steve Gooberman Hill over 1 year ago
Steve Gooberman Hill

very good and interesting article. Ruby has come a long way in some of these areas. There are also possibly some underlying issues we only show up with ruby compilation on alternate (and possiblynot brilliantly supported platforms). I personally have seen memory corruption issues appearing in floats on some ARM platforms with ruby 1.8 (Floats get filled with 0xFF's).

Thxs for sharing this

Steve

over 1 year ago

Thanks for this, quite helpful! I tried your heap dump method, however, the result odd, I keep getting exactly the same lines over and over again using your analyze code. I can run my app, visit a few pages, dump, analyze an the generations and object counts are always exactly the same. What am I doing wrong?

Sam Saffron over 1 year ago
Sam Saffron

Did you run this before you started waiting?

require 'objspace'
ObjectSpace.trace_object_allocations_start
over 1 year ago

Yep, in an initializer of my Rails App. Do I have to run it before every request?

Sam Saffron over 1 year ago
Sam Saffron

Which web server are you using? What version of Ruby?

over 1 year ago

Unicorn with one worker. I used the process ID of that worker. Ruby 2.2.0.

Sam Saffron over 1 year ago
Sam Saffron

Did you try running:

ObjectSpace.trace_object_allocations_start

after fork?

over 1 year ago

No, I will try that, thx.

over 1 year ago

It didn't help. Maybe i am looking at the wrong process, Ill try again.

over 1 year ago

Strange, i looked at the web process now and I get exactly the same analysis result. Odd. Before I looked at a worker process.

RAILS_ENV=production foreman start
09:59:11 web.1 | started with pid 76453
09:59:12 web.1 | I, [2015-05-07T09:59:12.784873 #76453] INFO -- : Refreshing Gem list
09:59:26 web.1 | I, [2015-05-07T09:59:26.978072 #76453] INFO -- : listening on addr=0.0.0.0:5000 fd=11
09:59:26 web.1 | I, [2015-05-07T09:59:26.980775 #76453] INFO -- : master process ready
09:59:26 web.1 | I, [2015-05-07T09:59:26.984145 #76463] INFO -- : worker=0 ready

Edit: So odd, I keep getting this, despite the heap dump files being different size each time:

2.2.0 :029 > Analyzer.new(Rails.root + "/tmp/ruby-heap.dump").analyze
generation objects 371850
generation 45 objects 23359
generation 46 objects 5424
generation 47 objects 16213
generation 48 objects 21012
generation 49 objects 7679
generation 50 objects 20706
generation 51 objects 38938
generation 52 objects 20452
generation 53 objects 18786
generation 54 objects 40028
generation 55 objects 17656
generation 56 objects 40657
generation 57 objects 1803
generation 58 objects 630
generation 59 objects 240
generation 60 objects 216
generation 61 objects 359
generation 62 objects 635
generation 63 objects 640
generation 64 objects 29
generation 65 objects 1478
generation 66 objects 9

Sam Saffron over 1 year ago
Sam Saffron

That looks good, keep running it for a bit, how does it look after say 30 minutes / 1 hour?

Then have a look at the a particular generation to see what kind of objects were retained.

over 1 year ago

Okay, well, I did try that a few times and it keeps giving me the same object counts and generations, which is puzzling me. It doesn't even matter at wich process I run rbtrace on and if the heap dump files have very different file sizes. But I will try for longer.

over 1 year ago

Okay, was running the app for an hour, opening a page that usually causes unlimited memory growth. Got the exact same output. Seems wrong to me.

over 1 year ago

LOL! Found the mistake, got the file path wrong.

Rails.root + "/tmp/ruby-heap.dump"
=> #<Pathname:/tmp/ruby-heap.dump>

Haha, it analyzed the file I generated much earlier at /tmp/ruby-heap.dump instead of the one in my rails apps folder. :smile:

So far the analysis points to the rails router (journey) for a potential leak...

Most retained objects in this line of code: Cat.new(val.first, val.last) in the journey parser. Wonder where Cat is defined and what it does.

Ah, it's probably ActionDispatch::Journey::Nodes::Cat

But strange, the line is only called at boot time.

Sam Saffron over 1 year ago
Sam Saffron

Once you figure out what it is, let me know, try to get some real long running tests and graphs, it will help you a lot.

Optimisticanshul about 1 year ago
Optimisticanshul
bundle exec rbtrace -p $SIDEKIQ_PID -e 'Thread.new{GC.start;require "objspace";io=File.open("/tmp/ruby-heap.dump", "w"); ObjectSpace.dump_all(output: io); io.close}'

running this isn't generating /tmp/ruby-heap.dump

Sergey Alekseev about 1 year ago
Sergey Alekseev

Your article inspired me to create a list of Ruby gems that have memory leaks - https://github.com/ASoftCo/leaky-gems. There are more gems with C code extensions that do not release memory. Check this list out!

Ryan Mulligan 6 months ago
Ryan Mulligan

Thanks for the post: it got me interested in statsd, which seems great. How do you populate statsd with the Unicorn worker memory information?

Sam Saffron 6 months ago
Sam Saffron

I had an older version of this here:

Container is here:

Our current version is a bit more fancy, but follows the same pattern


comments powered by Discourse