Debugging memory leaks in Ruby

over 9 years ago

At some point in the life of every Rails developer you are bound to hit a memory leak. It may be tiny amount of constant memory growth, or a spurt of growth that hits you on the job queue when certain jobs run.

Sadly, most Ruby devs out there simply employ monit , inspeqtor or unicorn worker killers. This allows you to move along and do more important work, tidily sweeping the problem snugly under the carpet.

Unfortunately, this approach leads to quite a few bad side effects. Besides performance pains, instability and larger memory requirements it also leaves the community with a general lack of confidence in Ruby. Monitoring and restarting processes is a very important tool for your arsenal, but at best it is a stopgap and safeguard. It is not a solution.

We have some amazing tools out there for attacking and resolving memory leaks, especially the easy ones - managed memory leaks.

Are we leaking ?

The first and most important step for dealing with memory problems is graphing memory over time. At Discourse we use a combo of Graphite , statsd and Grafana to graph application metrics.

A while back I packaged up a Docker image for this work which is pretty close to what we are running today. If rolling your own is not your thing you could look at New Relic, Datadog or any other cloud based metric provider. The key metric you first need to track is RSS for you key Ruby processes. At Discourse we look at max RSS for Unicorn our web server and Sidekiq our job queue.

Discourse is deployed on multiple machines in multiple Docker containers. We use a custom built Docker container to watch all the other Docker containers. This container is launched with access to the Docker socket so it can interrogate Docker about Docker. It uses docker exec to get all sorts of information about the processes running inside the container.
Note: Discourse uses the unicorn master process to launch multiple workers and job queues, it is impossible to achieve the same setup (which shares memory among forks) in a one container per process world.

With this information at hand we can easily graph RSS for any container on any machine and look at trends:

Long term graphs are critical to all memory leak analysis. They allow us to see when an issue started. They allow us to see the rate of memory growth and shape of memory growth. Is it erratic? Is it correlated to a job running?

When dealing with memory leaks in c-extensions, having this information is critical. Isolating c-extension memory leaks often involves valgrind and custom compiled versions of Ruby that support debugging with valgrind. It is tremendously hard work we only want to deal with as a last resort. It is much simpler to isolate that a trend started after upgrading EventMachine to version 1.0.5.

Managed memory leaks

Unlike unmanaged memory leaks, tackling managed leaks is very straight forward. The new tooling in Ruby 2.1 and up makes debugging these leaks a breeze.

Prior to Ruby 2.1 the best we could do was crawl our object space, grab a snapshot, wait a bit, grab a second snapshot and compare. I have a basic implementation of this shipped with Discourse in MemoryDiagnostics, it is rather tricky to get this to work right. You have to fork your process when gathering the snapshots so you do not interfere with your process and the information you can glean is fairly basic. We can tell certain objects leaked, but can not tell where they were allocated:

3377 objects have leaked
Summary:
String: 1703
Bignum: 1674

Sample Items:
Bignum: 40 bytes
Bignum: 40 bytes
String: 
Bignum: 40 bytes
String: 
Bignum: 40 bytes
String:

If we were lucky we would have leaked a number or string that is revealing and that would be enough to nail it down.

Additionally we have GC.stat that could tell us how many live objects we have and so on.

The information was very limited. We can tell we have a leak quite clearly, but finding the reason can be very difficult.

Note: a very interesting metric to graph is GC.stat[:heap_live_slots] with that information at hand we can easily tell if we have a managed object leak.

Managed heap dumping

Ruby 2.1 introduced heap dumping, if you also enable allocation tracing you have some tremendously interesting information.

The process for collecting a useful heap dump is quite simple:

Turn on allocation tracing

require 'objspace'
ObjectSpace.trace_object_allocations_start

This will slow down your process significantly and cause your process to consume more memory. However, it is key to collecting good information and can be turned off later. For my analysis I will usually run it directly after boot.

When debugging my latest round memory issues with SideKiq at Discourse I deployed an extra docker image to a spare server. This allowed me extreme freedom without impacting SLA.

Next, play the waiting game

After memory has clearly leaked, (you can look at GC.stat or simply watch RSS to determine this happened) run:

io=File.open("/tmp/my_dump", "w")
ObjectSpace.dump_all(output: io); 
io.close

Running Ruby in an already running process

To make this process work we need to run Ruby inside a process that has already started.

Luckily, the rbtrace gem allows us to do that (and much more). Moreover it is safe to run in production.

We can easily force SideKiq to dump a its heap with:

bundle exec rbtrace -p $SIDEKIQ_PID -e 'Thread.new{GC.start;require "objspace";io=File.open("/tmp/ruby-heap.dump", "w"); ObjectSpace.dump_all(output: io); io.close}'

rbtrace runs in a restricted context, a nifty trick is breaking out of the trap context with Thread.new

We can also crack information out of the box with rbtrace, for example:

bundle exec rbtrace -p 6744 -e 'GC.stat'
/usr/local/bin/ruby: warning: RUBY_HEAP_MIN_SLOTS is obsolete. Use RUBY_GC_HEAP_INIT_SLOTS instead.
*** attached to process 6744
>> GC.stat
=> {:count=>49, :heap_allocated_pages=>1960, :heap_sorted_length=>1960, :heap_allocatable_pages=>12, :heap_available_slots=>798894, :heap_live_slots=>591531, :heap_free_slots=>207363, :heap_final_slots=>0, :heap_marked_slots=>335775, :heap_swept_slots=>463124, :heap_eden_pages=>1948, :heap_tomb_pages=>12, :total_allocated_pages=>1960, :total_freed_pages=>0, :total_allocated_objects=>13783631, :total_freed_objects=>13192100, :malloc_increase_bytes=>32568600, :malloc_increase_bytes_limit=>33554432, :minor_gc_count=>41, :major_gc_count=>8, :remembered_wb_unprotected_objects=>12175, :remembered_wb_unprotected_objects_limit=>23418, :old_objects=>309750, :old_objects_limit=>618416, :oldmalloc_increase_bytes=>32783288, :oldmalloc_increase_bytes_limit=>44484250}
*** detached from process 6744

###Analyzing the heap dump

With a rich heap dump at hand we can start analysis, a first report to run is looking at the count of object per GC generation.

When trace object allocation is enabled the runtime will attach rich information next to all allocations. For each object that is allocated while tracing is on we have:

The GC generation it was allocated in
The filename and line number it was allocated in
A truncated value
bytesize
… and much more

The file is in json format and can easily be parsed line by line. EG:

{"address":"0x7ffc567fbf98", "type":"STRING", "class":"0x7ffc565c4ea0", "frozen":true, "embedded":true, "fstring":true, "bytesize":18, "value":"ensure in dispatch", "file":"/var/www/discourse/vendor/bundle/ruby/2.2.0/gems/activesupport-4.1.9/lib/active_support/dependencies.rb", "line":247, "method":"require", "generation":7, "memsize":40, "flags":{"wb_protected":true, "old":true, "long_lived":true, "marked":true}}

A simple report that shows how many objects are left around from which GC generation is a great start for looking at memory leaks. Its a timeline of object leakage.

require 'json'
class Analyzer
  def initialize(filename)
    @filename = filename
  end

  def analyze
    data = []
    File.open(@filename) do |f|
        f.each_line do |line|
          data << (parsed=JSON.parse(line))
        end
    end

    data.group_by{|row| row["generation"]}
        .sort{|a,b| a[0].to_i <=> b[0].to_i}
        .each do |k,v|
          puts "generation #{k} objects #{v.count}"
        end
  end
end

Analyzer.new(ARGV[0]).analyze

For example this is what I started with:

generation  objects 334181
generation 7 objects 6629
generation 8 objects 38383
generation 9 objects 2220
generation 10 objects 208
generation 11 objects 110
generation 12 objects 489
generation 13 objects 505
generation 14 objects 1297
generation 15 objects 638
generation 16 objects 748
generation 17 objects 1023
generation 18 objects 805
generation 19 objects 407
generation 20 objects 126
generation 21 objects 1708
generation 22 objects 369
...

We expect a large number of objects to be retained after boot and sporadically when requiring new dependencies. However we do not expect a consistent amount of objects to be allocated and never cleaned up. So let’s zoom into a particular generation:

require 'json'
class Analyzer
  def initialize(filename)
    @filename = filename
  end

  def analyze
    data = []
    File.open(@filename) do |f|
        f.each_line do |line|
          parsed=JSON.parse(line)
          data << parsed if parsed["generation"] == 18
        end
    end
    data.group_by{|row| "#{row["file"]}:#{row["line"]}"}
        .sort{|a,b| b[1].count <=> a[1].count}
        .each do |k,v|
          puts "#{k} * #{v.count}"
        end
  end
end

Analyzer.new(ARGV[0]).analyze

generation 19 objects 407
/usr/local/lib/ruby/2.2.0/weakref.rb:87 * 144
/var/www/discourse/vendor/bundle/ruby/2.2.0/gems/therubyracer-0.12.1/lib/v8/weak.rb:21 * 72
/var/www/discourse/vendor/bundle/ruby/2.2.0/gems/therubyracer-0.12.1/lib/v8/weak.rb:42 * 72
/var/www/discourse/lib/freedom_patches/translate_accelerator.rb:65 * 15
/var/www/discourse/vendor/bundle/ruby/2.2.0/gems/i18n-0.7.0/lib/i18n/interpolate/ruby.rb:21 * 15
/var/www/discourse/lib/email/message_builder.rb:85 * 9
/var/www/discourse/vendor/bundle/ruby/2.2.0/gems/actionview-4.1.9/lib/action_view/template.rb:297 * 6
/var/www/discourse/lib/email/message_builder.rb:36 * 6
/var/www/discourse/lib/email/message_builder.rb:89 * 6
/var/www/discourse/lib/email/message_builder.rb:46 * 6
/var/www/discourse/lib/email/message_builder.rb:66 * 6
/var/www/discourse/vendor/bundle/ruby/2.2.0/gems/activerecord-4.1.9/lib/active_record/connection_adapters/postgresql_adapter.rb:515 * 5

Further more, we can chase down the reference path to see who is holding the references to the various objects and rebuild object graphs.

The first thing I attacked in this particular case was code I wrote, which is a monkey patch to Rails localization.

Why we monkey patch rails localization?

At Discourse we patch the Rails localization subsystem for 2 reasons:

Early on we discovered it was slow and needed better performance
Recently we started accumulating an enormous amount of translations and needed to ensure we only load translations on demand to keep memory usage lower. (this saves us 20MB of RSS)

Consider the following bench:

ENV['RAILS_ENV'] = 'production'
require 'benchmark/ips'

require File.expand_path("../../config/environment", __FILE__)

Benchmark.ips do |b|
  b.report do |times|
    i = -1
    I18n.t('posts') while (i+=1) < times
  end
end

Before monkey patch:

sam@ubuntu discourse % ruby bench.rb
Calculating -------------------------------------
                         4.518k i/100ms
-------------------------------------------------
                        121.230k (±11.0%) i/s -    600.894k

After monkey patch:

sam@ubuntu discourse % ruby bench.rb
Calculating -------------------------------------
                        22.509k i/100ms
-------------------------------------------------
                        464.295k (±10.4%) i/s -      2.296M

So our localization system is running almost 4 times faster, but … it is leaking memory.

Reviewing the code I could see that the offending line is:

github.com

discourse/discourse/blob/3c6aede1aa98a8456b00ab2d1e01b3f35466323c/lib/freedom_patches/translate_accelerator.rb#L65


      
              # load it
              I18n.backend.load_translations(I18n.load_path.grep Regexp.new("\\.#{locale}\\.yml$"))
          
              @loaded_locales << locale
            end
          end
          
          def translate(*args)
            @cache ||= LruRedux::ThreadSafeCache.new(LRU_CACHE_SIZE)
            found = true
            k = [args, config.locale, config.backend.object_id]
            t = @cache.fetch(k) { found = false }
            unless found
              load_locale(config.locale) unless @loaded_locales.include?(config.locale)
              begin
                t = translate_no_cache(*args)
              rescue MissingInterpolationArgument
                options = args.last.is_a?(Hash) ? args.pop.dup : {}
                options.merge!(locale: config.default_locale)
                key = args.shift
                t = translate_no_cache(key, options)

Turns out, we were sending a hash in from the email message builder that includes ActiveRecord objects, this later was used as a key for the cache, the cache was allowing for 2000 entries. Considering that each entry could involve a large number of AR objects memory leakage was very high.

To mitigate, I changed the keying strategy, shrunk the cache and completely bypassed it for complex localizations:

One day later when looking at memory graphs we can easily see the impact of this change:

This clearly did not stop the leak but it definitely slowed it down.

###therubyracer is leaking

At the top of our list we can see our JavaScript engine therubyracer is leaking lots of objects, in particular we can see the weak references it uses to maintain Ruby to JavaScript mappings are being kept around for way too long.

To maintain performance at Discourse we keep a JavaScript engine context around for turning Markdown into HTML. This engine is rather expensive to boot up, so we keep it in memory plugging in new variables as we bake posts.

Since our code is fairly isolated a repro is trivial, first let’s see how many objects we are leaking using the memory_profiler gem:

ENV['RAILS_ENV'] = 'production'
require 'memory_profiler'
require File.expand_path("../../config/environment", __FILE__)

# warmup
3.times{PrettyText.cook("hello world")}

MemoryProfiler.report do
  50.times{PrettyText.cook("hello world")}
end.pretty_print

At the top of our report we see:

retained objects by location
-----------------------------------
       901  /home/sam/.rbenv/versions/2.1.2.discourse/lib/ruby/2.1.0/weakref.rb:87
       901  /home/sam/.rbenv/versions/2.1.2.discourse/lib/ruby/gems/2.1.0/gems/therubyracer-0.12.1/lib/v8/weak.rb:21
       600  /home/sam/.rbenv/versions/2.1.2.discourse/lib/ruby/gems/2.1.0/gems/therubyracer-0.12.1/lib/v8/weak.rb:42
       250  /home/sam/.rbenv/versions/2.1.2.discourse/lib/ruby/gems/2.1.0/gems/therubyracer-0.12.1/lib/v8/context.rb:97
        50  /home/sam/.rbenv/versions/2.1.2.discourse/lib/ruby/gems/2.1.0/gems/therubyracer-0.12.1/lib/v8/object.rb:8

So we are leaking 54 or so objects each time we cook a post, this adds up really fast. We may also be leaking unmanaged memory here which could compound the problem.

Since we have a line number it is easy to track down the source of the leak

require 'weakref'
class Ref
  def initialize(object)
    @ref = ::WeakRef.new(object)
  end
  def object
    @ref.__getobj__
  rescue ::WeakRef::RefError
    nil
  end
end

class WeakValueMap
   def initialize
      @values = {}
   end

   def [](key)
      if ref = @values[key]
        ref.object
      end
   end

   def []=(key, value)
     @values[key] = V8::Weak::Ref.new(value)
   end
end

This WeakValueMap object keeps growing forever and nothing seems to be clearing values from it. The intention with the WeakRef usage was to ensure we allow these objects to be released when no-one holds references. Trouble is that the reference to the wrapper is now kept around for the entire lifetime of the JavaScript context.

A fix is pretty straight forward:

class WeakValueMap
  def initialize
    @values = {}
  end

  def [](key)
    if ref = @values[key]
      ref.object
    end
  end

  def []=(key, value)
    ref = V8::Weak::Ref.new(value)
    ObjectSpace.define_finalizer(value, self.class.ensure_cleanup(@values, key, ref))

    @values[key] = ref
  end

  def self.ensure_cleanup(values,key,ref)
    proc {
      values.delete(key) if values[key] == ref
    }
  end
end

We define a finalizer on the wrapped object that ensures we clean up the all these wrapping objects and keep on WeakValueMap small.

The effect is staggering:

ENV['RAILS_ENV'] = 'production'
require 'objspace'
require 'memory_profiler'
require File.expand_path("../../config/environment", __FILE__)

def rss
 `ps -eo pid,rss | grep #{Process.pid} | awk '{print $2}'`.to_i
end

PrettyText.cook("hello world")

# MemoryProfiler has a helper that runs the GC multiple times to make
#  sure all objects that can be freed are freed.
MemoryProfiler::Helpers.full_gc
puts "rss: #{rss} live objects #{GC.stat[:heap_live_slots]}"

20.times do

  5000.times { |i|
    PrettyText.cook("hello world")
  }
  MemoryProfiler::Helpers.full_gc
  puts "rss: #{rss} live objects #{GC.stat[:heap_live_slots]}"

end

Before

rss: 137660 live objects 306775
rss: 259888 live objects 570055
rss: 301944 live objects 798467
rss: 332612 live objects 1052167
rss: 349328 live objects 1268447
rss: 411184 live objects 1494003
rss: 454588 live objects 1734071
rss: 451648 live objects 1976027
rss: 467364 live objects 2197295
rss: 536948 live objects 2448667
rss: 600696 live objects 2677959
rss: 613720 live objects 2891671
rss: 622716 live objects 3140339
rss: 624032 live objects 3368979
rss: 640780 live objects 3596883
rss: 641928 live objects 3820451
rss: 640112 live objects 4035747
rss: 722712 live objects 4278779
/home/sam/Source/discourse/lib/pretty_text.rb:185:in `block in markdown': Script Timed Out (PrettyText::JavaScriptError)
	from /home/sam/Source/discourse/lib/pretty_text.rb:350:in `block in protect'
	from /home/sam/Source/discourse/lib/pretty_text.rb:348:in `synchronize'
	from /home/sam/Source/discourse/lib/pretty_text.rb:348:in `protect'
	from /home/sam/Source/discourse/lib/pretty_text.rb:161:in `markdown'
	from /home/sam/Source/discourse/lib/pretty_text.rb:218:in `cook'
	from tmp/mem_leak.rb:30:in `block (2 levels) in <main>'
	from tmp/mem_leak.rb:29:in `times'
	from tmp/mem_leak.rb:29:in `block in <main>'
	from tmp/mem_leak.rb:27:in `times'
	from tmp/mem_leak.rb:27:in `<main>'

After

rss: 137556 live objects 306646
rss: 259576 live objects 314866
rss: 261052 live objects 336258
rss: 268052 live objects 333226
rss: 269516 live objects 327710
rss: 270436 live objects 338718
rss: 269828 live objects 329114
rss: 269064 live objects 325514
rss: 271112 live objects 337218
rss: 271224 live objects 327934
rss: 273624 live objects 343234
rss: 271752 live objects 333038
rss: 270212 live objects 329618
rss: 272004 live objects 340978
rss: 270160 live objects 333350
rss: 271084 live objects 319266
rss: 272012 live objects 339874
rss: 271564 live objects 331226
rss: 270544 live objects 322366
rss: 268480 live objects 333990
rss: 271676 live objects 330654

Looks like memory stabilizes and live object count stabilizes after the fix.

github.com/rubyjs/therubyracer

FIX: severe memory leak in WeakValueMap

rubyjs:master ← SamSaffron:master

opened 01:47AM - 01 Apr 15 UTC

SamSaffron

+10 -1

entries were not being cleaned up correctly from the backing store. Impact is hu…ge: ``` ENV['RAILS_ENV'] = 'production' require 'objspace' require 'memory_profiler' require File.expand_path("../../config/environment", __FILE__) def rss `ps -eo pid,rss | grep #{Process.pid} | awk '{print $2}'`.to_i end PrettyText.cook("hello world") # MemoryProfiler has a helper that runs the GC multiple times to make # sure all objects that can be freed are freed. MemoryProfiler::Helpers.full_gc puts "rss: #{rss} live objects #{GC.stat[:heap_live_slots]}" 20.times do 5000.times { |i| PrettyText.cook("hello world") } MemoryProfiler::Helpers.full_gc puts "rss: #{rss} live objects #{GC.stat[:heap_live_slots]}" end ``` **Before** ``` text rss: 137660 live objects 306775 rss: 259888 live objects 570055 rss: 301944 live objects 798467 rss: 332612 live objects 1052167 rss: 349328 live objects 1268447 rss: 411184 live objects 1494003 rss: 454588 live objects 1734071 rss: 451648 live objects 1976027 rss: 467364 live objects 2197295 rss: 536948 live objects 2448667 rss: 600696 live objects 2677959 rss: 613720 live objects 2891671 rss: 622716 live objects 3140339 rss: 624032 live objects 3368979 rss: 640780 live objects 3596883 rss: 641928 live objects 3820451 rss: 640112 live objects 4035747 rss: 722712 live objects 4278779 /home/sam/Source/discourse/lib/pretty_text.rb:185:in `block in markdown': Script Timed Out (PrettyText::JavaScriptError) from /home/sam/Source/discourse/lib/pretty_text.rb:350:in `block in protect' from /home/sam/Source/discourse/lib/pretty_text.rb:348:in `synchronize' from /home/sam/Source/discourse/lib/pretty_text.rb:348:in `protect' from /home/sam/Source/discourse/lib/pretty_text.rb:161:in `markdown' from /home/sam/Source/discourse/lib/pretty_text.rb:218:in `cook' from tmp/mem_leak.rb:30:in `block (2 levels) in <main>' from tmp/mem_leak.rb:29:in `times' from tmp/mem_leak.rb:29:in `block in <main>' from tmp/mem_leak.rb:27:in `times' from tmp/mem_leak.rb:27:in `<main>' ``` **After** ``` text rss: 137556 live objects 306646 rss: 259576 live objects 314866 rss: 261052 live objects 336258 rss: 268052 live objects 333226 rss: 269516 live objects 327710 rss: 270436 live objects 338718 rss: 269828 live objects 329114 rss: 269064 live objects 325514 rss: 271112 live objects 337218 rss: 271224 live objects 327934 rss: 273624 live objects 343234 rss: 271752 live objects 333038 rss: 270212 live objects 329618 rss: 272004 live objects 340978 rss: 270160 live objects 333350 rss: 271084 live objects 319266 rss: 272012 live objects 339874 rss: 271564 live objects 331226 rss: 270544 live objects 322366 rss: 268480 live objects 333990 rss: 271676 live objects 330654 ``` Recommend a new stable release with this asap. Deploying a monkey patch to Discourse

Summary

The current tooling offered in Ruby offer spectacular visibility into the Ruby runtime. The tooling around this new instrumentation is improving but still rather rough at spots.

As a .NET developer in a previous lifetime I really missed the excellent memory profiler tool, luckily we now have all the information needed to be able to build tools like this going forward.

Good luck hunting memory leaks, I hope this information helps you and please think twice next time you deploy a “unicorn OOM killer”.

A huge thank you goes to Koichi Sasada and Aman Gupta for building us all this new memory profiling infrastructure.

PS: another excellent resource worth reading is Oleg Dashevskii’s “How I spent two weeks hunting a memory leak in Ruby”

2019 EDIT: I followed up this article with: Debugging hidden memory leaks in Ruby

Posted by: Sam Permalink | Comments (20)

Comments

Steve Gooberman Hill over 9 years ago

very good and interesting article. Ruby has come a long way in some of these areas. There are also possibly some underlying issues we only show up with ruby compilation on alternate (and possiblynot brilliantly supported platforms). I personally have seen memory corruption issues appearing in floats on some ARM platforms with ruby 1.8 (Floats get filled with 0xFF’s).

Thxs for sharing this

Steve

about 9 years ago

Thanks for this, quite helpful! I tried your heap dump method, however, the result odd, I keep getting exactly the same lines over and over again using your analyze code. I can run my app, visit a few pages, dump, analyze an the generations and object counts are always exactly the same. What am I doing wrong?

Sam Saffron about 9 years ago

Did you run this before you started waiting?

require 'objspace'
ObjectSpace.trace_object_allocations_start

about 9 years ago

Yep, in an initializer of my Rails App. Do I have to run it before every request?

Sam Saffron about 9 years ago

Which web server are you using? What version of Ruby?

about 9 years ago

Unicorn with one worker. I used the process ID of that worker. Ruby 2.2.0.

Sam Saffron about 9 years ago

Did you try running:

ObjectSpace.trace_object_allocations_start

after fork?

about 9 years ago

No, I will try that, thx.

about 9 years ago

It didn’t help. Maybe i am looking at the wrong process, Ill try again.

about 9 years ago

Strange, i looked at the web process now and I get exactly the same analysis result. Odd. Before I looked at a worker process.

RAILS_ENV=production foreman start
09:59:11 web.1 | started with pid 76453
09:59:12 web.1 | I, [2015-05-07T09:59:12.784873 #76453] INFO – : Refreshing Gem list
09:59:26 web.1 | I, [2015-05-07T09:59:26.978072 #76453] INFO – : listening on addr=0.0.0.0:5000 fd=11
09:59:26 web.1 | I, [2015-05-07T09:59:26.980775 #76453] INFO – : master process ready
09:59:26 web.1 | I, [2015-05-07T09:59:26.984145 #76463] INFO – : worker=0 ready

Edit: So odd, I keep getting this, despite the heap dump files being different size each time:

2.2.0 :029 > Analyzer.new(Rails.root + “/tmp/ruby-heap.dump”).analyze
generation objects 371850
generation 45 objects 23359
generation 46 objects 5424
generation 47 objects 16213
generation 48 objects 21012
generation 49 objects 7679
generation 50 objects 20706
generation 51 objects 38938
generation 52 objects 20452
generation 53 objects 18786
generation 54 objects 40028
generation 55 objects 17656
generation 56 objects 40657
generation 57 objects 1803
generation 58 objects 630
generation 59 objects 240
generation 60 objects 216
generation 61 objects 359
generation 62 objects 635
generation 63 objects 640
generation 64 objects 29
generation 65 objects 1478
generation 66 objects 9

Sam Saffron about 9 years ago

That looks good, keep running it for a bit, how does it look after say 30 minutes / 1 hour?

Then have a look at the a particular generation to see what kind of objects were retained.

about 9 years ago

Okay, well, I did try that a few times and it keeps giving me the same object counts and generations, which is puzzling me. It doesn’t even matter at wich process I run rbtrace on and if the heap dump files have very different file sizes. But I will try for longer.

about 9 years ago

Okay, was running the app for an hour, opening a page that usually causes unlimited memory growth. Got the exact same output. Seems wrong to me.

about 9 years ago

LOL! Found the mistake, got the file path wrong.

Rails.root + "/tmp/ruby-heap.dump"
=> #<Pathname:/tmp/ruby-heap.dump>

Haha, it analyzed the file I generated much earlier at /tmp/ruby-heap.dump instead of the one in my rails apps folder.

So far the analysis points to the rails router (journey) for a potential leak…

Most retained objects in this line of code: Cat.new(val.first, val.last) in the journey parser. Wonder where Cat is defined and what it does.

Ah, it’s probably ActionDispatch::Journey::Nodes::Cat

But strange, the line is only called at boot time.

Sam Saffron about 9 years ago

Once you figure out what it is, let me know, try to get some real long running tests and graphs, it will help you a lot.

Optimisticanshul almost 9 years ago

bundle exec rbtrace -p $SIDEKIQ_PID -e 'Thread.new{GC.start;require "objspace";io=File.open("/tmp/ruby-heap.dump", "w"); ObjectSpace.dump_all(output: io); io.close}'

running this isn’t generating /tmp/ruby-heap.dump

Sergey Alekseev over 8 years ago

Your article inspired me to create a list of Ruby gems that have memory leaks - GitHub - ASoftCo/leaky-gems: A list of Ruby gems that have known memory leaks (and issues). There are more gems with C code extensions that do not release memory. Check this list out!

Ryan Mulligan about 8 years ago

Thanks for the post: it got me interested in statsd, which seems great. How do you populate statsd with the Unicorn worker memory information?

Sam Saffron about 8 years ago

I had an older version of this here:

github.com

discourse/discourse_docker/blob/master/image/monitor/src/monitor.rb

require 'statsd-ruby'
require 'docker'

$statsd = Statsd.new '10.0.0.1', 8125

module Docker
  class CloseConnectionError < StandardError; end
  class Container
    def name
      info["Names"].first[1..-1]
    end

    def stats
      path = path_for(:stats)

      result = nil

      streamer = lambda do |chunk, remaining, total|
        result ||= chunk
        raise CloseConnectionError if result

This file has been truncated. show original

Container is here:

github.com

discourse/discourse_docker/blob/master/image/monitor/Dockerfile

# Used to gather information about CPU and memory
#  reporting it back to statsd

# samsaffron/discourse_monitor
# version 0.0.2

FROM samsaffron/discourse_base:1.0.7
#LABEL maintainer="Sam Saffron \"https://twitter.com/samsaffron\""

ADD src/monitor.rb src/monitor.rb
RUN gem install statsd-ruby docker-api

CMD ruby src/monitor.rb

Our current version is a bit more fancy, but follows the same pattern

Rus Talis almost 7 years ago

Sam… Great forensic work , blowing the lid off leaks… Except it leaves a shallow pit , why it takes 300,000 processes to launch an empty App…mem leakage plugged by stay-alive processes the coder has no idea exist… can these randomly spawn (as do updates and google features) from an overactive OS ?

…mem growth over time is a sure loss of code control… im looking to deploy a 1mil open session client request “discovery” database & certainly wont be using Ruby / Rails… Containers are great but rkt is superior to mainstram Docker… Do u recommend any database in an IB RDMA cluster?

The Elixir model of a 2nd process wachin each process (graceful flow control) is similar at a more atomic level… again is leakage managed by GC (cant stand ground) or by trimming parameters , as in yor case . the tools to pinpoint and kill such processes, or at least actively adjust (update) as per Hot Reloading… For websockets specifically, this allows deployments to a machine that might have millions of active connections without immediately bombarding the server with millions of reconnection attempts, losing any in progress messages. That’s why WhatsApp is built on Erlang, by the way. Hot reloading has been used to deploy updates to flight computers while planes were in the air.

WHAT ABOUT WATCHING THE OS Kernel ?
We use a custom built Docker container to watch all the other Docker containers. This container is launched with access to the Docker socket so it can interrogate Docker about Docker.

It uses docker exec to get all sorts of information about the processes running inside the container.

I am not sure what is a one container per process world ? that does not exist …

" it is impossible to achieve the same setup (which shares memory among forks) in a one container per process world."

Great work… in showing how stealthly mem leakage truly is… only thing i cant agree with is how subservient “After the Fix” appears … cant see how 300,000 live objects is a happy state (apart from improvement on 4.2million) who needs RSS … That’s Not a Memory Leak, It’s Bloat – EngineYard

A high client connect (one mil) server cluster experience, needs to be an available FreeBSD or Centos install using PF_ring… Out of the box framework If u come across one, pls let me know before my cluster kills us… mine is internally switched IB RDMA OFED standard , that pretty much knocks out httpLite offerings… OFED rewrites how data is handled (kernel bypass) Thus i cant see any point using outdated DB structures (and complexity causing bloat

Sam Saffron