How do you go about debugging high CPU usage in a production Rails system?


Today I noticed one of our customer containers was running really high on CPU.

# top -bn1 

190 discour+  20   0 2266292 205128  15656 S  86.7  0.3   9377:35 ruby

# ps aux

discour+   190 19.4  0.3 2266292 207096 ?      Sl    2017 9364:38 sidekiq 5.0.5 discourse [1 of 5 busy]

Looks like sidekiq is stuck on a job.

Where is it stuck?

Usually, this is where the story ends and another series of questions start

  • Can we reproduce this on staging or development?

  • What code changed recently?

  • Why is perf trace not giving me anything I can work with?

  • How awesome is my Sidekiq logging?

  • Where is my divining rod?

Julia Evans is working on a fantastic profiler that will allow us to answer this kind of question real quick. But in the meantime, is there anything we can do?

rbtrace + stackprof

At Discourse we include rbtrace and stackprof in our Gemfile.

gem 'rbtrace'
gem 'stackprof', require: false

We always load up rbtrace in production, it allows us a large variety of production level debugging. stackprof is loaded on-demand.

In this particular case I simply run:

# rbtrace -p 190 -e 'Thread.new{ require "stackprof"; StackProf.start(mode: :cpu); sleep 2; StackProf.stop; StackProf.results("/tmp/perf"); }'

This injects a new thread that enables stackprof globally and finally writes the performance data to /tmp/perf

This dump can easily be analyzed:

# stackprof /tmp/perf
==================================
  Mode: cpu(1000)
  Samples: 475 (0.63% miss rate)
  GC: 0 (0.00%)
==================================
     TOTAL    (pct)     SAMPLES    (pct)     FRAME
       473  (99.6%)         473  (99.6%)     #<Class:0x00007fac08218020>::Typography.to_html_french
         4   (0.8%)           1   (0.2%)     #<Module:0x00007fac080c3620>.reap_connections

In fact, we can even collect backtraces and generate flamegraphs in stackprof, so I strongly recommend reading through the readme.

So there you have it, my culprit is:

text.gsub(/(\s|)+([!?;]+(\s|\z))/, '&thinsp;\2\3')

A regular expression consuming 100% CPU in production, this has got to be a first :mage:

Comments

Sam Saffron 11 months ago
Sam Saffron

Some people asked me what was the text that exploded the regex and how did I fix it:

A repro is here

For my part I removed the offending code for now. I am not even entirely sure what my customer was trying to do with the regex. :flushed:

Kevin Vo 11 months ago
Kevin Vo

Can you guys help me? Do you know what’s wrong?

hello_world$ rbtrace -p 14369 -e "Thread.new{ require ‘stackprof’; StackProf.start(mode: :cpu); sleep 2; StackProf.stop;"
Error: argument --pid (pid is not listening for messages, did you require "rbtrace").

Юрий Соколов 11 months ago
Юрий Соколов

Equivalent regexp is gsub(/\s*([!?;]+(\s|\z))/, '&thinsp;\1\2') . I think it should be faster, though I didn’t check. By the way, way it double last capture (space or end)? It is already part of surrounding capture.

Sam Saffron 11 months ago
Sam Saffron

My guess here is that you did not include rbtrace in your Gemfile. You got to add it and re-deploy.

Sam Saffron 11 months ago
Sam Saffron

Thanks, honestly I am not sure why this was done in the first place. It is weird to change raw markdown that people enter into the composer and this regex was trying to do that. The correct way would be to enforce strange typography when you convert the raw markdown to cooked html.

Maciek 10 months ago
Maciek

Was the regex put there by a developer (in the code) or it was provided by a user (from a webpage and then loaded from the DB)?

Sam Saffron 10 months ago
Sam Saffron

It was provided by a developer, it was part of a customer specific plugin.

Lukáš Zapletal 10 months ago
Lukáš Zapletal

Nice. Just for the record, on Red Hat systems you can use SystemTap too. Pretty neat:

https://lukas.zapletalovi.com/2016/08/probing-ruby-20-apps-with-systemtap-in-rhel7.html

You must use system or SCL Ruby tho, wont work with RVM/RBENV compiled (unpatched) versions.

Sam Saffron 10 months ago
Sam Saffron

Hmmm is this just using the dtrace stuff? Can it not be compiled by hand somehow? What are they doing in SCL Ruby?

Lukáš Zapletal 10 months ago
Lukáš Zapletal

Yeah, SystemTap is essentialy dtrace clone used in Fedora and Red Hat clones and it is compatible with dtrace to some degree (I haven’t tried myself - it can somehow use dtrace points). Currently Ruby in Fedora and RHEL ships with a “tapset” file which enables the integration - there is no need of patching Ruby if I understand this correctly. It should work with upstream Ruby if you have SystemTap enabled kernel and tooling I believe.

In case you are interested, I’ve built a tool around SystemTap that allows you to have “top-like” experience for ad-hoc debug sessions. It’s just a shell wrapper around some examples from my blogpost: https://github.com/lzap/foreman-tracer

The stp file for current Fedora version is at https://src.fedoraproject.org/cgit/rpms/ruby.git/tree/libruby.stp

Edit: After some digging, I think that this tapset leverages dtrace hooks in Ruby upstream, thus no patches needed. Anyway the key thing is - SystemTap is present and enabled by default on Fedora, RHEL, CentOS and others. That’s super awesome. Not sure if you guys use non-redhat distros or custom kernel tho. I find this cool - no changes needed to investigate performance issues :slight_smile:

Dan Benjamin 10 months ago
Dan Benjamin

Hey Sam, this is great.
When trying to run this locally I get the following error. Have any idea whats its about:

➜  ~ ps ux | grep ruby
user 99397   0.0  0.9  4446480 154392 s001  S+    1:40PM   0:10.81 /Users/user/.rbenv/versions/2.4.1/bin/ruby bin/rails c

➜  core git:(development) ✗ rbtrace -p 99397 -e 'Thread.new{ require "stackprof"; StackProf.start(mode: :cpu); sleep 2; StackProf.stop; StackProf.results("/tmp/perf"); }'
*** attached to process 99397
*** detached from process 99397
/Users/user/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/rbtrace-0.4.10/lib/rbtrace/rbtracer.rb:324:in `send_cmd': command is too long (ArgumentError)
	from /Users/user/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/rbtrace-0.4.10/lib/rbtrace/rbtracer.rb:146:in `eval'
	from /Users/user/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/rbtrace-0.4.10/lib/rbtrace/cli.rb:506:in `run'
	from /Users/user/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/rbtrace-0.4.10/bin/rbtrace:5:in `<top (required)>'
	from /Users/user/.rbenv/versions/2.4.1/bin/rbtrace:23:in `load'
	from /Users/user/.rbenv/versions/2.4.1/bin/rbtrace:23:in `<main>'
Sam Saffron 10 months ago
Sam Saffron

We are super restricted on the amount of buffer we can pass to rbtrace (something I would like to fix) can you try squashing spaces there to see if it starts working?

'Thread.new{require "stackprof";StackProf.start(mode: :cpu);sleep 2;StackProf.stop;StackProf.results("/tmp/perf");}'

comments powered by Discourse