Debugging 100% CPU usage in production Ruby on Rails systems
about 7 years ago
Today I noticed one of our customer containers was running really high on CPU.
# top -bn1
190 discour+ 20 0 2266292 205128 15656 S 86.7 0.3 9377:35 ruby
# ps aux
discour+ 190 19.4 0.3 2266292 207096 ? Sl 2017 9364:38 sidekiq 5.0.5 discourse [1 of 5 busy]
Looks like sidekiq is stuck on a job.
Where is it stuck?
Usually, this is where the story ends and another series of questions start
-
Can we reproduce this on staging or development?
-
What code changed recently?
-
Why is
perf trace
not giving me anything I can work with? -
How awesome is my Sidekiq logging?
-
Where is my divining rod?
Julia Evans is working on a fantastic profiler that will allow us to answer this kind of question real quick. But in the meantime, is there anything we can do?
rbtrace + stackprof
At Discourse we include rbtrace and stackprof in our Gemfile.
gem 'rbtrace'
gem 'stackprof', require: false
We always load up rbtrace in production, it allows us a large variety of production level debugging. stackprof is loaded on-demand.
In this particular case I simply run:
# rbtrace -p 190 -e 'Thread.new{ require "stackprof"; StackProf.start(mode: :cpu); sleep 2; StackProf.stop; StackProf.results("/tmp/perf"); }'
This injects a new thread that enables stackprof globally and finally writes the performance data to /tmp/perf
This dump can easily be analyzed:
# stackprof /tmp/perf
==================================
Mode: cpu(1000)
Samples: 475 (0.63% miss rate)
GC: 0 (0.00%)
==================================
TOTAL (pct) SAMPLES (pct) FRAME
473 (99.6%) 473 (99.6%) #<Class:0x00007fac08218020>::Typography.to_html_french
4 (0.8%) 1 (0.2%) #<Module:0x00007fac080c3620>.reap_connections
In fact, we can even collect backtraces and generate flamegraphs in stackprof, so I strongly recommend reading through the readme.
So there you have it, my culprit is:
text.gsub(/(\s|)+([!?;]+(\s|\z))/, ' \2\3')
A regular expression consuming 100% CPU in production, this has got to be a first
Some people asked me what was the text that exploded the regex and how did I fix it:
A repro is here
For my part I removed the offending code for now. I am not even entirely sure what my customer was trying to do with the regex.