Fastest way to profile a method in Ruby
almost 7 years ago
Lately I needed to add some instruments into Discourse. I wanted every request to measure:
-
How many SQL statements and Redis commands are executed?
-
How much time was spent in those sections?
This can easily be abstracted to a more generalize problem of
class ExternalClass
def method_to_measure
end
end
Solving this problem breaks down into a few particular sub-problems
-
What is the fastest way of measuring elapsed time in a method?
-
What is the fastest patch to a method that can add instrumentation?
-
How does one account for multithreading?
Fastest way to measure a method
The first and simplest problem to solve is pretending we do not have to patch anything.
If we had access to all the source and could add all the instruments we wanted to, what is the quickest way of measuring elapsed time?
There are two techniques we can use in Ruby.
-
Create a
Time
object before and after and delta. -
Use
Process.clock_gettime
to return aFloat
before and after and delta.
Let’s measure:
require 'benchmark/ips'
class Test
attr_reader :count, :duration
def initialize
@count = 0
@duration = 0.0
end
def work
end
def method
work
end
def time_method
@count += 1
t = Time.now
work
ensure
@duration += Time.now - t
end
def process_clock_get_time_method
@count += 1
t = Process.clock_gettime(Process::CLOCK_MONOTONIC)
work
ensure
@duration += Process.clock_gettime(Process::CLOCK_MONOTONIC) - t
end
end
t = Test.new
Benchmark.ips do |b|
b.report "method" do |times|
i = 0
while i < times
t.method
i += 1
end
end
b.report "time_method" do |times|
i = 0
while i < times
t.time_method
i += 1
end
end
b.report "process_clock_get_time_method" do |times|
i = 0
while i < times
t.process_clock_get_time_method
i += 1
end
end
end
# Calculating -------------------------------------
# method 19.623M (± 3.5%) i/s - 98.227M in 5.012204s
# time_method 1.596M (± 1.2%) i/s - 8.061M in 5.050321s
# process_clock_get_time_method
# 4.972M (± 1.7%) i/s - 24.908M in 5.011634s
As expected Process.clock_gettime(Process::CLOCK_MONOTONIC)
is fastest cause no ruby object is allocated and later garbage collected, it also happens to be the most accurate way of measuring duration.
So let’s use that from now.
A note on “not doing work”
Our method above was doing no work, if we start doing work, the difference will be much less observable.
For example, if I replace work with $redis.get('x')
which is pretty much the shortest time I want to instrument I will get:
method 17.567k (± 3.0%) i/s - 88.230k in 5.027068s
time_method 16.730k (± 3.2%) i/s - 84.864k in 5.077868s
process_clock_get_time_method
17.112k (± 2.7%) i/s - 85.884k in 5.022470s
So we have 17.5 vs 17.11 vs 16.7 thousand redis gets a second. Fastest being uninstrumented, slowest being using Time.now.
This is not a HUGE difference like the original benchmark, but you can not argue that Time.now
is a tax you want to pay when an faster and safer alternative exists.
Fastest way to “patch” a method
When instrumenting code we do not own, we do not have the luxury of making source level changes.
This leaves us with 3 alternatives:
- Using
prepend
to prepend a module in the class and override the original method - Using
alias_method
and chaining our new method around the original - Replacing the method with a “patched” version of the method
Replacing the method is fraught with risk, cause once the original method in a 3rd party library changes we need to fix our replacement. It is unmaintainable long term. An interesting option that may pop up later is outputting the instruction sequence using RubyVM::InstructionSequence#of
, patching the instruction sequence and then loading up the method. This will clearly be the fastest way of patching, but is not quite available yet cause instruction sequence manipulation is still a bit of a dark art.
So let’s compare prepend
with alias_method
, what is going to be faster?
require 'benchmark/ips'
#require 'redis'
$redis = Redis.new
class Test
attr_reader :count
def initialize
@count = 0
end
def work
# $redis.get 'X'
end
def method
work
end
def method_with_prepend
work
end
def method_with_alias_method
work
end
def method_with_manual_instrument
@count += 1
work
end
end
module PrependInstrument
def method_with_prepend
@count += 1
super
end
end
class Test; prepend PrependInstrument; end
class Test
alias_method :method_with_alias_method_orig, :method_with_alias_method
def method_with_alias_method
@count += 1
method_with_alias_method_orig
end
end
t = Test.new
Benchmark.ips do |b|
b.report "method" do |times|
i = 0
while i < times
t.method
i += 1
end
end
b.report "method_with_prepend" do |times|
i = 0
while i < times
t.method_with_prepend
i += 1
end
end
b.report "method_with_alias_method" do |times|
i = 0
while i < times
t.method_with_alias_method
i += 1
end
end
b.report "method_with_manual_instrument" do |times|
i = 0
while i < times
t.method_with_manual_instrument
i += 1
end
end
end
# method 20.403M (± 1.6%) i/s - 102.084M in 5.004684s
# method_with_prepend 10.339M (± 1.5%) i/s - 51.777M in 5.009321s
# method_with_alias_method
# 13.067M (± 1.8%) i/s - 65.649M in 5.025786s
# method_with_manual_instrument
# 16.581M (± 1.6%) i/s - 83.145M in 5.015797s
So it looks like the very elegant prepend
method of patching methods is slower than the old school alias method trick.
However once you introduce work the difference is far more marginal, this is the same bench with $redis.get "x"
as the unit of work
method 17.259k (± 3.1%) i/s - 86.892k in 5.039493s
method_with_prepend 17.056k (± 2.8%) i/s - 85.782k in 5.033548s
method_with_alias_method
17.464k (± 2.4%) i/s - 87.464k in 5.011238s
method_with_manual_instrument
17.369k (± 2.8%) i/s - 87.204k in 5.024699s
Performance is so close it is hard to notice anything.
Accounting for threading
One big real world concern I had for my use case is ensuring I only measure work from my current thread.
In Discourse some background threads could be running doing deferred work and I do not want that to interfere with the measurements on the thread servicing the web. In a Puma based environment this is supercritical cause lots of threads could be accessing the same global thread safe Redis connection, you may also want to “enable” instrumenting prior to having a connection open.
So we can start by measuring the impact of thread safety.
require 'benchmark/ips'
require 'redis'
$redis = Redis.new
class Test
attr_reader :count
def initialize
@count = 0
@thread = Thread.current
data = Struct.new(:count).new
data.count = 0
Thread.current["prof"] = data
end
def work
# $redis.get 'X'
end
def method
work
end
def method_with_safety
@count += 1 if Thread.current == @thread
work
end
def method_with_thread_storage
if data = Thread.current["count"]
data.count += 1
end
work
end
end
t = Test.new
Benchmark.ips do |b|
b.report "method" do |times|
i = 0
while i < times
t.method
i += 1
end
end
b.report "method_with_safety" do |times|
i = 0
while i < times
t.method_with_safety
i += 1
end
end
b.report "method_with_thread_storage" do |times|
i = 0
while i < times
t.method_with_thread_storage
i += 1
end
end
end
There are 2 mechanisms we can account for thread safety:
-
Only run instrumentation if
Thread.current
is the thread being profiled -
Store instrumentation data on
Thread.current
(note an external registry is no faster)
We can see the faster method is using a Thread.current
comparison. However, once you start doing work the difference is quite marginal. Safety comes at a cost but it is not a ridiculous cost.
### NO WORK
method 20.119M (± 3.1%) i/s - 100.579M in 5.004172s
method_with_safety 11.254M (± 3.0%) i/s - 56.509M in 5.025966s
method_with_thread_storage
6.105M (± 3.5%) i/s - 30.559M in 5.011887s
With $redis.get 'x'
method 18.007k (± 4.8%) i/s - 91.052k in 5.069131s
method_with_safety 18.081k (± 2.9%) i/s - 91.156k in 5.045869s
method_with_thread_storage
18.042k (± 2.8%) i/s - 90.532k in 5.021889s
A generic method profiler
So now, we can piece all of the bits together use the fastest micro benches and build up a general method profiler we can use to gather metrics.
# called once to patch in instrumentation
MethodProfiler.patch(SomeClass, [:method1, :method2], :name)
SomeClass.new.method1
SomeClass.new.method2
# called when starting profiling
MethodProfiler.start
result = MethodProfiler.stop
{
total_duration: 0.2,
name: {duration: 0.111, calls: 12}
}
Which can be implemented with:
require 'benchmark/ips'
require 'redis'
class MethodProfiler
def self.patch(klass, methods, name)
patches = methods.map do |method_name|
<<~RUBY
unless defined?(#{method_name}__mp_unpatched)
alias_method :#{method_name}__mp_unpatched, :#{method_name}
def #{method_name}(*args, &blk)
unless prof = Thread.current[:_method_profiler]
return #{method_name}__mp_unpatched(*args, &blk)
end
begin
start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
#{method_name}__mp_unpatched(*args, &blk)
ensure
data = (prof[:#{name}] ||= {duration: 0.0, calls: 0})
data[:duration] += Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
data[:calls] += 1
end
end
end
RUBY
end.join("\n")
klass.class_eval patches
end
def self.start
Thread.current[:_method_profiler] = {
__start: Process.clock_gettime(Process::CLOCK_MONOTONIC)
}
end
def self.stop
finish = Process.clock_gettime(Process::CLOCK_MONOTONIC)
if data = Thread.current[:_method_profiler]
Thread.current[:_method_profiler] = nil
start = data.delete(:__start)
data[:total_duration] = finish - start
data
end
end
end
$redis = Redis.new
class Test
def work
#$redis.get 'X'
end
def method
work
end
def method_unpatched
work
end
end
MethodProfiler.patch(Test, [:method], :a_thing)
MethodProfiler.start
t = Test.new
Benchmark.ips do |b|
b.report "method" do |times|
i = 0
while i < times
t.method
i += 1
end
end
b.report "method_unpatched" do |times|
i = 0
while i < times
t.method_unpatched
i += 1
end
end
end
Performance wish there is a definite impact for the “no work scenario”
(while profiling)
method 2.247M (± 1.6%) i/s - 11.370M in 5.061768s
method_unpatched 20.261M (± 3.5%) i/s - 101.226M in 5.002638s
(without profiling)
method 6.015M (± 2.9%) i/s - 30.156M in 5.017655s
method_unpatched 20.211M (± 2.1%) i/s - 101.288M in 5.013976s
However once we actually start doing work with say $redis.get 'x'
(while profiling)
method 17.419k (± 3.5%) i/s - 87.108k in 5.006821s
method_unpatched 17.627k (± 7.3%) i/s - 88.740k in 5.063757s
(without profiling)
method 17.293k (± 3.1%) i/s - 87.672k in 5.074642s
method_unpatched 17.478k (± 2.4%) i/s - 88.556k in 5.069598s
Conclusion
Building on all the micro benchmarks we can design a generic class for profiling methods quite efficiently in Ruby with very minimal runtime impact.
Feel free to borrow any of the code in this article with or without attribution and build upon it.
When I’m able to run a code, I profile it like this