Tests that sometimes fail - flaky test tips

almost 5 years ago

A liar will not be believed, even when he speaks the truth. : Aesop

Once you have a project that is a few years old with a large test suite an ugly pattern emerges.

Some tests that used to always work, start “sometimes” working. This starts slowly, “oh that test, yeah it sometimes fails, kick the build off again”. If left unmitigated it can very quickly snowball and paralyze an entire test suite.

Most developers know about this problem and call these tests “non deterministic tests”, “flaky tests”,“random tests”, “erratic tests”, “brittle tests”, “flickering tests” or even “heisentests”.

Naming is hard, it seems that this toxic pattern does not have a well established unique and standard name. Over the years at Discourse we have called this many things, for the purpose of this article I will call them flaky tests, it seems to be the most commonly adopted name.

Much has been written about why flaky tests are a problem.

Martin Fowler back in 2011 wrote:

Non-deterministic tests have two problems, firstly they are useless, secondly they are a virulent infection that can completely ruin your entire test suite.

To this I would like to add that flaky tests are an incredible cost to businesses. They are very expensive to repair often requiring hours or even days to debug and they jam the continuous deployment pipeline making shipping features slower.

I would like to disagree a bit with Martin. Sometimes I find flaky tests are useful at finding underlying flaws in our application. In some cases when fixing a flaky test, the fix is in the app, not in the test.

In this article I would like to talk about patterns we observed at Discourse and mitigation strategies we have adopted.

Patterns that have emerged at Discourse

A few months back we introduced a game.

We created a topic on our development Discourse instance. Each time the test suite failed due to a flaky test we would assign the topic to the developer who originally wrote the test. Once fixed the developer who sorted it out would post a quick post morterm.

This helped us learn about approaches we can take to fix flaky tests and raised visibility of the problem. It was a very important first step.

Following that I started cataloging the flaky tests we found with the fixes at: https://review.discourse.org/tags/heisentest

Recently, we built a system that continuously re-runs our test suite on an instance at digital ocean and flags any flaky tests (which we temporarily disable).

Quite a few interesting patterns leading to flaky tests have emerged which are worth sharing.

Hard coded ids

Sometimes to save doing work in tests we like pretending.

user.avatar_id = 1
user.save!

# then amend the avatar
user.upload_custom_avatar!

# this is a mistake, upload #1 never existed, so for all we know
# the legitimate brand new avatar we created has id of 1. 
assert(user.avatar_id != 1)

This is more or less this example here.

Postgres often uses sequences to decide on the id new records will get. They start at one and keep increasing.

Most test frameworks like to rollback a database transaction after test runs, however the rollback does not roll back sequences.

ActiveRecord::.transaction do
   puts User.create!.id
   # 1
   raise ActiveRecord::Rollback
puts 

puts User.create!.id
# 2

This has caused us a fair amount of flaky tests.

In an ideal world the “starting state” should be pristine and 100% predictable. However this feature of Postgres and many other DBs means we need to account for slightly different starting conditions.

This is the reason you will almost never see a test like this when the DB is involved:

t = Topic.create!
assert(t.id == 1)

Another great, simple example is here.

Random data

Occasionally flaky tests can highlight legitimate application flaws. An example of such a test is here.

data = SecureRandom.hex
explode if data[0] == "0"

Of course nobody would ever write such code. However, in some rare cases the bug itself may be deep in the application code, in an odd conditional.

If the test suite is generating random data it may expose such flaws.

Making bad assumptions about DB ordering

create table test(a int)
insert test values(1)
insert test values(2)

I have seen many times over the years cases where developers (including myself) incorrectly assumed that if you select the first row from the example above you are guaranteed to get 1.

select a from test limit 1

The output of the SQL above can be 1 or it can be 2 depending on a bunch of factors. If one would like guaranteed ordering then use:

select a from test order by a limit 1

This problem assumption can sometimes cause flaky tests, in some cases the tests themselves can be “good” but the underlying code works by fluke most of the time.

An example of this is here another one is here.

A wonderful way of illustrating this is:

[8] pry(main)> User.order('id desc').find_by(name: 'sam').id
  User Load (7.6ms)  SELECT  "users".* FROM "users" WHERE "users"."name" = 'sam' ORDER BY id desc LIMIT 1
=> 25527
[9] pry(main)> User.order('id').find_by(name: 'sam').id
  User Load (1.0ms)  SELECT  "users".* FROM "users" WHERE "users"."name" = 'sam' ORDER BY id LIMIT 1
=> 2498
[10] pry(main)> User.find_by(name: 'sam').id
  User Load (0.6ms)  SELECT  "users".* FROM "users" WHERE "users"."name" = 'sam' LIMIT 1
=> 9931

Even if the clustered index primary key is on id you are not guaranteed to retrieve stuff in id order unless you explicitly order.

Incorrect assumptions about time

My test suite is not flaky, excepts from 11AM UTC till 1PM UTC.

A very interesting thing used to happen with some very specific tests we had.

If I ever checked in code around 9:50am, the test suite would sometimes fail. The problem was that 10am in Sydney is 12am in UTC time (daylight savings depending). That is exactly the time that the clock shifted in some reports causing some data to be in the “today” bucket and other data in the “yesterday” bucket.

This meant that if we chucked data into the database and asked the reports to “bucket” it the test would return incorrect numbers at very specific times during the day. This is incredibly frustrating and not particularly fair on Australia that have to bear the brunt.

An example is here (though the same code went through multiple iterations previously to battle this).

The general solution we have for the majority of these issues is simply to play pretend with time. Test pretends it is 1PM UTC in 2018, then does something, winds clock forward a bit and so on. We use our freeze time helper in Ruby and Sinon.JS in JavaScript. Many other solutions exist including timecop, the fascinating libfaketime and many more.

Other examples I have seen are cases where sleep is involved:

sleep 0.001
assert(elapsed < 1)

It may seem obvious that that I slept for 1 millisecond, clearly less than 1 second passed. But this obvious assumption can be incorrect sometimes. Machines can be under extreme load causing CPU scheduling holdups.

Another time related issue we have experienced is insufficient timeouts, this has plagued our JS test suite. Many integration tests we have rely on sequences of events; click button, then check for element on screen. As a safeguard we like introducing some sort of timeout so the JS test suite does not hang forever waiting for an element to get rendered in case of bugs. Getting the actual timeout duration right is tricky. On a super taxed AWS instance that Travis CI provides much longer timeouts are needed. This issue sometimes is intertwined with other factors, a resource leak may cause JS tests to slowly require longer and longer time.

Leaky global state

For tests to work consistently they often rely on pristine initial state.

If a test amends global variables and does not reset back to the original state it can cause flakiness.

An example of such a spec is here.

class Frog
   cattr_accessor :total_jumps
   attr_accessor :jumps

   def jump
     Frog.total_jumps = (Frog.total_jumps || 0) + 1
     self.jumps = (self.jumps || 0) + 1
   end
end

# works fine as long as this is the first test
def test_global_tracking
   assert(Frog.total_jumps.nil?)
end

def test_jumpy
   frog = Frog.new
   frog.jump
   assert(frog.jumps == 1)
end

Run test_jumpy first and then test_global_tracking fails. Other way around works.

We tend to hit these types of failures due to distributed caching we use and various other global registries that the tests interact with. It is a balancing act cause on one hand we want our application to be fast so we cache a lot of state and on the other hand we don’t want an unstable test suite or a test suite unable to catch regressions.

To mitigate we always run our test suite in random order (which makes it easy to pick up order dependent tests). We have lots of common clean up code to avoid the situations developers hit most frequently. There is a balancing act, our clean up routines can not become so extensive that they cause major slowdown to our test suite.

Bad assumptions about the environment

It is quite unlikely you would have a test like this in your test suite.

def test_disk_space
   assert(free_space_on('/') > 1.gigabyte)
end

That said, hidden more deeply in your code you could have routines that behaves slightly differently depending on specific machine state.

A specific example we had is here.

We had a test that was checking the internal implementation of our process for downloading images from a remote source. However, we had a safeguard in place that ensured this only happened if there was ample free space on the machine. Not allowing for this in the test meant that if you ran our test suite on a machine strained for disk space tests would start failing.

We have various safeguards in our code that could depend on environment and need to make sure we account for them when writing tests.

Concurrency

Discourse contains a few subsystems that depend on threading. The MessageBus that powers live updates on the site, cache synchronization and more uses a background thread to listen on a Redis channel. Our short lived “defer” queue powers extremely short lived non-critical tasks that can run between requests and hijacked controller actions that tend to wait long times on IO (a single unicorn worker can sometimes serve 10s or even 100s of web requests in our setup). Our background scheduler handles recurring jobs.

An example would be here.

Overall, this category is often extremely difficult to debug. In some cases we simply disable components in test mode to ensure consistency, the defer queue runs inline. We also evict threaded component out of our big monolith. I find it significantly simpler to work through and repair a concurrent test suite for a gem that takes 5 seconds to run vs repairing a sub-section in a giant monolith that has a significantly longer run time.

Other tricks I have used is simulating an event loop, pulsing it in tests simulating multiple threads in a single thread. Joining threads that do work and waiting for them to terminate and lots of puts debugging.

Resource leaks

Our JavaScript test suite integration tests have been amongst the most difficult tests to stabilise. They cover large amounts of code in the application and require Chrome web driver to run. If you forget to properly clean up a few event handlers, over thousands of tests this can lead to leaks that make fast tests gradually become very slow or even break inconsistently.

To work through these issues we look at using v8 heap dumps after tests, monitoring memory usage of chrome after the test suite runs.

It is important to note that often these kind of problems can lead to a confusing state where tests consistently work on production CI yet consistently fail on resource strained Travis CI environment.

Mitigation patterns

Over the years we have learned quite a few strategies you can adopt to help grapple with this problem. Some involve coding, others involve discussion. Arguably the most important first step is admitting you have a problem, and as a team, deciding how to confront it.

Start an honest discussion with your team

How should you deal with flaky tests? You could keep running them until they pass. You could delete them. You could quarantine and fix them. You could ignore this is happening.

At Discourse we opted to quarantine and fix. Though to be completely honest, at some points we ignored and we considered just deleting.

I am not sure there is a perfect solution here.

“Deleting and forgetting” can save money at the expense of losing a bit of test coverage and potential app bug fixes. If your test suite gets incredibly erratic, this kind of approach could get you back to happy state. As developers we are often quick to judge and say “delete and forget” is a terrible approach, it sure is drastic and some would judge this to be lazy and dangerous. However, if budgets are super tight this may be the only option you have. I think there is a very strong argument to say a test suite of 100 tests that passes 100% of the time when you rerun it against the same code base is better than a test suite of 200 tests where passing depends on a coin toss.

“Run until it passes” is another approach. It is an attempt to have the cake and eat it at the same time. You get to keep your build “green” without needing to fix flaky tests. Again, it can be considered somewhat “lazy”. The downside is that this approach may leave broken application code in place and make the test suite slower due to repeat test runs. Also, in some cases, “run until it passes” may fail on CI consistently and work on local consistently. How many retries do you go for? 2? 10?

“Do nothing” which sounds shocking to many, is actually surprisingly common. It is super hard to let go of tests you spent time carefully writing. Loss aversion is natural and means for many the idea of losing a test may just be too much to cope with. Many just say “the build is a flake, it sometimes fails” and kick it off again. I have done this in the past. Fixing flaky tests can be very very hard. In some cases where there is enormous amounts of environment at play and huge amounts of surface area, like large scale full application integration tests hunting for the culprit is like searching for a needle in a haystack.

“Quarantine and fix” is my favourite general approach. You “skip” the test and have the test suite keep reminding you that a test was skipped. You lose coverage temporarily until you get around to fixing the test.

There is no, one size fits all. Even at Discourse we sometimes live between the worlds of “Do nothing” and “Quarantine and fix”.

That said, having an internal discussion about what you plan to do with flaky tests is critical. It is possible you are doing something now you don’t even want to be doing, it could be behaviour that evolved.

Talking about the problem gives you a fighting chance.

If the build is not green nothing gets deployed

At Discourse we adopted continuous deployment many years ago. This is our final shield. Without this shield our test suite could have gotten so infected it would likely be useless now.

Always run tests in random order

From the very early days of Discourse we opted to run our tests in random order, this exposes order dependent flaky tests. By logging the random seed used to randomise the tests you can always reproduce a failed test suite that is order dependent.

Sadly `rspec bisect` has been of limited value

One assumption that is easy to make when presented with flaky tests, is that they are all order dependent. Order dependent flaky tests are pretty straightforward to reproduce. You do a binary search reducing the amount of tests you run but maintain order until you find a minimal reproduction. Say test #1200 fails with seed 7, after a bit of automated magic you can figure out that the sequence #22,#100,#1200 leads to this failure. In theory this works great but there are 2 big pitfalls to watch out for.

You may have not unrooted all your flaky tests, if the binary search triggers a different non-order dependent test failure, the whole process can fail with very confusing results.
From our experience with our code base the majority of our flaky tests are not order dependent. So this is usually an expensive wild goose chase.

Continuously hunt for flaky tests

Recently Roman Rizzi introduced a new system to hunt for flaky tests at Discourse. We run our test suite in a tight loop, over and over again on a cloud server. Each time tests fail we flag them and at the end of a week of continuous running we mark flaky specs as “skipped” pending repair.

This mechanism increased test suite stability. Some flaky specs may only show up 1 is 1000 runs. At snail pace, when running tests once per commit, it can take a very long time to find these rare flakes.

Quarantine flaky tests

This brings us to one of the most critical tools at your disposal. “Skipping” a flaky spec is a completely reasonable approach. There are though a few questions you should explore:

Is the environment flaky and not the test? Maybe you have a memory leak and the test that failed just hit a threshold?
Can you decide with confidence using some automated decision metric that a test is indeed flaky

There is a bit of “art” here and much depends on your team and your comfort zone. My advice here though would be to be more aggressive about quarantine. There are quite a few tests over the years I wish we quarantined earlier, which cause repeat failures.

Run flaky tests in a tight loop randomizing order to debug

One big issue with flaky tests is that quite often they are very hard to reproduce. To accelerate a repro I tend to try running a flaky test in a loop.

100.times do
   it "should not be a flake" do
      yet_it_is_flaky
   end
end

This simple technique can help immensely finding all sorts of flaky tests. Sometimes it makes sense to have multiple tests in this tight loop, sometimes it makes sense to drop the database and Redis and start from scratch prior to running the tight loop.

Invest in a fast test suite

For years at Discourse we have invested in speeding up to our test suite. There is a balancing act though, on one hand the best tests you have are integration tests that cover large amounts of application code. You do not want the quest for speed to compromise the quality of your test suite. That said there is often large amount of pointless repeat work that can be eliminated.

A fast test suite means

It is faster for you to find flaky tests
It is faster for you to debug flaky tests
Developers are more likely to run the full test suite while building pieces triggering flaky tests

At the moment Discourse has 11,000 or so Ruby tests it takes them 5m40s to run single threaded on my PC and 1m15s or so to run tests concurrently.

Getting to this speed involves a regular amount of “speed maintenance”. Some very interesting recent things we have done:

Daniel Waterworth introduced test-prof into our test suite and refined a large amount of tests to use: the let_it_be helper it provides (which we call fab! cause it is awesome and it fabricates). Prefabrication can provide many of the speed benefits you get from fixtures without inheriting the many of the limitations fixtures prescript.
David Taylor introduced the parallel tests gem which we use to run our test suite concurrently saving me 4 minutes or so each time I run the full test suite. Built-in parallel testing is coming to Rails 6 thanks to work by Eileen M. Uchitelle and the Rails core team.

On top of this the entire team have committed numerous improvements to the test suite with the purpose of speeding it up. It remains a priority.

Add purpose built diagnostic code to debug flaky tests you can not reproduce

A final trick I tend to use when debugging flaky tests is adding debug code.

An example is here.

Sometimes, I have no luck reproducing locally no matter how hard I try. Diagnostic code means that if the flaky test gets triggered again I may have a fighting chance figuring out what state caused it.

def test_something
   make_happy(user)
   if !user.happy
      STDERR.puts "#{user.inspect}"
   end
    assert(user.happy)
end

Let’s keep the conversation going!

Do you have any interesting flaky test stories? What is your team’s approach for dealing with the problem? I would love to hear more so please join the discussion on this blog post.

Extra reading

Eradicating Non-Determinism in Tests by Martin Fowler
Google: Where do our flaky tests come from?
An Empirical Analysis of Flaky Tests (pdf) - Qingzhou Luo, Farah Hariri, Lamyaa Eloussi, Darko Marinov
Microsoft: Eliminating Flaky tests
Flaky tests at GitLab
iDFlakies: A Framework for Detecting and Partially Classifying Flaky Tests by Wing Lam, Reed Oei, August Shi, Darko Marinov, Tao Xie

Posted by: Sam Permalink | Comments (11)

Comments

Sean Riordan almost 5 years ago

Great article!
A few others i’ve run into are:
Unique Data - often there is a hidden dependency on unique objects/data within your test arrangement that will flake-on-you unless explicitly set. Another reason to avoid random values.
Specific Ordering - sometimes we expect specific ordering of an array (for instance) when no such ordering is required.
Transaction meets SQL: tests that run within a transaction are usually faster because they avoid hitting the database, but sometimes tests will include use of SQL directly, which your transaction doesn’t know about, and will leave your global db state dirty.

Regarding mitigation patterns, I use the “quarantine & fix” when a spec fails back-to-back, however, if a spec fails out of the blue and doesn’t seem to return, I use a “3 strikes and your skipped” which is a hybrid of “do nothing (tracking the failure in a google sheet)” and “quarantine & fix” after 3 failures. This has proven to help passively prioritize which specs I immediately quarantine/fix.

James Farrier almost 5 years ago

I used to work at a company with over 10,000 tests where we weren’t able to get more than an 80% pass rate due to flaky tests. This article is great and covers a lot of the options for handling flaky tests. I founded Appsurify to make it easy for companies to handle flaky tests, with minimal effort.

First, don’t delete them, flaky tests are still valuable and can still find bugs. We also had the challenge where a lot of the ‘flakiness’ was not the test or the application’s fault but was caused by 3rd party providers. Even at Google “Almost 16% of our tests have some level of flakiness associated with them!” - John Micco, so just writing tests that aren’t flaky isn’t always possible.

Appsurify automatically raises defects when tests fail, and if the failure reason looks to be ‘flakiness’ (based on failure type, when the failure occurred, the change being made, previous known flaky failures) then we raise the defect as a “flaky” defect. Teams can then have the build fail based only on new defects and prevent it from failing when there are flaky test results.

We also prioritize the tests, which causes fewer tests to be run which are more likely to fail due to a real defect, which also reduces the number of flaky test results.

Sam Saffron almost 5 years ago

Yeah we have had these as well quite a few times. In our case usually the root cause was usually DB ordering assumptions. It is funny to see how dealing with this pattern evolved on the test side (for us)

expect(user_ids).to eq([1,2,3])

# which then sometimes became

expect(user_ids.sort).to eq([1,2,3])

# and we have kind of settled on this now cause the errors we get are cleanest

expect(user_ids).to contain_exactly([1,2,3])

Curious, how are the mechanics of this implemented, we always run tests in transactions but any raw SQL we run also run in the same transaction, at the end of the test we rewind back to a position in the transaction. This technique allows us to have many tests share DB state.

- Transaction starts 

common test subset DB setup here

- Transaction checkpoint

run test1 here

- Rewind to checkpoint

run test2 here

- Rewind to checkpoint

- Transaction ends

Sean Riordan almost 5 years ago

Regarding transaction behavior, in my case, i’m working with RSpec on a Ruby/Rails app. The transactions are managed with the DatabaseCleaner gem, and the issues described are likely limited to its implementation. Rewinding to different points within a single transaction sounds really interesting, and i’m not familiar with comparable options in RSpec testing - but now you’ve got me curious!

Andrew Kozin almost 5 years ago

Regarding the conflict between hardcoded ids and database sequences, I would propose a better solution. The trick is easy, you could just set a minimal value for every sequence before starting tests.

For example, you can set those values to 1000000, and then enjoy using any id below this number explicitly — it won’t conflict with an id autogenerated by the database.

That’s how the trick works on Ruby on Rails / Postgresl

Samuel Cochran almost 5 years ago

We went on a similar journey with flakey tests a while back. Most of them were database state issues, or race conditions in feature specs. Keith posted some of the tools we added to help identify and fix these flakey failures and maybe they’d be useful to you, too! Although I suspect you do most of these in Discourse anyway.

Sam Saffron almost 5 years ago

That is a very nice article, lots of good stuff there have not come across it.

I love the takeaway there " Treat your test code just like production code", I feel a lot of the slippery slope we find ourselves going through involves

Tests failed …
Oh it is just tests, does not matter.

The article by Keith talks about the importance of debugging, I 100% agree there, getting good logs can mean the difference between spending 5 minutes on a flaky test to multiple hours.

Kind of mixed on a strong requirement to reset sequences and super happy we get away without a database cleaner in Discourse cause we simply rewind PG transactions for cheaper fabrications of db chains. For us to have a DB state leak we need multiple threads involved.

@Andrew_Kozin regarding using 1,000,000 its one of those solutions that really irks developers though, in practice, works 100% of the time. If you are going with that kind of pattern a simpler trick is using negative numbers which are always out of sequence: User.create!(id: -1, name: 'bob') - some code base (Discourse included, does make some assumptions about negative ids, so your mileage may vary here)

almost 5 years ago

Nice write-up. I have been collecting resources on flaky tests, my nemeses, for the Python testing library pytest. Collected here: Flaky tests — pytest documentation

Sam Saffron almost 5 years ago

Nice, thanks for that extra resource, lots of good stuff there!

Vic Wu almost 5 years ago

I like your article!
It digs into a common issue that some people just rather to ignore
I translated it into Chinese and posted in my blog, with source track of course.
Hope you are happy with it.

Sam Saffron almost 5 years ago

Thank you @Ke_Wu! I am happy with it

Sam Saffron