How we manage schema changes at Discourse minimizing downtime


At Discourse we have always been huge fans of continuous deployment. Every commit we make heads to our continuous integration test suite. If all the tests pass (ui, unit, integration, smoke) we automatically deploy the latest version of our code to https://meta.discourse.org.

This pattern and practice we follow allows the thousands self-installers out there to safely upgrade to the tests-passed version whenever they feel like it.

Because we deploy so often we need to take extra care not to have any outages during deployments. One of the most common reasons for outages during application deployment is database schema changes.

The problem with schema changes

Our current deployment mechanism roughly goes as follows:

  • Migrate database to new schema
  • Bundle up application into a single docker image
  • Push to registry
  • Spin down old instance, pull new instance, spin up new instance (and repeat)

If we ever create an incompatible database schema we risk breaking all the old application instances running older versions of our code. In practice, this can lead to tens of minutes of outage! :boom:

In ActiveRecord the situation is particularly dire cause in production the database schema is cached and any changes in schema that drop or rename columns very quickly risk breaking every query to the affected model raising invalid schema exceptions.

Over the years we have introduced various patterns to overcome this problem and enable us to deploy schema changes safely, minimizing outages.

Tracking rich information about migrations

ActiveRecord has a table called schema_migrations where is stores information about migrations that ran.

Unfortunately the amount of data stored in this table is extremely limited, in fact it boils down to:

connection.create_table(table_name, id: false) do |t|
  t.string :version, version_options
end

The table has a lonely column storing the “version” of migrations that ran.

  1. It does not store when the migration ran
  2. It does not store how long it took the migration to run
  3. It has nothing about the version of Rails that was running when the migration ran

This lack of information, especially, not knowing when stuff ran makes creating clean systems for dealing with schema changes hard to build. Additionally, debugging strange and wonderful issues with migrations is very hard without rich information.

Discourse, monkey patches Rails to log rich information about migrations:

Our patch provides us a very rich details surrounding all the migration circumstances. This really should be in Rails.

Defer dropping columns

Since we “know” when all previous migrations ran due to our rich migration logging, we are able to “defer drop” columns.

What this means is that we can guarantee we perform dangerous schema changes after we know that the new code is in place to handle the schema change.

In practice if we wish to drop a column we do not use migrations for it. Instead our db/seed takes care of defer dropping.

These defer drops will happen at least 30 minutes after the particular migration referenced ran (in the next migration cycle), giving us peace of mind that the new application code is in place.

If we wish to rename a column we will create a new column, duplicate the value into the new column, mark the old column readonly using a trigger and defer drop old column.

If we wish to drop or rename a table we follow a similar pattern.

The logic for defer dropping lives in ColumnDropper and TableDropper.

Not trusting ourselves

A big problem with spectacular special snowflake per-application practices is enforcement.

We have great patterns for ensuring safety, however sometimes people forget that we should never drop a column or a table the ActiveRecord migration way.

To ensure we never make the mistake of committing dangerous schema changes into our migrations, we patch the PG gem to disallow certain statements when we run them in the context of a migration.

Want to DROP TABLE? Sorry, an exception will be raised. Want to DROP a column, an exception will be raised.

This makes it impractical to commit highly risky schema changes without following our best practices:

== 20180321015226 DropRandomColumnFromUser: migrating =========================
-- remove_column(:categories, :name)

WARNING
-------------------------------------------------------------------------------------
An attempt was made to drop or rename a column in a migration
SQL used was: 'ALTER TABLE "categories" DROP "name"'
Please use the deferred pattrn using Migration::ColumnDropper in db/seeds to drop
or rename columns.

Note, to minimize disruption use self.ignored_columns = ["column name"] on your
ActiveRecord model, this can be removed 6 months or so later.

This protection is in place to protect us against dropping columns that are currently
in use by live applications.
rake aborted!
StandardError: An error has occurred, this and all later migrations canceled:

Attempt was made to rename or delete column
/home/sam/Source/discourse/db/migrate/20180321015226_drop_random_column_from_user.rb:3:in `up'
Tasks: TOP => db:migrate
(See full trace by running task with --trace)

This logic lives in safe_migrate.rb. Since this is a recent pattern we only enforce it for migrations after a certain date.

Alternatives

Some of what we do is available in gem form and some is not:

Strong Migrations offers enforcement. It also takes care of a bunch of interesting conditions like nudging you to create indexes concurrently in postgres. Enforcement is done via patching active record migrator, meaning that if anyone does stuff with SQL direct it will not be caught.

Zero downtime migrations very similar to strong migrations.

Outrigger allows you to tag migrations. This enables you to amend your deploy process so some migrations run pre-deploy and some run post-deploy. This is the simplest technique for managing migrations in such a way that you can avoid downtimes during deploy.

Handcuffs: very similar to outrigger, define phases for your migrations

What should you do?

Our current pattern for defer dropping columns and tables works for us, but is not yet ideal. Code that is in charge of “seeding” data now is also in charge of amending schema and timing of column drops is not as tightly controlled as it should be.

On the upside, rake db:migrate is all you need to run and it works magically all the time. Regardless of how you are hosted and what version your schema is at.

My recommendation though for what I would consider best practice here is a mixture of a bunch of ideas. All of it belongs in Rails.

Enforcement of best practices belongs in Rails

I think enforcement of safe schema changes should be introduced into ActiveRecord. This is something everyone should be aware of. It is practical to do zero downtime deploys today with schema changes.

class RemoveColumn < ActiveRecord::Migration[7.0]
  def up
     # this should raise an error
     remove_column :posts, :name
  end
end

To make it work, everyone should be forced to add the after_deploy flag to the migration:

class RemoveColumn < ActiveRecord::Migration[7.0]
  after_deploy! # either this, or disable the option globally 
  def up
     # this should still raise if class Post has no ignored_columns: [:name]
     remove_column :posts, :name
  end
end
class RemoveColumn < ActiveRecord::Migration[7.0]
  after_deploy!(force: true)
  def up
     # this should work regardless of ignored_columns
     remove_column :posts, :name
  end
end

I also think the ideal enforcement is via SQL analysis, however it is possible that this is a bit of a can-of-worms at Rails scale. For us it is practical cause we only support one database.

rake db:migrate should continue to work just as it always did.

For backwards compatibility rake db:migrate should run all migrations including after_deploy migrations. Applications who do not care about “zero downtime deploys” should also be allowed to opt out of the safety.

New post and pre migrate rake tasks should be introduced

To run all the application code compatible migrations you would run:

rake db:migrate:pre
# runs all migrations without `after_deploy!`

To run all the destructive operations you would run:

rake db:migrate:post
# runs all migrations with `after_deploy!`

Conclusion

If you are looking to start with “safe” zero downtime deploys today I would recommend:

  1. Amending build process to run pre deploy migrations and post deploy migrations (via Outrigger or Handcuffs)

  2. Introduce an enforcement piece with Strong Migrations

Comments

Anton Katunin 7 months ago
Anton Katunin

How about conditional migrations? I don’t see any tools or mentions of it

For example

class RenameColumn < ActiveRecord::Migration[7.0]
  def up
     add_column :posts, :new_name, :string
    # copy from posts.name to posts.new_name
  end
end
class RemoveColumn < ActiveRecord::Migration[7.0]
  on_next_deploy_after RenameColumn
  # or even 
  when RenameColumn, deployed_for: 1.week

  def up
     remove_column :posts, :name
  end
end

The migration won’t run if condition is not met, so all you need to do is to run rake db:migrate multiple times. Or don’t do anything and wait for the next deploy.

This achieves everything the same but with much smaller interface and changes to the deployment pipeline

Sam Saffron 7 months ago
Sam Saffron

Absolutely, that is why we are not too fast to move our current system to my recommendation which requires changes to the deploy pipeline. At our scale changing the pipeline is something that requires extreme levels of care.

The trouble with our current system and the proposal for deferring migrations based on previous deploys is that you end up having laxer control over schema. When you cough up the pain and amend the pipeline you get to junk columns at a consistent time which is pretty desirable.

That said I do not think this is a zero sum game and one size must fit all.

In some cases deferring drops based on previous migration times works fine, it has served us very well over the past few years. But… if I had a time machine… I would prefer the consistency granted by amending the pipeline to drop objects post deploy.

Mark 7 months ago
Mark

Do you know of any efforts to add metadata like this? It does seem like a nice addition

Sam Saffron 7 months ago
Sam Saffron

I reached out on twitter, but have not heard anything back yet. :sob:

Landon Wilkins 7 months ago
Landon Wilkins

Rubocop for ensuring remove_column migrations are tagged as :post_deploy:

(feels like this belongs in outrigger, will make a commit)

Michael Swart 6 months ago
Michael Swart

We do something very similar.

The environment is different (Microsoft Stack) and the terminology is different. We call them blue-green deployments but with an “aqua” piece for the database.

But everything you wrote sounds familiar and works well.

Sam Saffron 6 months ago
Sam Saffron

I really like that you have a re-runnable migration for the post job. With PG we generally keep migrations in a transaction but then do column dropping in a re-runnable fashion.

I think its super nice to opt-in/opt-out of doing pre and post in transaction vs not cause there is merit to both.


comments powered by Discourse