Managing db schema changes without downtime
At Discourse we have always been huge fans of continuous deployment. Every commit we make heads to our continuous integration test suite. If all the tests pass (UI, unit, integration, smoke) we automatically deploy the latest version of our code to https://meta.discourse.org.
This pattern and practice we follow allows the thousands of self-installers out there to safely upgrade to the tests-passed version whenever they feel like it.
Because we deploy so often we need to take extra care not to have any outages during deployments. One of the most common reasons for outages during application deployment is database schema changes.
The problem with schema changes
Our current deployment mechanism roughly goes as follows:
- Migrate database to new schema
- Bundle up application into a single docker image
- Push to registry
- Spin down old instance, pull new instance, spin up new instance (and repeat)
If we ever create an incompatible database schema we risk breaking all the old application instances running older versions of our code. In practice, this can lead to tens of minutes of outage!
In ActiveRecord the situation is particularly dire because in production the database schema is cached, and any schema change that drops or renames columns very quickly risks breaking every query to the affected model, raising invalid schema exceptions.
Over the years we have introduced various patterns to overcome this problem and enable us to deploy schema changes safely, minimizing outages.
Tracking rich information about migrations
ActiveRecord has a table called schema_migrations where it stores information about migrations that ran.
Unfortunately the amount of data stored in this table is extremely limited; in fact it boils down to:
connection.create_table(table_name, id: false) do |t|
  t.string :version, version_options
end
The table has a lonely column storing the “version” of migrations that ran.
- It does not store when the migration ran
- It does not store how long it took the migration to run
- It has nothing about the version of Rails that was running when the migration ran
This lack of information, especially not knowing when migrations ran, makes it hard to build clean systems for dealing with schema changes. Additionally, debugging strange and wonderful issues with migrations is very hard without rich information.
Discourse monkey patches Rails to log rich information about migrations:
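A minimal sketch of the idea (the class and field names here are my own illustration, not Discourse's actual patch, which hooks into the ActiveRecord migrator): record when each migration ran, how long it took, and the Rails version, rather than just the version string.

```ruby
require "benchmark"

# Illustrative sketch only; not Discourse's actual implementation.
class RichMigrationLogger
  Record = Struct.new(:version, :name, :rails_version, :duration_seconds, :created_at)

  def self.records
    @records ||= []
  end

  # Wrap a migration run, timing it and capturing metadata.
  def self.log(version:, name:, rails_version:)
    result = nil
    elapsed = Benchmark.realtime { result = yield }
    records << Record.new(version, name, rails_version, elapsed, Time.now)
    result
  end
end
```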
Our patch provides us with very rich detail surrounding all the migration circumstances. This really should be in Rails.
Defer dropping columns
Since we “know” when all previous migrations ran due to our rich migration logging, we are able to “defer drop” columns.
What this means is that we can guarantee we perform dangerous schema changes after we know that the new code is in place to handle the schema change.
In practice, if we wish to drop a column we do not use a migration for it. Instead, our db/seeds takes care of defer dropping.
These defer drops will happen at least 30 minutes after the particular migration referenced ran (in the next migration cycle), giving us peace of mind that the new application code is in place.
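The timing rule above can be sketched as follows (a simplified illustration, not the actual ColumnDropper code):

```ruby
# Simplified illustration of the defer-drop timing check; the real logic
# lives in Discourse's ColumnDropper, this is not its actual code.
DEFER_DROP_DELAY = 30 * 60 # seconds; drop at least 30 minutes after the migration ran

# migrated_at comes from the rich migration log: when the referenced
# migration ran on this particular database.
def safe_to_drop?(migrated_at, now: Time.now)
  now - migrated_at >= DEFER_DROP_DELAY
end
```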
If we wish to rename a column, we create a new column, duplicate the value into the new column, mark the old column readonly using a trigger, and defer drop the old column.
If we wish to drop or rename a table we follow a similar pattern.
The logic for defer dropping lives in ColumnDropper and TableDropper.
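For the rename case, the readonly-trigger step might look roughly like this (my own sketch of the technique; the function and trigger names are assumptions, not Discourse's):

```ruby
# Sketch of the readonly-trigger step in the rename pattern: build
# Postgres DDL that raises on any write to the old column. Hypothetical
# naming scheme, not Discourse's actual code.
def readonly_trigger_sql(table, column)
  <<~SQL
    CREATE OR REPLACE FUNCTION #{table}_#{column}_readonly() RETURNS trigger AS $$
    BEGIN
      RAISE EXCEPTION 'column #{column} on #{table} is readonly';
    END
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER #{table}_#{column}_readonly
    BEFORE UPDATE OF #{column} ON #{table}
    FOR EACH ROW EXECUTE PROCEDURE #{table}_#{column}_readonly();
  SQL
end
```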
Not trusting ourselves
A big problem with spectacular special snowflake per-application practices is enforcement.
We have great patterns for ensuring safety, however sometimes people forget that we should never drop a column or a table the ActiveRecord migration way.
To ensure we never make the mistake of committing dangerous schema changes into our migrations, we patch the PG gem to disallow certain statements when we run them in the context of a migration.
Want to DROP TABLE? Sorry, an exception will be raised. Want to DROP a column? An exception will be raised.
This makes it impractical to commit highly risky schema changes without following our best practices:
== 20180321015226 DropRandomColumnFromUser: migrating =========================
-- remove_column(:categories, :name)
WARNING
-------------------------------------------------------------------------------------
An attempt was made to drop or rename a column in a migration
SQL used was: 'ALTER TABLE "categories" DROP "name"'
Please use the deferred pattern using Migration::ColumnDropper in db/seeds to drop
or rename columns.
Note, to minimize disruption use self.ignored_columns = ["column name"] on your
ActiveRecord model, this can be removed 6 months or so later.
This protection is in place to protect us against dropping columns that are currently
in use by live applications.
rake aborted!
StandardError: An error has occurred, this and all later migrations canceled:
Attempt was made to rename or delete column
/home/sam/Source/discourse/db/migrate/20180321015226_drop_random_column_from_user.rb:3:in `up'
Tasks: TOP => db:migrate
(See full trace by running task with --trace)
This logic lives in safe_migrate.rb. Since this is a recent pattern we only enforce it for migrations after a certain date.
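A rough sketch of the kind of check involved (my own illustration; the real safe_migrate.rb patches the PG gem itself and only applies inside migrations):

```ruby
# Illustrative guard, not the real safe_migrate.rb: raise if SQL issued
# during a migration would drop or rename a column or table.
class UnsafeMigrationError < StandardError; end

DANGEROUS_SQL = [
  /\bDROP\s+TABLE\b/i,
  /\bALTER\s+TABLE\b.*\bDROP\b/i,
  /\bALTER\s+TABLE\b.*\bRENAME\b/i
].freeze

def guard_migration_sql!(sql)
  if DANGEROUS_SQL.any? { |pattern| sql.match?(pattern) }
    raise UnsafeMigrationError, "Attempt was made to rename or delete column: #{sql}"
  end
  sql
end
```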
Alternatives
Some of what we do is available in gem form and some is not:
Strong Migrations offers enforcement. It also takes care of a bunch of interesting conditions, like nudging you to create indexes concurrently in Postgres. Enforcement is done via patching the ActiveRecord migrator, meaning that if anyone executes SQL directly it will not be caught.
Zero Downtime Migrations is very similar to Strong Migrations.
Outrigger allows you to tag migrations. This enables you to amend your deploy process so some migrations run pre-deploy and some run post-deploy. This is the simplest technique for managing migrations in such a way that you can avoid downtimes during deploy.
Handcuffs is very similar to Outrigger: you define phases for your migrations.
What should you do?
Our current pattern for defer dropping columns and tables works for us, but it is not yet ideal. Code that is in charge of “seeding” data is now also in charge of amending the schema, and the timing of column drops is not as tightly controlled as it should be.
On the upside, rake db:migrate is all you need to run, and it works magically all the time, regardless of how you are hosted and what version your schema is at.
My recommendation, though, for what I would consider best practice here is a mixture of several ideas. All of it belongs in Rails.
Enforcement of best practices belongs in Rails
I think enforcement of safe schema changes should be introduced into ActiveRecord. This is something everyone should be aware of. It is practical to do zero downtime deploys today with schema changes.
class RemoveColumn < ActiveRecord::Migration[7.0]
  def up
    # this should raise an error
    remove_column :posts, :name
  end
end
To make it work, everyone should be forced to add the after_deploy flag to the migration:
class RemoveColumn < ActiveRecord::Migration[7.0]
  after_deploy! # either this, or disable the option globally

  def up
    # this should still raise if class Post has no ignored_columns: [:name]
    remove_column :posts, :name
  end
end
class RemoveColumn < ActiveRecord::Migration[7.0]
  after_deploy!(force: true)

  def up
    # this should work regardless of ignored_columns
    remove_column :posts, :name
  end
end
I also think the ideal enforcement is via SQL analysis; however, it is possible that this is a bit of a can of worms at Rails scale. For us it is practical because we only support one database.
rake db:migrate should continue to work just as it always did.
For backwards compatibility, rake db:migrate should run all migrations, including after_deploy migrations. Applications that do not care about “zero downtime deploys” should also be allowed to opt out of the safety.
New pre and post migrate rake tasks should be introduced
To run all the application-code-compatible migrations you would run:
rake db:migrate:pre
# runs all migrations without `after_deploy!`
To run all the destructive operations you would run:
rake db:migrate:post
# runs all migrations with `after_deploy!`
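Under these proposed tasks, the split could be as simple as partitioning on the flag (a hypothetical sketch; after_deploy? is not a real ActiveRecord API today):

```ruby
# Hypothetical sketch of the proposed pre/post split; the after_deploy
# flag is an assumption from the proposal above, not a real Rails API.
PendingMigration = Struct.new(:version, :after_deploy) do
  def after_deploy?
    !!after_deploy
  end
end

def migrations_for(phase, migrations)
  case phase
  when :pre  then migrations.reject(&:after_deploy?) # safe, code-compatible changes
  when :post then migrations.select(&:after_deploy?) # destructive changes
  else raise ArgumentError, "unknown phase: #{phase}"
  end
end
```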
Conclusion
If you are looking to start with “safe” zero downtime deploys today I would recommend:
- Amending your build process to run pre-deploy migrations and post-deploy migrations (via Outrigger or Handcuffs)
- Introducing an enforcement piece with Strong Migrations
How about conditional migrations? I don’t see any tools or mentions of them.
For example: the migration won’t run if a condition is not met, so all you need to do is run rake db:migrate multiple times, or do nothing and wait for the next deploy. This achieves the same thing, but with a much smaller interface and fewer changes to the deployment pipeline.
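A sketch of how such a conditional migration runner could behave (hypothetical; this is not an existing tool or Rails feature):

```ruby
# Hypothetical conditional migration: the body only runs once the
# condition holds, so repeated `rake db:migrate` runs eventually apply it.
class ConditionalMigration
  def initialize(condition, &body)
    @condition = condition
    @body = body
    @applied = false
  end

  def applied?
    @applied
  end

  # Called on every migrate run; a no-op until the condition is met.
  def run
    return if @applied || !@condition.call
    @body.call
    @applied = true
  end
end
```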