Atomic migrations
Created by: chrismwendt
tl;dr this makes migrations atomic, which ensures that successful migrations always leave the DB in a clean state.
Prior to this change, it was possible for the frontend to disconnect from Postgres halfway through a migration (e.g. when the migration takes >5 minutes and k8s kills the frontend because it's not responding), leaving the DB marked as being in a dirty state despite the migration eventually succeeding.
After this change, each migration will set dirty=false
within the transaction to guarantee that when a migration succeeds the DB will be in a clean state.
More context:
- Motivating prod error https://sourcegraph.slack.com/archives/C0J618TTM/p1579282575001900
- Investigation https://sourcegraph.slack.com/archives/C07KZF47K/p1579287343014500
- Filed an issue on golang-migrate https://github.com/golang-migrate/migrate/issues/325
- Submitted a PR to golang-migrate https://github.com/golang-migrate/migrate/pull/326, probably won't get merged soon
- Problem statement and alternatives https://sourcegraph.slack.com/archives/C07KZF47K/p1579546621063800
A few alternatives:
- Accept that migrations can fail in this way and document how to fix the DB when this happens
- Use my patched version of golang-migrate until a satisfactory solution has been upstreamed
- Mitigation: increase the timeout for DB migrations
- Don't merge slow migrations into master, somehow check perf before merging
- Workaround 1: SET version=..., dirty=false at the end of each migration in our migrations/ directory
- Workaround 2 (implemented in this PR): splice that statement into the migration at runtime by intercepting the migration loader
- Workaround 3: inline the ~40 lines of relevant code from golang-migrate and patch that in our source
3 teammates supported workaround 3:
I vote for workaround 3, since we’re postgres only and the value-add from that lib is not that high. We can also do more postgres specific things if we go workaround 3 route and do stuff more tailored towards our exact needs.
Upon further consideration, it became more clear to me that workaround 2 would be better:
@eric and I chatted about this in depth and realized that workaround 3 is not very desirable because it would not play well with the
migrate
command line tool.Workaround 2 is more compatible and doesn't change the control flow as much.
Given that, we plan to proceed with workaround 2.
This PR is based on https://github.com/sourcegraph/sourcegraph/pull/7905, which is expected to be merged before this PR.