Skip to content

Atomic migrations

Administrator requested to merge atomic-migrations into clean-up-migrate-db

Created by: chrismwendt

tl;dr this makes migrations atomic, which ensures that successful migrations always leave the DB in a clean state.

Prior to this change, it was possible for the frontend to disconnect from Postgres halfway through a migration (e.g. when the migration takes >5 minutes and k8s kills the frontend because it's not responding), leaving the DB marked as being in a dirty state despite the migration eventually succeeding.

After this change, each migration will set dirty=false within the transaction to guarantee that when a migration succeeds the DB will be in a clean state.

More context:

A few alternatives:

  • Accept that migrations can fail in this way and document how to fix the DB when this happens
  • Use my patched version of golang-migrate until a satisfactory solution has been upstreamed
  • Mitigation: increase the timeout for DB migrations
  • Don't merge slow migrations into master, somehow check perf before merging
  • Workaround 1: SET version=..., dirty=false at the end of each migration in our migrations/ directory
  • Workaround 2 (implemented in this PR): splice that statement into the migration at runtime by intercepting the migration loader
  • Workaround 3: inline the ~40 lines of relevant code from golang-migrate and patch that in our source

3 teammates supported workaround 3:

I vote for workaround 3, since we’re postgres only and the value-add from that lib is not that high. We can also do more postgres specific things if we go workaround 3 route and do stuff more tailored towards our exact needs.

Upon further consideration, it became more clear to me that workaround 2 would be better:

@eric and I chatted about this in depth and realized that workaround 3 is not very desirable because it would not play well with the migrate command line tool.

Workaround 2 is more compatible and doesn't change the control flow as much.

Given that, we plan to proceed with workaround 2.

This PR is based on https://github.com/sourcegraph/sourcegraph/pull/7905, which is expected to be merged before this PR.

Merge request reports

Loading