Skip to content

migrations: revert migration causing deadlocks in some deployments

Administrator requested to merge sg/fix-3.19 into main

Created by: slimsag

This effectively reverts https://github.com/sourcegraph/sourcegraph/pull/12591 which contains a migration that runs outside of a transaction. This has led to a race condition in which deployments with multiple frontend instances (pure-docker, kubernetes, etc.) can be entirely prevented from running the new version of Sourcegraph either as an upgrade from v3.18 or from scratch.

This behavior only exhibited in testing the pure-docker deployment after we released v3.19.0 for server and Kubernetes deployments, due to the fact that the testing for pure-docker uses a slower VM than our other tests.

Effect of reverting this migration

Some searches on Sourcegraph.com will now be 4-40x slower until someone follows up and properly adds back this migration using a transaction and/or adjusts our migration infrastructure so this type of deadlock isn't possible.

Timeline and why reverting (not fixing) makes sense

  • @beyang and @rvantonder had a 1:1 in which Rijnard had an idea for improving search performance, they enacted this on Sourcegraph.com and found a 4-40x improvement in some search queries (great!) but did so not using a regular migration / by modifying the DB directly. https://twitter.com/beyang/status/1287826896281989120
  • @slimsag opened a PR to document manual migrations like this that we are running to avoid issues and divergences in deployments in the future. https://github.com/sourcegraph/about/pull/1314
  • @efritz opened a PR to actually add the migration in a way that should've worked and should've been done initially https://github.com/sourcegraph/sourcegraph/pull/12591
  • @uwe and @marekweb release v3.19.0 successfully and without issues.
  • @uwe was performing testing of pure-docker before releasing Docker Compose (which comes immediately after releaseing v3.19.0 for Kubernetes / server) and noticed the pure-docker Vagrant tests failing.
  • @slimsag and @uwe chose to revert this migration to resolve the issue.

We did not have this migration before, so I don't see fixing it as critical. I will defer that to someone who can run with it properly and test this issue does not re-occur.

Details on the deadlock

With just two frontend replicas starting a fresh v3.19.0 deployment from scratch, the frontend replicas both fail to start indefinitely with lock output shown here:

https://gist.github.com/slimsag/160e969a05608c5f09dc1d8e60cd5fa1

When a new deployment is started, all migrations we have run in order (from oldest to newest migration) in order to bring the DB schema up-to-date with where a new Sourcegraph deployment's DB schema should be.

The migration tool we use relies on transactions and the premise that given multiple frontend instances all trying to perform the migration at once that one will "win" and the others will "lose" and ultimately restart to find the DB does not need migration anymore. i.e.:

  • frontend-1: "can't migrate, locked" -> "restart" -> "DB does not need migration" -> "started"
  • frontend-2: "acquired lock, migrating" -> "migrated" -> "started"

But the up migration in https://github.com/sourcegraph/sourcegraph/pull/12591 cannot be run in a transaction because it is a concurrently created index (i.e. we want to create the index while still serving the table).

This introduced the possibility that two frontends are blocked on eachother acquiring the same DB table locks:

ERROR: Failed to migrate the DB. Please contact [email protected] for further assistance: migration failed: deadlock detected, Process 60 waits for ShareLock on virtual transaction 4/6; blocked by process 61.
Process 61 waits for ExclusiveLock on advisory lock [16384,0,3651352207,1]; blocked by process 60. in line 0: -- Note: CREATE INDEX CONCURRENTLY cannot run inside a transaction block
CREATE INDEX CONCURRENTLY IF NOT EXISTS repo_name_idx ON public.repo USING btree (lower(name::text) COLLATE pg_catalog."C");
 (details: pq: deadlock detected)

I believe the failure is something like:

frontend-1: begins migrating
  -> tries to begin CREATE INDEX CONCURRENTLY without a transaction, blocked on ShareLock for `repo` table
    -> deadlock!
  -> frontend-2 begins migrating, blocked on ExclusiveLock for `repo` table
    -> deadlock!

Merge request reports

Loading