RFC 174: HA Postgres for Sourcegraph Cloud
Created by: tsenart
Context
This is an umbrella issue to capture work and ideas around RFC 174: HA Postgres for Sourcegraph Cloud.
TODO
Testing
- [ ] Set up a Cloud SQL Postgres deployment
- [ ] Configure access to the instance via Cloud SQL Proxy
- [ ] Back up the current dotcom Postgres database as a SQL dump
- [ ] Import that data into the Cloud SQL instance
- [ ] Configure the preferred maintenance window (bigger question: what is the best time for Sourcegraph.com to undergo maintenance, based on where our users are located?)
- [ ] Manually test connectivity from a pod in the prod namespace to the Cloud SQL instance Service with psql and Cloud SQL Proxy
- [ ] Manually test that failover works smoothly (i.e. Sourcegraph keeps working)
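The failover test can be made less subjective by probing the database on a fixed interval while the failover is triggered and recording how long it was unreachable. The following is a minimal sketch, not an agreed-upon tool: it assumes Cloud SQL Proxy is listening on 127.0.0.1:5432 inside the pod, and the function names, the `sg` user and database, and the polling parameters are all hypothetical.

```python
import subprocess
import time
from typing import Callable, List, Tuple


def psql_probe(host: str = "127.0.0.1", port: int = 5432,
               user: str = "sg", database: str = "sg") -> bool:
    """Return True if `SELECT 1` succeeds through the proxy.

    Assumption: host/port point at a Cloud SQL Proxy sidecar, and the
    password comes from PGPASSWORD or ~/.pgpass.
    """
    cmd = ["psql", "-h", host, "-p", str(port), "-U", user,
           "-d", database, "-At", "-c", "SELECT 1"]
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=5)
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False
    return result.returncode == 0


def measure_outages(probe: Callable[[], bool], attempts: int,
                    interval_s: float,
                    clock: Callable[[], float] = time.monotonic,
                    sleep: Callable[[float], None] = time.sleep
                    ) -> List[Tuple[float, float]]:
    """Poll `probe` and return (start, end) pairs for each outage seen."""
    outages: List[Tuple[float, float]] = []
    down_since = None
    for _ in range(attempts):
        now = clock()
        try:
            ok = bool(probe())
        except Exception:
            ok = False  # treat any probe error as "database unreachable"
        if not ok and down_since is None:
            down_since = now            # outage begins
        elif ok and down_since is not None:
            outages.append((down_since, now))  # outage ends
            down_since = None
        sleep(interval_s)
    if down_since is not None:
        # Still down when polling stopped; close the window at the current time.
        outages.append((down_since, clock()))
    return outages
```

Running `measure_outages(psql_probe, attempts=600, interval_s=1.0)` from a pod in the prod namespace while triggering a manual failover would give a rough upper bound on the connection gap. It says nothing about whether Sourcegraph itself keeps working, which still needs to be checked by hand.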
Rollout
- Use Terraform to create the instance
- Use a gradual rollout strategy to update pods with Cloud SQL Proxy to minimize disruption
- Make the production SQL instance read-only
- Export data from the pgsql pods -> GCS bucket
- Import data into Cloud SQL
- Roll out updated PG* env vars to change the database
- Take on-call for 2 hrs to ensure the rollout was successful on Sourcegraph.com
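The data-migration steps above can be sketched as an ordered list of commands plus the env vars the pods would switch to. This is an illustration under assumptions rather than the actual runbook: the database, bucket, and instance names are placeholders, the helper names are hypothetical, and it assumes Cloud SQL Proxy runs as a sidecar on 127.0.0.1:5432. Note that `default_transaction_read_only` only affects new sessions, so existing connections would need to be recycled for the read-only step to take full effect.

```python
from typing import Dict, List


def migration_commands(source_db: str, dump_file: str, bucket: str,
                       instance: str, target_db: str) -> List[List[str]]:
    """Return the export/import commands in the order they would be run."""
    gcs_uri = f"gs://{bucket}/{dump_file}"
    return [
        # 1. Make the source database read-only for new sessions.
        ["psql", "-d", source_db, "-c",
         f'ALTER DATABASE "{source_db}" SET default_transaction_read_only = on'],
        # 2. Export a plain SQL dump without ownership or ACL statements.
        ["pg_dump", "--no-owner", "--no-acl", "--format=plain",
         f"--file={dump_file}", source_db],
        # 3. Copy the dump to a GCS bucket.
        ["gsutil", "cp", dump_file, gcs_uri],
        # 4. Import the dump into the Cloud SQL instance.
        ["gcloud", "sql", "import", "sql", instance, gcs_uri,
         f"--database={target_db}"],
    ]


def proxy_pg_env(database: str, user: str,
                 host: str = "127.0.0.1", port: int = 5432) -> Dict[str, str]:
    """PG* env vars pointing a pod at a local Cloud SQL Proxy (assumed sidecar)."""
    return {
        "PGHOST": host,
        "PGPORT": str(port),
        "PGDATABASE": database,
        "PGUSER": user,
        # The proxy encrypts traffic to Cloud SQL, so the hop to the
        # local proxy itself is typically plain TCP.
        "PGSSLMODE": "disable",
    }
```

Returning the commands as data rather than executing them makes it easy to review or dry-run the sequence before touching production.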
Cleanup
- [ ] Remove the pgsql deployment from Sourcegraph.com after Cloud SQL has been deployed for 48 hours
- [ ] Ensure PGTUNE values are migrated as well
- [ ] Remove the persistent disk in GCP
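One way to check that the PGTUNE values carried over is to compare `pg_settings` on the old pgsql deployment against the Cloud SQL instance before removing anything. The sketch below assumes each side was captured with `psql -At -c "SELECT name, setting FROM pg_settings"`, which prints `name|setting` lines. The list of tuned parameters is a guess at what PGTune typically adjusts, not the actual set used on Sourcegraph.com, and the function names are hypothetical.

```python
from typing import Dict, Iterable, Optional, Tuple

# Assumption: parameters commonly adjusted by PGTune; replace with the real list.
TUNED_SETTINGS = (
    "max_connections",
    "shared_buffers",
    "effective_cache_size",
    "work_mem",
    "maintenance_work_mem",
)


def parse_settings(psql_output: str) -> Dict[str, str]:
    """Parse unaligned `name|setting` lines into a dict."""
    settings: Dict[str, str] = {}
    for line in psql_output.splitlines():
        if "|" not in line:
            continue  # skip blank or malformed lines
        name, value = line.split("|", 1)
        settings[name.strip()] = value.strip()
    return settings


def diff_settings(old: Dict[str, str], new: Dict[str, str],
                  names: Iterable[str] = TUNED_SETTINGS
                  ) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {name: (old_value, new_value)} for settings that differ."""
    diffs: Dict[str, Tuple[Optional[str], Optional[str]]] = {}
    for name in names:
        if old.get(name) != new.get(name):
            diffs[name] = (old.get(name), new.get(name))
    return diffs
```

An empty result from `diff_settings` would suggest the tuned values match. Any differences would need to be applied on the Cloud SQL side as database flags, and Cloud SQL may not allow every parameter to be changed.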