RFC 174: HA Postgres for Sourcegraph Cloud

Created by: tsenart

Context

This is an umbrella issue to capture work and ideas around RFC 174: HA Postgres for Sourcegraph Cloud.

TODO

Testing

  • Set up Cloud SQL Postgres deployment
  • Configure access to the instance via Cloud SQL Proxy
  • Back up the current dotcom Postgres database as a SQL dump
  • Import that data into the Cloud SQL instance
  • Configure preferred maintenance window (bigger question: what is the best time for Sourcegraph.com to undergo maintenance, based on where our users are located?)
  • Manually test connectivity from a pod in the prod namespace to the Cloud SQL instance Service with psql and Cloud SQL Proxy.
  • Manually test that failover works smoothly (i.e. Sourcegraph keeps working)
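
A rough sketch of how the manual failover test could be measured rather than eyeballed: run a trivial query through the Cloud SQL Proxy on a fixed interval while the failover is triggered, then report the longest stretch of failed probes. The proxy address (127.0.0.1:5432), the attempt count, and the helper names (`psql_probe`, `watch`, `longest_outage`) are assumptions for illustration, not part of the RFC. Credentials are expected to come from the usual PG* env vars.

```python
import subprocess
import time
from typing import Callable, List, Optional, Tuple

Sample = Tuple[float, bool]  # (timestamp in seconds, probe succeeded)


def psql_probe(host: str = "127.0.0.1", port: int = 5432, timeout: float = 5.0) -> bool:
    """Run `SELECT 1` via psql against the (assumed) local proxy address."""
    cmd = ["psql", "-h", host, "-p", str(port), "-w", "-c", "SELECT 1"]
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout)
    except (subprocess.TimeoutExpired, FileNotFoundError):
        # A hung query or a missing psql binary both count as a failed probe.
        return False
    return result.returncode == 0


def watch(
    probe: Callable[[], bool],
    attempts: int,
    interval: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Sample]:
    """Call `probe` `attempts` times, `interval` seconds apart, and record results."""
    samples: List[Sample] = []
    for _ in range(attempts):
        samples.append((clock(), probe()))
        sleep(interval)
    return samples


def longest_outage(samples: List[Sample]) -> float:
    """Longest span in seconds between a first failed probe and the next success.

    If the run ends while probes are still failing, the span is measured up to
    the last sample.
    """
    longest = 0.0
    down_since: Optional[float] = None
    for ts, ok in samples:
        if not ok and down_since is None:
            down_since = ts
        elif ok and down_since is not None:
            longest = max(longest, ts - down_since)
            down_since = None
    if down_since is not None:
        longest = max(longest, samples[-1][0] - down_since)
    return longest
```

For example, `longest_outage(watch(psql_probe, attempts=300, interval=1.0))` run from a pod in the prod namespace would give an approximate upper bound on how long queries failed during the failover. It only measures database reachability, so checking that Sourcegraph itself keeps working still needs a manual pass.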

Rollout

  • Use Terraform to create the instance
  • Use a gradual rollout strategy to update pods with Cloud SQL Proxy to minimize disruption
  • Make the production SQL instance read-only
  • Export data from the pgsql pods to a GCS bucket
  • Import data into Cloud SQL
  • Roll out updated PG* env vars to point services at the Cloud SQL database
  • Take on-call for 2 hrs to ensure the rollout was successful on Sourcegraph.com
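
To make the PG* env var change reviewable before it ships, something like the sketch below could list exactly which variables differ between the current pod environment and the proxy-backed target. The variable names are the standard libpq ones. The values (database and user names, the `pgsql` host, `PGSSLMODE=disable` on the assumption that the proxy sidecar handles encryption) and the helper names are hypothetical and would need to match the real deployment.

```python
from typing import Dict, Optional, Tuple


def proxy_pg_env(
    database: str,
    user: str,
    host: str = "127.0.0.1",  # assumed: Cloud SQL Proxy runs as a sidecar on localhost
    port: int = 5432,
) -> Dict[str, str]:
    """Build the libpq env vars for connecting through a local proxy."""
    return {
        "PGHOST": host,
        "PGPORT": str(port),
        "PGDATABASE": database,
        "PGUSER": user,
        # Assumption: traffic to the local proxy is plaintext; the proxy
        # encrypts the hop to Cloud SQL.
        "PGSSLMODE": "disable",
    }


def changed_vars(
    current: Dict[str, str], target: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], str]]:
    """Return {name: (old, new)} for every target var that differs from current."""
    return {
        name: (current.get(name), new)
        for name, new in target.items()
        if current.get(name) != new
    }
```

With a hypothetical current environment of `{"PGHOST": "pgsql", "PGPORT": "5432"}`, `changed_vars(current, proxy_pg_env("sg", "sg"))` reports `PGHOST`, `PGDATABASE`, `PGUSER`, and `PGSSLMODE` as changing and leaves `PGPORT` out, which is the diff the rollout should produce and nothing more.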

Cleanup

  • Remove the pgsql deployment from Sourcegraph.com once Cloud SQL has been serving production for 48 hours
  • Ensure PGTUNE values are migrated as well
  • Remove persistent disk in GCP