Prometheus metrics for when inter-service communication is failing

Created by: slimsag

This is for "Sourcegraph monitoring v0"

Our services talk to each other internally, but we currently have no Prometheus metrics that tell us when this communication is failing. Historically this has not been much of a problem for us because:

  1. Our Kubernetes deployments are preconfigured such that all services can talk to each other, so we don't usually need to worry about connectivity between them.
  2. We historically haven't run into many issues of this kind.

However, two recent developments make this valuable:

  1. We've recently run into cases in our dogfood and customer environments where this would have helped.
    • dogfood: Searcher was overloaded and became inaccessible as a result (the same thing could happen in any prod or customer environment, be it searcher or another critical service like gitserver).
    • https://app.hubspot.com/contacts/2762526/company/407948923: repo-updater was "running" by all observable metrics (docker ps, etc.) but was actually deadlocked on a stuck migration, which went unnoticed for multiple days until users reported issues.
  2. With deploy-sourcegraph-docker, which https://app.hubspot.com/contacts/2762526/company/407948923 uses, it is easy to accidentally misconfigure the service connection env vars since they are manually specified. As we scale out deployments like this to multiple nodes, knowing that services are connected properly becomes very important.

From a technical POV, any service that talks to another service should also have a ping endpoint, along with a goroutine that periodically pings the services it depends on and a Prometheus metric that records the result.