Sourcegraph monitoring (high level / overview)
Created by: slimsag
It occurs to me that I could better communicate my high-level vision for improved monitoring of Sourcegraph by site admins (and by us on Sourcegraph.com). This is my attempt at doing so.
My hope is to give visibility to @sourcegraph/core-services (since we chatted this morning about e.g. the relevance of including search perf metrics), to @christinaforney as it relates to customers more broadly, and to everyone else who is interested.
Monitoring tools today
If you understand what our monitoring tooling / setup looks like today, skip this section.
Today we use a multitude of monitoring/debugging tools, in various ways and across various deployments:
- Tracing: Provided by Lightstep and Jaeger
  - Example: A specific request of yours is slow or broken; tracing lets you see what happened behind the scenes with that specific request.
- Metrics: Provided by Prometheus
  - Example: "Over the last two months, what was the 95th percentile request latency?" -> Prometheus answers this (see the sketch after this list).
- Dashboards: Provided by Grafana
  - Example: You want to make a good overview of the data in Prometheus.
- Analytics/tracking: Provided by HubSpot
  - Example: How many users complete our activation user flow on Sourcegraph.com?
- Logging: Either no log consumer or Google Stackdriver / whatever the cloud host provides.
- In-app metrics: These are only used for communicating usage stats, user surveys, etc.
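To make the Prometheus example above concrete, here is a minimal sketch (not our actual instrumentation -- the metric name, route, and port are made up) of how a Go service might record request latency so Prometheus can answer that kind of percentile question:

```go
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical metric name, for illustration only.
var requestDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
	Name:    "src_http_request_duration_seconds",
	Help:    "Duration of HTTP requests served by this service.",
	Buckets: prometheus.DefBuckets,
}, []string{"route"})

// instrument records how long each request to a route takes.
func instrument(route string, next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next(w, r)
		requestDuration.WithLabelValues(route).Observe(time.Since(start).Seconds())
	}
}

func main() {
	http.HandleFunc("/search", instrument("search", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("results"))
	}))
	// Prometheus scrapes this endpoint; Grafana dashboards are built on top of the stored data.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```

With data like this in Prometheus, the "95th percentile over two months" question is a histogram_quantile query over the recorded buckets, and a Grafana dashboard is just a saved view of that query.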
The story for most deployments is not a good one:
| Tool | Sourcegraph.com | Kubernetes cluster | Docker container | deploy-sourcegraph-docker |
|---|---|---|---|---|
| Tracing | | Optional (undocumented, broken) | | |
| Metrics | | Optional | | |
| Dashboards | | | | |
| Analytics/tracking | Optional | Optional | Optional | Optional |
| Logging | Google Stackdriver | Up to cloud host | Up to cloud host | Up to cloud host |
| In-app metrics | | | | |
- deploy-sourcegraph-docker only just recently got a good story, and we agreed to standardize on some tooling (e.g. a single tool for tracing instead of varying ones).
- Work needs to be done to take the deploy-sourcegraph-docker niceness and transfer it to our Kubernetes cluster deployments (most of our customers).
  - This would apply to Sourcegraph.com inherently.
- For the Docker container, it is unclear how much we can/want to do, but given that a substantial portion of our customers run it, we likely want to do something, so: https://github.com/sourcegraph/sourcegraph/issues/4259
Monitoring issues today
- Customers want to know when Sourcegraph is having issues.
- When Sourcegraph is having issues, they want to potentially be alerted about them so that someone can look into it or alert us.
- When Sourcegraph is having issues, they want to be able to narrow down the problem and have a good understanding of how bad it is (is everything broken? just some parts? is it just running a little slow? etc.).
How would it look?
Main dashboard
This main dashboard would show you (a rough sketch of the mapping follows this list):
- Sourcegraph is green -> everything's good.
- Sourcegraph is yellow -> something may be up, things may be a little slower than usual -- but it might resolve itself and there is no need to actively investigate.
- Sourcegraph is red -> something is broken. You should rollback, contact support, etc.
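As a rough illustration of the idea (the thresholds and names here are made up, nothing is decided), the main dashboard only needs to map a single overall health number onto a color:

```go
package main

import "fmt"

// Status is the top-level indicator a site admin sees on the main dashboard.
type Status string

const (
	Green  Status = "green"  // everything's good
	Yellow Status = "yellow" // something may be up; no need to actively investigate yet
	Red    Status = "red"    // something is broken; rollback, contact support, etc.
)

// overallStatus maps a 0-100% health value to a dashboard color.
// The thresholds are illustrative, not decided.
func overallStatus(healthPercent float64) Status {
	switch {
	case healthPercent >= 90:
		return Green
	case healthPercent >= 70:
		return Yellow
	default:
		return Red
	}
}

func main() {
	fmt.Println(overallStatus(97)) // green
	fmt.Println(overallStatus(75)) // yellow
	fmt.Println(overallStatus(40)) // red
}
```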
Narrowing
Narrowing down the above, you can see the health of each major service we have:
- frontend
- searcher
- gitserver
- repo-updater
- query-runner
- etc.
Maybe something like 0% to 100% healthy as a graph.
Accompanying documentation would describe the importance of each of those, e.g. that query-runner having issues means saved searches may not work but other things are OK.
- In 3.5, this health will be computed based on some service-specific metrics, but mostly just based on whether or not most requests to the service are failing. This will give us good high-level data, but not service/team-specific data.
- Any service above can define new Prometheus metrics (of course) that play into this health metric. The expectation is that long term these are based on very service- (and team-) specific metrics (see the sketch after this list):
  - repo-updater health goes down if we're having trouble updating or cloning repos, or hitting code host API limits.
  - searcher health goes down if searches are super slow for some reason.
  - gitserver health goes down if we're running out of disk space, or it can't connect to the code host.
  - The searcher team adds some caching somewhere; to monitor it, we add search-specific metrics, and they contribute to searcher's overall health, etc.
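Here is a minimal sketch of how a service could roll its own signals up into the single health number described above. The metric name, the weights, and the example gitserver signals are all hypothetical; the point is just that each team defines service-specific inputs while exporting one number that the main dashboard and alerting can consume without knowing what the inputs mean:

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Hypothetical gauge each service would export; the name is illustrative.
var health = promauto.NewGauge(prometheus.GaugeOpts{
	Name: "src_service_health_percent",
	Help: "0-100 health of this service, derived from service-specific signals.",
})

// signal is one service-specific input to health, e.g. for gitserver:
// free disk space, code host connectivity, clone success rate.
type signal struct {
	score  float64 // 0.0 (failing) to 1.0 (healthy)
	weight float64 // how much this signal matters, as decided by the owning team
}

// computeHealth combines weighted signals into a single 0-100 value.
func computeHealth(signals []signal) float64 {
	var sum, weights float64
	for _, s := range signals {
		sum += s.score * s.weight
		weights += s.weight
	}
	if weights == 0 {
		return 100
	}
	return 100 * sum / weights
}

func main() {
	// Illustrative gitserver-style signals: plenty of disk, code host reachable,
	// but some recent clone failures. (Serving /metrics is omitted here; see the
	// earlier sketch.)
	health.Set(computeHealth([]signal{
		{score: 0.9, weight: 1}, // disk space
		{score: 1.0, weight: 1}, // code host connectivity
		{score: 0.5, weight: 2}, // clone success rate
	}))
}
```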
The point of this is that we can get (and continue to be) really specific with our metrics while still keeping them at a high level: capable of being alerted on, and legible to site admins who don't know what a specific metric means. For example, I can understand that searcher is operating at 50% health, but as a site admin I can't understand what "15 failures of searcher zip archive cache fetch requests per minute" means -- obviously it's bad, but how bad? (We get to define that!)
The only thing we have to be cautious of is defining these such that they are not noisy. After all, we're talking about e.g. customers being alerted that "Sourcegraph is broken!", so when we say that we want to be positive it really is (and this makes it super important for us to use these same metrics for alerting on our own deployments).
Even more specific
So you're a site admin, Sourcegraph is yellow or red, and it says searcher's health is 50% -- but why?
This is where the service-specific dashboards come in. Each service/team should display the most important things there (we're pretty close to that already, we just have a few missing metrics and dashboards). These are what customers would e.g. send us a screenshot of when something is going wrong.