blackbox exporter & site24x7 next steps

Created by: davejrt

Summary

Historically we have used site24x7 to monitor sourcegraph.com. However, it has also been the source of some ephemeral alerts that have been cause for concern and have reduced faith in its reliability (see here). In an effort to reduce our dependency on site24x7 and to improve both our on-call experience and our monitoring reliability, blackbox exporter was deployed. At the time of raising this issue, each endpoint configured in site24x7 is replicated in blackbox exporter, and each is configured to alert if a 200 HTTP status code is not received for 5 minutes.
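
As a rough illustration of the alert condition described above, a Prometheus alerting rule could look something like the following. This is a minimal sketch and not our deployed config: the group name, alert name, and severity label are assumptions. `probe_http_status_code` is the gauge exported by blackbox exporter's HTTP prober; it stays at 0 when no response is received at all, so that case matches the expression too.

```yaml
# Sketch only: group name, alert name and labels are assumptions.
groups:
  - name: blackbox
    rules:
      - alert: EndpointNot200
        # Status code of the last probe; 0 if the probe got no response.
        expr: probe_http_status_code != 200
        # Only fire once the condition has held continuously for 5 minutes.
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} has not returned HTTP 200 for 5 minutes"
```

The `for: 5m` clause is what stops a single failed probe from paging anyone: the alert sits in a pending state until the expression has been true for the full window.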

Site24x7

The good:

By default we get a bunch of useful metrics out of the box (response time broken down into DNS resolution, download, SSL, etc.), without having to configure anything beyond the endpoint itself.
The UI is relatively simple to use.
The summary of outages and the RCA image are sometimes handy for seeing where things are breaking down.
Testing from multiple locations
Cost?

The bad:

We only get monitoring up to Cloudflare, which is to say that there's no visibility of external vs internal issues in the same platform. As a result, it's hard to know whether an issue is with Cloudflare or with the sourcegraph frontend, and where to look next.

Blackbox exporter

The good:

There are a lot of knobs we can turn and tune here to gather exactly the information we find most useful.
We can easily see all of the endpoints we have configured, per metric, on the same Grafana dashboard.
We get internal and external monitoring of endpoints out of the same service (that is to say, we can configure a ping to hit an external endpoint as well as the services themselves).
Free
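
To illustrate probing an external endpoint and an internal service from the same place, a Prometheus scrape job for blackbox exporter might look roughly like this. It follows the relabelling pattern from the blackbox exporter documentation. The internal service URL, the exporter address, and the job name are assumptions made for the example, not values taken from our cluster.

```yaml
scrape_configs:
  - job_name: blackbox            # job name is an assumption
    metrics_path: /probe
    params:
      module: [http_2xx]          # module defined in the blackbox exporter config
    static_configs:
      - targets:
          - https://sourcegraph.com              # external, via Cloudflare
          - http://sourcegraph-frontend:30080    # hypothetical internal service URL
    relabel_configs:
      # Pass the target URL to the exporter as the ?target= parameter.
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label on the resulting metrics.
      - source_labels: [__param_target]
        target_label: instance
      # Scrape the exporter itself rather than the target.
      - target_label: __address__
        replacement: blackbox-exporter:9115      # assumed exporter address; 9115 is the default port
```

Because both targets share the same job and metric names, they can be plotted next to each other on one dashboard panel and compared during an incident.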

The bad:

There is some initial setup required in order to determine what metrics we want, and on which endpoints, all of which needs to be configured manually in the various config files (blackbox exporter and prometheus, as well as any further alertmanager config that may be required).
We essentially only get one source of truth here, as we are testing from inside our own cluster, out to the world, then back again. We do not get multiple locations like site24x7.
There will be ongoing maintenance. At this stage it is hard to say how much, but one could assume that the more metrics we gather, the more we need to maintain to ensure they're up to date.
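
For a sense of what the manual setup involves on the exporter side, a blackbox exporter module for a plain HTTP check could be sketched as below. The module name and timeout are assumptions; the keys shown are standard options of the HTTP prober.

```yaml
modules:
  http_2xx:                        # module name is an assumption
    prober: http
    timeout: 5s                    # assumed value; tune per endpoint
    http:
      method: GET
      valid_status_codes: [200]    # if omitted, any 2xx counts as success
      preferred_ip_protocol: ip4
```

Each additional kind of check (a different expected status, a body match, a TCP or ICMP probe) generally means another module here plus a matching scrape job in prometheus, which is where the ongoing maintenance cost comes from.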

What's next

Determine what we consider to be missing from blackbox exporter that we currently have in site24x7, considering the metrics out of the box that are displayed on site24x7 vs what we can and need to configure in blackbox exporter.
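
As a starting point for that comparison, blackbox exporter's HTTP prober already exports a per-phase timing metric, `probe_http_duration_seconds`, with a `phase` label (`resolve`, `connect`, `tls`, `processing`, `transfer`), which roughly corresponds to the DNS/SSL/download breakdown in site24x7. A recording rule along these lines could smooth it for dashboards; the group and rule names are assumptions.

```yaml
# Sketch only: group and record names are assumptions.
groups:
  - name: blackbox-timing
    rules:
      # 5-minute average of each probe phase, keeping the instance and phase labels.
      - record: endpoint:probe_http_duration_seconds:avg5m
        expr: avg_over_time(probe_http_duration_seconds[5m])
```

Other metrics worth checking against the site24x7 view include `probe_duration_seconds` (total probe time) and `probe_ssl_earliest_cert_expiry` (certificate expiry as a Unix timestamp).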

Ensure that alerting from blackbox exporter is actually working, and that it's a trustworthy source. For example, the most recent outage reported by site24x7 is here, and here is the same time period graphed from blackbox exporter for both the external endpoint and the frontend service.