The report-a-bug page should include a list of alerts that fired in the last 7 days, queried from existing Prometheus data

Created by: slimsag

https://k8s.sgdev.org/site-admin/report-bug

This page should include a JSON entry indicating which alerts fired in Grafana over the last 7 days, and when, so we can easily see which alerts are firing intermittently or constantly on a customer instance:

"alerts": [
  {"timestamp": "...", "service_name": "gitserver", "name": "low_disk_space", "value": "0"},
  {"timestamp": "...", "service_name": "gitserver", "name": "low_disk_space", "value": "1"}
]

I believe this data could easily be acquired from here, just by using Go to hit the Prometheus admin API.
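
As a rough sketch of what that Go code might look like: the block below uses Prometheus's standard HTTP query endpoint, `/api/v1/query_range`, rather than an admin endpoint, and decodes the matrix response into flat samples. The metric name `alert_count`, the `service_name` and `name` labels, the 5-minute step, and all function and type names are assumptions made for illustration, not something this issue specifies.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/url"
	"strconv"
	"time"
)

// rangeResponse mirrors the parts of a Prometheus /api/v1/query_range
// matrix response that this sketch reads.
type rangeResponse struct {
	Status string `json:"status"`
	Data   struct {
		Result []struct {
			Metric map[string]string `json:"metric"`
			Values [][2]interface{}  `json:"values"` // [unix seconds, "value"]
		} `json:"result"`
	} `json:"data"`
}

// Sample is one raw data point of one alert series (hypothetical type).
type Sample struct {
	Time        time.Time
	ServiceName string
	Name        string
	Value       string
}

// rangeQueryURL builds a query_range URL covering `window` up to `end`.
func rangeQueryURL(base, query string, end time.Time, window, step time.Duration) string {
	q := url.Values{}
	q.Set("query", query)
	q.Set("start", strconv.FormatInt(end.Add(-window).Unix(), 10))
	q.Set("end", strconv.FormatInt(end.Unix(), 10))
	q.Set("step", strconv.FormatInt(int64(step.Seconds()), 10))
	return base + "/api/v1/query_range?" + q.Encode()
}

// parseRangeResponse flattens a matrix response into samples.
// The label names "service_name" and "name" are assumptions.
func parseRangeResponse(body []byte) ([]Sample, error) {
	var resp rangeResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	if resp.Status != "success" {
		return nil, fmt.Errorf("prometheus query failed: status %q", resp.Status)
	}
	var out []Sample
	for _, series := range resp.Data.Result {
		for _, v := range series.Values {
			secs, okTime := v[0].(float64)
			val, okVal := v[1].(string)
			if !okTime || !okVal {
				return nil, fmt.Errorf("unexpected sample shape: %v", v)
			}
			out = append(out, Sample{
				Time:        time.Unix(int64(secs), 0).UTC(), // drops fractional seconds
				ServiceName: series.Metric["service_name"],
				Name:        series.Metric["name"],
				Value:       val,
			})
		}
	}
	return out, nil
}

// fetchAlertSamples queries the last 7 days of the assumed alert_count metric.
func fetchAlertSamples(client *http.Client, base string, now time.Time) ([]Sample, error) {
	resp, err := client.Get(rangeQueryURL(base, "alert_count", now, 7*24*time.Hour, 5*time.Minute))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	return parseRangeResponse(body)
}
```

A caller might invoke `fetchAlertSamples(http.DefaultClient, "http://prometheus:9090", time.Now())`, where the address is a placeholder for wherever the instance's Prometheus is reachable.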

I would want to know:

  1. That their instance has the alerts defined (i.e. when the value is zero, don't exclude it)
  2. When the alert count changed, if at all. Do not include repeated information (I want to be able to read the JSON in an editor and make sense of it)
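
The two requirements above could be met with a small filtering pass, sketched here in Go. It keeps the first entry of every alert series even when its value is zero, then keeps later entries only when the value differs from the previous one for that series. The `AlertEntry` type and the function names are hypothetical. The JSON field names follow the example earlier in this issue, and the input is assumed to be in chronological order.

```go
package main

import "encoding/json"

// AlertEntry matches the JSON shape proposed in this issue (hypothetical type).
type AlertEntry struct {
	Timestamp   string `json:"timestamp"`
	ServiceName string `json:"service_name"`
	Name        string `json:"name"`
	Value       string `json:"value"`
}

// changesOnly assumes entries are in chronological order. It keeps the
// first entry seen for each (service_name, name) pair, including zero
// values, and afterwards only entries whose value changed.
func changesOnly(entries []AlertEntry) []AlertEntry {
	last := make(map[[2]string]string) // series key -> last emitted value
	out := make([]AlertEntry, 0, len(entries))
	for _, e := range entries {
		key := [2]string{e.ServiceName, e.Name}
		if prev, seen := last[key]; seen && prev == e.Value {
			continue // repeated information, skip it
		}
		last[key] = e.Value
		out = append(out, e)
	}
	return out
}

// marshalAlerts renders the filtered entries under an "alerts" key,
// indented so the result is readable in an editor.
func marshalAlerts(entries []AlertEntry) ([]byte, error) {
	return json.MarshalIndent(map[string][]AlertEntry{"alerts": changesOnly(entries)}, "", "  ")
}
```

With this approach an alert that never fired still appears once with value "0", which shows the instance has the alert defined, while an alert that stays at "1" for hours produces a single entry rather than one per scrape.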

This is important because it gives us a way to get this information from customers without going through a "screenshot the home dashboard and then if any alerts fired I'll ask you again to screenshot another page to tell me what alert that actually was" -- and because these alerts are going to be more and more important going forward.

This is easier to add than packaging a full metrics dump, and easier than broadcasting this information up to sourcegraph.com 24/7. We would also need this anyway for more privacy-conscious customers who disable pings and would refuse to send us a full metrics dump.