monitoring: migrate out-of-band alerts to the generator
Created by: bobheadxi
Migrate all out-of-band alert rules defined by deploy-sourcegraph
to the generator, such that each alert now has a panel as part of a service dashboard like all our other alerts, with a few changes:
- previously multi-service alerts are now defined per-service, as per the monitoring pillars (enforced by the generator), since a multi-service alert would require a multi-service dashboard - see the sketch below
- k8s monitoring in a new group (currently, only pod availability)
And a few additions:
- go_gc_duration to go along with go_goroutines + new "golang runtime monitoring" group for relevant services
- an entire prometheus + alertmanager dashboard (to go with `prometheus_metrics_bloat`)
And a few exceptions:
- gitserver alerts were not migrated (they seem to be captured by existing gitserver observables)
- k8s node disk space remaining alerts were not migrated (they seem nonfunctional - see thread)
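To make the per-service change concrete, here is a minimal, self-contained sketch of the idea. It is not the real generator API - the type names, queries, and thresholds below are invented for illustration - but it shows how a shared "golang runtime monitoring" group can be instantiated once per service, so every relevant service gets its own panels and its own alerts:

```go
// Illustrative sketch only: these types mimic the shape of a dashboard
// generator, but the names, queries, and thresholds are invented here and do
// not claim to match the real monitoring generator.
package main

import "fmt"

// Observable ties one Prometheus query to both a dashboard panel and an alert.
type Observable struct {
	Name        string  // identifier used for the panel and the alert
	Description string  // human-readable summary shown on the dashboard
	Query       string  // PromQL evaluated for the panel and the alert
	Warning     float64 // warning-level alert fires when the query exceeds this
}

// Group is a titled row of panels on a single service's dashboard.
type Group struct {
	Title       string
	Observables []Observable
}

// golangRuntimeMonitoring returns the "golang runtime monitoring" group for
// one service. Reusing this helper per service is what turns a previously
// multi-service alert into per-service alerts, each with its own panel.
func golangRuntimeMonitoring(service string) Group {
	return Group{
		Title: "Golang runtime monitoring",
		Observables: []Observable{
			{
				Name:        "go_goroutines",
				Description: "maximum active goroutines",
				Query:       fmt.Sprintf(`max by(instance) (go_goroutines{job=~".*%s"})`, service),
				Warning:     10000, // illustrative threshold
			},
			{
				Name:        "go_gc_duration_seconds",
				Description: "maximum go garbage collection duration",
				Query:       fmt.Sprintf(`max by(instance) (go_gc_duration_seconds{job=~".*%s"})`, service),
				Warning:     2, // illustrative threshold, in seconds
			},
		},
	}
}

func main() {
	for _, svc := range []string{"frontend", "searcher", "symbols"} {
		g := golangRuntimeMonitoring(svc)
		fmt.Printf("%s: %q with %d panels and alerts\n", svc, g.Title, len(g.Observables))
	}
}
```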
Closes https://github.com/sourcegraph/sourcegraph/issues/12117
Try it out
- `kubectl port-forward svc/prometheus 9090:30090 -n prod`
- Check out `monitoring/migrate-out-of-band-alerts`, then run `go generate ./monitoring` and `./dev/grafana.sh`
- Open http://localhost:3370
TODOs
- A few alerts use `for`, so I'm going to go complete #12336 first and add the appropriate `for` parameters here (see the sketch after this list)
- Write up possible solutions for alerts that might have some
- Improve panel options where possible
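For context on the first TODO, here is a small self-contained sketch (plain Go, not Prometheus or generator code; the sample values are made up) of what the alerting `for` parameter means: the condition has to hold continuously for the whole duration before the alert fires, which filters out short-lived spikes.

```go
// Sketch of Prometheus `for:` semantics, not an implementation of them.
package main

import (
	"fmt"
	"time"
)

type sample struct {
	at    time.Duration // offset from the first evaluation
	value float64
}

// firesAfter reports whether samples stay >= threshold continuously for at
// least forDur, mirroring how a rule with `for: 5m` only fires once the
// condition has held across 5 minutes of consecutive evaluations.
func firesAfter(samples []sample, threshold float64, forDur time.Duration) bool {
	var pendingSince time.Duration = -1
	for _, s := range samples {
		if s.value < threshold {
			pendingSince = -1 // condition broke; reset the pending window
			continue
		}
		if pendingSince < 0 {
			pendingSince = s.at
		}
		if s.at-pendingSince >= forDur {
			return true
		}
	}
	return false
}

func main() {
	spike := []sample{{0, 95}, {time.Minute, 20}, {2 * time.Minute, 20}}
	sustained := []sample{{0, 95}, {time.Minute, 96}, {2 * time.Minute, 97}, {5 * time.Minute, 98}}
	fmt.Println(firesAfter(spike, 90, 5*time.Minute))     // false: the spike recovers
	fmt.Println(firesAfter(sustained, 90, 5*time.Minute)) // true: held for the full 5m
}
```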
Created by: pecigonzalo
> previously multi-service alerts are now defined per-service as per monitoring pillars
Is this the 2nd pillar? I don't think it applies here; I believe this was also closely related to defining alerts in Grafana and the mapping between alerts and dashboards (cc/ @slimsag). Unless we need different thresholds for each, I don't think we need to define per-service alerts/recording rules. wdyt?
Related: https://www.robustperception.io/undoing-the-benefits-of-labels
Created by: bobheadxi
@pecigonzalo to have alerts span multiple services, we'd need a multi-service dashboard. We could forgo the dashboard entirely and make the generator capable of creating alerts without dashboards at all, but I'm not sure how I feel about that - it is not explicitly stated in the pillars, but I like the idea that each alert is associated with a specific panel for a specific service so that it is useful to site admins
The monitoring dashboards here are also similar to our container monitoring/provisioning monitoring dashboards, where each service gets its own dashboard
Created by: pecigonzalo
> to have alerts span multiple services, we'd need a multi-service dashboard.
Is this a rule set by our current generator? As I interpreted the pillar, it defines that a graph should have an associated alert, but not that an alert necessarily needs a graph.
If we want to create a dashboard per service that shows service alerts, we should not need to split alerts into multiple alert rules; I believe we could use one of the labels to filter on each graph.
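For what it's worth, here is a rough sketch of that label-based alternative (the `app` label and all names are invented for illustration, not taken from this PR): a single shared alert rule keeps a service-identifying label, and each service dashboard narrows the same metric with a label matcher instead of getting its own rule.

```go
// Rough illustration only: the `app` label and all names here are invented
// for this example and are not part of the PR.
package main

import "fmt"

// sharedAlertExpr is a single rule evaluated across all services; the `app`
// label on the result identifies which service a firing alert belongs to.
const sharedAlertExpr = `max by(app, instance) (go_goroutines) >= 10000`

// panelQuery builds the per-service dashboard query: the same metric,
// narrowed to one service with a label matcher instead of a dedicated rule.
func panelQuery(service string) string {
	return fmt.Sprintf(`max by(instance) (go_goroutines{app=%q})`, service)
}

func main() {
	fmt.Println("shared alert rule:", sharedAlertExpr)
	for _, svc := range []string{"frontend", "gitserver"} {
		fmt.Printf("%s panel: %s\n", svc, panelQuery(svc))
	}
}
```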
Created by: bobheadxi
> As I interpreted the pillar, it defines that a graph should have an associated alert, but not that an alert necessarily needs a graph.
I think requiring a dashboard per alert makes sense as well, otherwise what we are alerting on is essentially invisible. For example, our disk space alerts have been nonfunctional without an obvious way to realize that (we can only tell what we are alerting on by going into config/prometheus itself, and then we have to dive into the specific metric).
Would like to hear what @slimsag thinks about this and defer to his judgement :)
> If we want to create a dashboard per service that shows service alerts, we should not need to split alerts into multiple alert rules, we should be able to use one of the labels to filter on each graph.
This is definitely an improvement that could be made! However, I think this is an implementation detail that is not within the scope of this PR to fix (since such an improvement would apply to many alerts we have) - I'm focusing on simply moving the alerts here for now
Created by: pecigonzalo
> I think requiring a dashboard per alert makes sense as well
I agree that we should link to a relevant dashboard, graph, and/or document; my comment is about the monitoring pillars being cited as the reason why we are splitting into an alert rule per service.
> This is definitely an improvement that could be made! However, I think this is an implementation detail that is not within the scope of this PR to fix (since such an improvement would apply to many alerts we have) - I'm focusing on simply moving the alerts here for now
I understood from "previously multi-service alerts are now defined per-service as per monitoring pillars, since a multi-service alert would require multi-service dashboards" that we were implementing that split into multiple per-service alerts here.
Created by: slimsag
Sounds like the confusion here was cleared up? If not let me know.
You can find why I believe we should only allow 1 dashboard per service here: https://about.sourcegraph.com/handbook/engineering/distribution/observability/monitoring_pillars#faq-why-can-t-i-create-an-ad-hoc-dashboard-e-g-an-http-dashboard
Created by: codecov[bot]
Codecov Report
Merging #12391 into master will decrease coverage by 0.03%. The diff coverage is 71.15%.

| Coverage Diff | master | #12391 | +/- |
|---|---|---|---|
| Coverage | 50.61% | 50.58% | -0.04% |
| Files | 1423 | 1417 | -6 |
| Lines | 80436 | 79283 | -1153 |
| Branches | 6838 | 6552 | -286 |
| Hits | 40711 | 40103 | -608 |
| Misses | 36169 | 35728 | -441 |
| Partials | 3556 | 3452 | -104 |
| Flag | Coverage Δ |
|---|---|
| #go | 51.97% <71.27%> (-0.39%) |
| #integration | 24.25% <0.00%> (+0.10%) |
| #storybook | 12.63% <0.00%> (?) |
| #typescript | 46.85% <0.00%> (+0.86%) |
| #unit | 47.20% <71.15%> (-0.39%) |

Flags with carried forward coverage won't be shown.
| Impacted Files | Coverage Δ |
|---|---|
| browser/src/shared/code-hosts/github/codeHost.ts | 60.40% <0.00%> (-0.41%) |
| cmd/frontend/backend/site_admin.go | 0.00% <0.00%> (ø) |
| cmd/frontend/graphqlbackend/campaigns.go | 0.00% <0.00%> (ø) |
| cmd/frontend/graphqlbackend/json.go | 47.36% <0.00%> (-5.58%) |
| cmd/frontend/graphqlbackend/saved_searches.go | 33.73% <0.00%> (ø) |
| cmd/frontend/graphqlbackend/search.go | 66.27% <0.00%> (-0.68%) |
| cmd/frontend/graphqlbackend/search_alert.go | 28.36% <0.00%> (-0.79%) |
| cmd/frontend/graphqlbackend/search_results.go | 43.55% <0.00%> (+0.16%) |
| cmd/frontend/internal/app/debugproxies/scanner.go | 38.20% <0.00%> (-1.34%) |
| cmd/query-runner/email.go | 0.00% <0.00%> (ø) |

... and 206 more