monitoring: generate solutions documentation from Go monitoring definitions

Administrator requested to merge sg/generated-solutions into master

Created by: slimsag

This PR requires every Observable definition to list possible solutions (or "none"). These Markdown snippets are then compiled automatically into generated documentation that serves as a starting point for admins who encounter these alerts.

For example, with this observable definition:

https://github.com/sourcegraph/sourcegraph/blob/2041ba23789cbb3208064091bb046dc0ecad20b9/monitoring/frontend.go#L13-L27

We get this documentation generated:

frontend: 99th_percentile_search_request_duration

Descriptions:

  • frontend: 20s+ 99th percentile successful search request duration over 5m

Possible solutions:

  • Get details on the exact queries that are slow by configuring "observability.logSlowSearches": 20, in the site configuration and looking for frontend warning logs prefixed with slow search request for additional details.
  • Check that most repositories are indexed by visiting https://sourcegraph.example.com/site-admin/repositories?filter=needs-index (it should show few or no results.)
  • Kubernetes: Check CPU usage of zoekt-webserver in the indexed-search pod, consider increasing CPU limits in the indexed-search.Deployment.yaml if regularly hitting max CPU utilization.
  • Docker Compose: Check CPU usage on the Zoekt Web Server dashboard, consider increasing cpus: of the zoekt-webserver container in docker-compose.yml if regularly hitting max CPU utilization.

You can view the entire generated documentation here: https://github.com/sourcegraph/sourcegraph/blob/2041ba23789cbb3208064091bb046dc0ecad20b9/doc/admin/observability/alert_solutions.md
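
To make the mechanism concrete, here is a minimal sketch of how a required solutions field and the documentation generation could fit together. It is not the code from this PR: the `Observable` struct is a simplified stand-in, and the names `validate` and `renderSolutions`, the Markdown layout, and the choice to skip observables marked "none" are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Observable is a hypothetical, simplified stand-in for the monitoring
// definition type; the real struct has more fields.
type Observable struct {
	Name              string // e.g. "99th_percentile_search_request_duration"
	Description       string // human-readable alert description
	PossibleSolutions string // Markdown list of solutions, or the literal "none"
}

// validate rejects observables that do not state possible solutions,
// which is what makes the field effectively required.
func (o Observable) validate() error {
	if strings.TrimSpace(o.PossibleSolutions) == "" {
		return errors.New(o.Name + `: PossibleSolutions must be set (use "none" if there are none)`)
	}
	return nil
}

// renderSolutions compiles one documentation entry per observable.
// Observables whose solutions are "none" are skipped (an assumption).
func renderSolutions(service string, observables []Observable) (string, error) {
	var b strings.Builder
	for _, o := range observables {
		if err := o.validate(); err != nil {
			return "", err
		}
		if o.PossibleSolutions == "none" {
			continue
		}
		fmt.Fprintf(&b, "## %s: %s\n\n", service, o.Name)
		fmt.Fprintf(&b, "**Descriptions:**\n\n- %s: %s\n\n", service, o.Description)
		fmt.Fprintf(&b, "**Possible solutions:**\n\n%s\n\n", strings.TrimSpace(o.PossibleSolutions))
	}
	return b.String(), nil
}

func main() {
	out, err := renderSolutions("frontend", []Observable{{
		Name:              "99th_percentile_search_request_duration",
		Description:       "20s+ 99th percentile successful search request duration over 5m",
		PossibleSolutions: "- Check that most repositories are indexed.",
	}})
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

Failing validation at generation time, rather than silently emitting an empty section, is what would force authors to think about solutions when they define an alert.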

For now, the output is somewhat repetitive (we repeat how to solve CPU/memory/container restarts per service), but it really shines for more service-specific metrics like the one above. In practice, users will just cmd+f to find the alerts firing on their instance.
