monitoring: generate solutions documentation from Go monitoring definitions
Created by: slimsag
This PR requires every Observable definition to list possible solutions (or "none"). These snippets of Markdown documentation are then compiled automatically into generated documentation that serves as a starting point for any admins encountering these alerts.
For example, with this observable definition:
We get this documentation generated:
frontend: 99th_percentile_search_request_duration
Descriptions:
- frontend: 20s+ 99th percentile successful search request duration over 5m
Possible solutions:
- Get details on the exact queries that are slow by configuring "observability.logSlowSearches": 20, in the site configuration and looking for frontend warning logs prefixed with "slow search request" for additional details.
- Check that most repositories are indexed by visiting https://sourcegraph.example.com/site-admin/repositories?filter=needs-index (it should show few or no results.)
- Kubernetes: Check CPU usage of zoekt-webserver in the indexed-search pod, consider increasing CPU limits in the indexed-search.Deployment.yaml if regularly hitting max CPU utilization.
- Docker Compose: Check CPU usage on the Zoekt Web Server dashboard, consider increasing cpus: of the zoekt-webserver container in docker-compose.yml if regularly hitting max CPU utilization.
You can view the entire generated documentation here: https://github.com/sourcegraph/sourcegraph/blob/2041ba23789cbb3208064091bb046dc0ecad20b9/doc/admin/observability/alert_solutions.md
For now, there is some repetition (the solutions for CPU, memory, and container-restart alerts are repeated per service), but where this really shines is with more service-specific metrics like the one above. In practice, users will just Cmd+F to find the alerts firing on their instance.
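The mechanism described in this PR can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the `Observable` struct here is a simplified stand-in, and the field names, the `Validate` and `GenerateSolutionsDoc` helpers, the Markdown layout, and the choice to omit observables marked "none" from the output are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// Observable is a hypothetical, simplified stand-in for a monitoring
// definition. The real type in the repository has more fields.
type Observable struct {
	Service           string // e.g. "frontend"
	Name              string // e.g. "99th_percentile_search_request_duration"
	Description       string // human-readable alert description
	PossibleSolutions string // Markdown list of solutions, or "none"
}

// Validate enforces the rule from this PR: possible solutions must be
// listed explicitly, or explicitly declared as "none".
func (o Observable) Validate() error {
	if strings.TrimSpace(o.PossibleSolutions) == "" {
		return fmt.Errorf("%s: %s: PossibleSolutions must be set (use \"none\" if there are none)", o.Service, o.Name)
	}
	return nil
}

// GenerateSolutionsDoc compiles the solution snippets into one Markdown
// document. Skipping "none" entries is an assumption of this sketch.
func GenerateSolutionsDoc(observables []Observable) (string, error) {
	var b strings.Builder
	for _, o := range observables {
		if err := o.Validate(); err != nil {
			return "", err
		}
		solutions := strings.TrimSpace(o.PossibleSolutions)
		if solutions == "none" {
			continue
		}
		fmt.Fprintf(&b, "## %s: %s\n\n", o.Service, o.Name)
		fmt.Fprintf(&b, "**Descriptions:**\n\n- %s\n\n", o.Description)
		fmt.Fprintf(&b, "**Possible solutions:**\n\n%s\n\n", solutions)
	}
	return b.String(), nil
}

func main() {
	doc, err := GenerateSolutionsDoc([]Observable{{
		Service:     "frontend",
		Name:        "99th_percentile_search_request_duration",
		Description: "frontend: 20s+ 99th percentile successful search request duration over 5m",
		PossibleSolutions: `
- Check that most repositories are indexed.
- Kubernetes: Check CPU usage of zoekt-webserver in the indexed-search pod.
`,
	}})
	if err != nil {
		panic(err)
	}
	fmt.Print(doc)
}
```

Failing validation at generation time, rather than silently emitting an empty section, is what makes the field effectively required: a definition without solutions never reaches the generated documentation.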