monitoring: improve containers metrics
Created by: michaellzc
Part of:
- https://github.com/sourcegraph/sourcegraph/issues/33437
- https://github.com/sourcegraph/sourcegraph/issues/33438
This PR adds the following:
- A shared observable metric that indicates how many times a container (or one of its child processes) has been OOM killed
  - It's added to the provisioning indicators, since an OOM kill is a sign of underprovisioned resources
- A new dashboard that shows the resource usage of all containers at a glance
  - It contains two groups: one for all containers, and one for containers that may have scaling issues
    - The first group is extremely noisy, so it's hidden by default
    - The second group gives site admins a more focused view for quickly finding problematic resources
Test plan
Connect to demo's Prometheus:
gcloud compute start-iap-tunnel default-$NEW_DEPLOYMENT-instance 9090 --local-host-port=localhost:4445 --zone us-central1-f --project $PROJECT_PREFIX-$CUSTOMER
Or use dogfood:
kubectl port-forward svc/prometheus 4445:30090
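Before pointing Grafana at the forwarded port, it can help to confirm that Prometheus is actually reachable on it. The following is a minimal sketch, not part of this PR: the helper names `prom_query_endpoint` and `prom_query` and the `PROM_URL` variable are hypothetical, and the only things assumed are the local port 4445 from the commands above and the standard Prometheus HTTP API path `/api/v1/query`.

```shell
# Base URL of the forwarded Prometheus (assumption: 4445 is the local port
# opened by the tunnel / port-forward commands above).
PROM_URL="${PROM_URL:-http://localhost:4445}"

# Print the instant-query endpoint of the Prometheus HTTP API,
# dropping one trailing slash from PROM_URL if present.
prom_query_endpoint() {
  printf '%s/api/v1/query\n' "${PROM_URL%/}"
}

# Run an instant PromQL query (first argument) and print the raw JSON response.
# --fail makes curl exit non-zero on HTTP errors, e.g. when the tunnel is down.
prom_query() {
  curl --silent --show-error --fail --get \
    --data-urlencode "query=$1" \
    "$(prom_query_endpoint)"
}
```

With the tunnel up, `prom_query 'up'` should return a JSON document whose `status` field is `success`. The same helper can be used to spot-check OOM kill data, for example with cAdvisor's `container_oom_events_total` counter; whether the new observable is built on that exact metric is an assumption here.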
Update dev/grafana/all/datasources.yaml:
# Configuration for all non-Linux platforms.
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://docker.for.mac.localhost:4445 # use the remote prometheus instance on demo instead of the local one
    isDefault: true
    editable: false
  - name: Jaeger
    type: jaeger
    access: proxy
    url: http://host.docker.internal:16686/-/debug/jaeger
  - name: Loki
    type: loki
    access: proxy
    url: http://host.docker.internal:3100
Start the monitoring stack:
sg run grafana
See screenshot.
The first group is hidden by default.