monitoring: improve containers metrics (!34808)

Created by: michaellzc

part of

This PR adds the following:

A shared observable metric that counts how many times a container (or one of its child processes) has been OOM-killed
- It is added to the provisioning indicators, since OOM kills are a sign of underprovisioned resources
A new dashboard that shows the resource usage of all containers at a glance
- It contains two groups: one for all containers and one for containers that may have scaling issues
- The first group is extremely noisy, so it is hidden by default
- The second group gives site admins a more focused view for quickly finding problematic resources

Test plan

Connect to the demo instance's Prometheus:

gcloud compute start-iap-tunnel default-$NEW_DEPLOYMENT-instance 9090 --local-host-port=localhost:4445 --zone us-central1-f --project $PROJECT_PREFIX-$CUSTOMER

Or use dogfood:

kubectl port-forward svc/prometheus 4445:30090
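
Before pointing Grafana at the forwarded port, it can help to confirm that Prometheus is actually reachable. The following is a minimal sketch, not part of this PR: the helper names `prom_ready` and `prom_query` and the `PROM_URL` variable are hypothetical, and it assumes the tunnel or port-forward above is listening on localhost:4445. It only uses Prometheus's `/-/ready` endpoint and its `/api/v1/query` HTTP API.

```shell
# Assumption: the tunnel or port-forward is listening on localhost:4445.
PROM_URL="${PROM_URL:-http://localhost:4445}"

# Succeeds (exit 0) when Prometheus reports that it is ready to serve queries.
prom_ready() {
  curl -sf --max-time 5 "${PROM_URL}/-/ready" >/dev/null
}

# Runs an instant query through the Prometheus HTTP API and prints the raw JSON.
prom_query() {
  curl -sS --max-time 10 --get "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}
```

With the tunnel open, `prom_ready && prom_query 'up'` should print a JSON response; if `prom_ready` fails, the port-forward is probably not running.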

Update dev/grafana/all/datasources.yaml so the Prometheus datasource points at the forwarded port:

# Configuration for all non-Linux platforms.
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://docker.for.mac.localhost:4445 # use the remote prometheus instance on demo instead of the local one
    isDefault: true
    editable: false
  - name: Jaeger
    type: jaeger
    access: proxy
    url: http://host.docker.internal:16686/-/debug/jaeger
  - name: Loki
    type: loki
    access: proxy
    url: http://host.docker.internal:3100

Start the monitoring stack:

sg run grafana

See screenshot: the first group is hidden by default.
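
The OOM-kill data can also be spot-checked outside Grafana by querying Prometheus directly. This is a sketch under stated assumptions rather than the query the dashboard uses: it assumes OOM kills are exposed through cAdvisor's `container_oom_events_total` counter, that Prometheus is reachable on localhost:4445 as in the steps above, and the names `OOM_QUERY` and `oom_killed_containers` are hypothetical.

```shell
# Assumption: Prometheus is reachable through the forwarded port from the test plan.
PROM_URL="${PROM_URL:-http://localhost:4445}"

# Assumption: OOM kills are exposed through cAdvisor's container_oom_events_total
# counter; the metric backing the new observable may differ.
OOM_QUERY='max by (name) (container_oom_events_total) > 0'

# Prints the raw JSON for containers that recorded at least one OOM event.
oom_killed_containers() {
  curl -sS --max-time 10 --get "${PROM_URL}/api/v1/query" --data-urlencode "query=${OOM_QUERY}"
}
```

Calling `oom_killed_containers` returns an empty result set when no container has been OOM-killed, which is the expected state on a well-provisioned instance.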
