monitoring: improve containers metrics

Administrator requested to merge 05-02-monitoring_improve_containers_metrics into main

Created by: michaellzc

part of

This PR adds the following:

  • A shared observable metric that counts how many times a container (or one of its child processes) has been OOM-killed
    • It is added to the provisioning indicators, since OOM kills are a sign of underprovisioned resources
  • A new dashboard that shows the resource usage of all containers at a glance
    • It contains two groups: one for all containers, and one for containers that may have scaling issues
    • The first group is extremely noisy, so it is hidden by default
    • The second group gives site admins a more focused view for quickly finding problematic resources

Test plan

Connect to the demo instance's Prometheus:

gcloud compute start-iap-tunnel default-$NEW_DEPLOYMENT-instance 9090 --local-host-port=localhost:4445 --zone us-central1-f --project $PROJECT_PREFIX-$CUSTOMER

Or use dogfood:

kubectl port-forward svc/prometheus 4445:30090
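
Before pointing Grafana at the forwarded port, it can help to confirm that Prometheus actually answers on it. The following is a minimal sketch, assuming the tunnel or port-forward above is listening on localhost:4445; the `PROM_URL` variable and the `prom_query` helper are hypothetical names introduced here for illustration and are not part of this PR. It uses the standard Prometheus HTTP API endpoint `/api/v1/query`.

```shell
# Base URL of the forwarded Prometheus instance.
# Assumption: port 4445 matches the tunnel / port-forward command above.
PROM_URL="${PROM_URL:-http://localhost:4445}"

# prom_query EXPR
# Hypothetical helper: sends an instant query to the Prometheus HTTP API
# and prints the raw JSON response. The expression is URL-encoded by curl.
prom_query() {
  curl -fsS -G "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}

# Example calls (run manually once the tunnel is up):
#   prom_query 'up'
#   prom_query 'container_oom_events_total'  # assumed cAdvisor metric name, not confirmed by this PR
```

If the connection works, the response to `prom_query 'up'` should be a JSON document whose `status` field is `success`. A connection error from curl usually means the tunnel or port-forward is not running.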

Update dev/grafana/all/datasources.yaml so that the Prometheus datasource points at the forwarded port:

# Configuration for all non-Linux platforms.
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://docker.for.mac.localhost:4445 # use the remote prometheus instance on demo instead of the local one
    isDefault: true
    editable: false
  - name: Jaeger
    type: jaeger
    access: proxy
    url: http://host.docker.internal:16686/-/debug/jaeger
  - name: Loki
    type: loki
    access: proxy
    url: http://host.docker.internal:3100

Start the monitoring stack:

sg run grafana

See the screenshots below. The first group is hidden by default.

[Screenshot: CleanShot 2022-05-04 at 10 48 26]

[Screenshot: CleanShot 2022-05-04 at 10 48 03]

[Screenshot: CleanShot 2022-05-04 at 11 25 13]
