monitoring for CPU and memory usage in Kubernetes
Created by: slimsag
Today, we have container CPU and memory monitoring on each service dashboard e.g. at the bottom of https://k8s.sgdev.org/-/debug/grafana/d/frontend/frontend?orgId=1
As the dashboard itself notes, however, this is not available in Kubernetes deployments yet. In docker-compose deployments we use cadvisor to export these metrics for Prometheus to scrape.
Historically, I believe we had node_exporter, which exposed some container CPU and memory metrics but not as many as cadvisor itself does. (Internally, I understand Kubernetes runs cadvisor, but I think it may be an outdated version or something similar.)
At first glance, I think we should add cadvisor to https://github.com/sourcegraph/deploy-sourcegraph by doing something like what is done in the official cadvisor Kubernetes deployment here: https://github.com/google/cadvisor/tree/master/deploy/kubernetes
This will give us Prometheus metrics as close as possible to those in our docker-compose deployments, so the existing queries on the Grafana dashboards should likely just work (or, if not, it should be easy to tweak them so they work in both environments).
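As a rough idea of what adding cadvisor to the base deployment could look like, here is a minimal sketch of a DaemonSet loosely modeled on the upstream cadvisor Kubernetes manifests. The resource name, labels, and image tag are assumptions for illustration, not what deploy-sourcegraph ships, and the upstream manifests include more settings (namespace, service account, resource limits) than shown here.

```yaml
# Hypothetical sketch of a cadvisor DaemonSet for base/; names and labels are illustrative.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cadvisor
spec:
  selector:
    matchLabels:
      app: cadvisor
  template:
    metadata:
      labels:
        app: cadvisor
    spec:
      containers:
      - name: cadvisor
        # Placeholder tag: replace VERSION with a pinned cadvisor release.
        image: gcr.io/cadvisor/cadvisor:VERSION
        ports:
        - name: http
          containerPort: 8080   # cadvisor serves /metrics here by default
          protocol: TCP
        volumeMounts:
        # Read-only host mounts that cadvisor uses to inspect containers on the node.
        - name: rootfs
          mountPath: /rootfs
          readOnly: true
        - name: var-run
          mountPath: /var/run
          readOnly: true
        - name: sys
          mountPath: /sys
          readOnly: true
        - name: docker
          mountPath: /var/lib/docker
          readOnly: true
      volumes:
      - name: rootfs
        hostPath:
          path: /
      - name: var-run
        hostPath:
          path: /var/run
      - name: sys
        hostPath:
          path: /sys
      - name: docker
        hostPath:
          path: /var/lib/docker
```

The hostPath mounts are the part most likely to be rejected on locked-down clusters, which is presumably why a non-privileged variant has to leave this resource out.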
One challenge, however, is that we want to make sure:
- It is in the base deployment, for all stock / out-of-the-box Kubernetes deployments: https://github.com/sourcegraph/deploy-sourcegraph/tree/master/base
- We edit the non-privileged overlay so that it disables / removes the cadvisor deployment we add in `base/`, since it will require privileged access to the Kubernetes cluster.
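
One way the non-privileged overlay could drop the resource is a kustomize strategic merge patch using the `$patch: delete` directive. This is only a sketch: the file name `delete-cadvisor.yaml` and the DaemonSet name `cadvisor` are hypothetical, and the `patchesStrategicMerge` entry would be added alongside whatever the overlay's existing `kustomization.yaml` already contains.

```yaml
# delete-cadvisor.yaml (hypothetical file in the non-privileged overlay)
# Matches the DaemonSet by kind and name, then removes it from the rendered output.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cadvisor   # assumed name of the resource added in base/
$patch: delete
---
# Addition to the overlay's kustomization.yaml
patchesStrategicMerge:
- delete-cadvisor.yaml
```

Deleting via a patch keeps cadvisor in the stock base deployment while letting the overlay opt out, rather than requiring every other overlay to opt in.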