
Investigate / propose monitoring federation

Created by: slimsag

It would be ideal if we knew from instances in the wild:

  • How many / which alerts are firing over time (alert_count), for improving our alerting thresholds out of the box and promoting warning alerts to critical ones safely.
  • Aggregate search latency contrasted with resource availability of services (CPU/memory/IO/replica count), for improving search performance and the resource estimator's estimates.
  • Aggregate error vs. success rates of search, to correlate with the resource availability of services.

This could potentially be done via a very narrow whitelist of Prometheus metrics and labels, sent to Sourcegraph.com via our regular pinging mechanism and then consumed there, for example by a Prometheus instance running on Sourcegraph.com.
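
To make the whitelist idea concrete, here is a minimal sketch in Go of filtering samples before they are attached to a ping. Everything in it is an assumption for illustration: the `Sample` type, the `filterSamples` helper, and the label names allowed for `alert_count` are hypothetical and not taken from the existing ping or monitoring code.

```go
package main

import "fmt"

// Sample is a hypothetical, simplified view of one Prometheus sample.
type Sample struct {
	Name   string
	Labels map[string]string
	Value  float64
}

// allowlist maps each permitted metric name to the label names that may be
// sent. The label names here are assumptions, not the real alert_count labels.
var allowlist = map[string]map[string]bool{
	"alert_count": {"service_name": true, "level": true, "name": true},
}

// filterSamples drops every sample whose metric name is not in allow, and
// strips any label that is not explicitly permitted for that metric.
func filterSamples(samples []Sample, allow map[string]map[string]bool) []Sample {
	var out []Sample
	for _, s := range samples {
		allowedLabels, ok := allow[s.Name]
		if !ok {
			continue // metric is not whitelisted: never leaves the instance
		}
		kept := make(map[string]string)
		for k, v := range s.Labels {
			if allowedLabels[k] {
				kept[k] = v
			}
		}
		out = append(out, Sample{Name: s.Name, Labels: kept, Value: s.Value})
	}
	return out
}

func main() {
	scraped := []Sample{
		{Name: "alert_count", Labels: map[string]string{"level": "critical", "instance": "10.0.0.1:6060"}, Value: 2},
		{Name: "some_internal_metric", Labels: map[string]string{"repo": "private/repo"}, Value: 1},
	}
	for _, s := range filterSamples(scraped, allowlist) {
		fmt.Printf("%s %v %v\n", s.Name, s.Labels, s.Value)
	}
}
```

Filtering by both metric name and label name is deliberate in this sketch: a whitelisted metric can still carry labels (hostnames, repository names) that should not leave the instance. One caveat is that stripping labels can make two distinct series look identical, so a real implementation would probably need to aggregate them before sending.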