Investigate / propose monitoring federation
Created by: slimsag
It would be ideal if we knew from instances in the wild:
- How many / which alerts are firing over time (`alert_count`), for improving our alerting thresholds out of the box and promoting warning alerts to critical ones safely.
- Aggregate search latency contrasted with resource availability of services (CPU/memory/IO/replica count), for improving search performance and the resource estimator's estimates.
- Aggregate error vs. success rates of search, to correlate with resource availability of services
This could potentially be done via a very narrow whitelist of Prometheus metrics and labels. Only whitelisted data would be sent to Sourcegraph.com via our regular pinging mechanism, and it could then be consumed there, for example by a Prometheus instance running on Sourcegraph.com.
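
To make the whitelist idea concrete, here is a rough sketch in Go of how samples might be filtered before they are attached to a ping. Everything in it is an assumption for illustration: the `Sample` and `Whitelist` types, the `Filter` method, and the label names (`level`, `service_name`) are hypothetical and not existing Sourcegraph or Prometheus client APIs. Only `alert_count` comes from this issue.

```go
package main

import "fmt"

// Sample is a hypothetical, simplified stand-in for a single Prometheus sample.
type Sample struct {
	Metric string
	Labels map[string]string
	Value  float64
}

// Whitelist maps each allowed metric name to the set of label names
// that may be sent along with it. Anything not listed is dropped.
type Whitelist map[string]map[string]bool

// Filter returns only samples whose metric is whitelisted, with every
// non-whitelisted label removed from the copy that is returned.
func (w Whitelist) Filter(samples []Sample) []Sample {
	var out []Sample
	for _, s := range samples {
		allowed, ok := w[s.Metric]
		if !ok {
			continue // metric is not whitelisted: never leaves the instance
		}
		kept := make(map[string]string)
		for name, value := range s.Labels {
			if allowed[name] {
				kept[name] = value
			}
		}
		out = append(out, Sample{Metric: s.Metric, Labels: kept, Value: s.Value})
	}
	return out
}

func main() {
	// Assumed label names; the real labels on alert_count may differ.
	w := Whitelist{"alert_count": {"level": true, "service_name": true}}

	samples := []Sample{
		{
			Metric: "alert_count",
			Labels: map[string]string{"level": "warning", "service_name": "frontend", "instance": "10.0.0.1:6060"},
			Value:  2,
		},
		{
			// Hypothetical metric that is not whitelisted and must not be sent.
			Metric: "some_private_metric",
			Labels: map[string]string{"repo": "private/repo"},
			Value:  1,
		},
	}

	for _, s := range w.Filter(samples) {
		fmt.Println(s.Metric, s.Labels, s.Value)
	}
}
```

One design consequence worth noting: once labels are stripped, several series can end up with identical label sets, so the sender (or the receiving side) would likely need to aggregate them before they are stored.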