Kubernetes probes report failures when doing deployments
Created by: pecigonzalo
When doing a deployment, some containers log readiness probe failures such as:
Pod indexed-search-4 in prod: Readiness probe failed: Get http://10.164.7.104:6070/healthz: read tcp 10.164.7.1:33362->10.164.7.104:6070: read: connection reset by peer
Initially, we thought the failures came from the initial readiness probes, so we increased their limit in https://github.com/sourcegraph/deploy-sourcegraph-dot-com/pull/3289, as our nodes take around 60s to start serving traffic. This greatly reduced the number of failure events, but we still get a single event per pod.
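To illustrate why raising a probe limit cuts down startup failures, here is a rough sketch. It assumes a simplified model in which probes fire at a fixed period after an initial delay; the function name and parameters are hypothetical, loosely mirroring the Kubernetes `initialDelaySeconds` and `periodSeconds` probe fields, and the issue does not say which setting the linked PR actually changed.

```python
import math


def expected_startup_failures(startup_seconds, initial_delay_seconds, period_seconds):
    """Estimate how many probes fire before the container can answer.

    Simplified model (an assumption, not the kubelet's exact scheduling):
    probes fire at initial_delay, initial_delay + period, and so on, and
    every probe strictly before startup_seconds fails.
    """
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    if initial_delay_seconds >= startup_seconds:
        # The first probe only fires once the container is already serving.
        return 0
    return math.ceil((startup_seconds - initial_delay_seconds) / period_seconds)
```

Under this model, a container that needs about 60s to serve traffic would fail every probe sent before that point, while an initial delay of 60s or more would produce no startup failures. It does not explain the single remaining event per pod.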
While investigating further, we noticed that the "Readiness probe failed" logs come before the "Started container" logs, and found a bug in Kubernetes which matches the behavior we are experiencing. The events below are listed newest first.
I 2020-08-28T13:44:47Z Started container zoekt-webserver
I 2020-08-28T13:44:47Z Created container zoekt-webserver
I 2020-08-28T13:44:47Z Container image "index.docker.io/sourcegraph/indexed-searcher:insiders@sha256:d2e87635cf48c4c5d744962540751022013359bc00a9fb8e1ec2464cc6a0a2b8" already present on machine
I 2020-08-28T13:44:39Z Successfully assigned prod/indexed-search-4 to gke-cloud-cloud-node-pool-8dac0f63-l3wk
W 2020-08-28T13:44:30Z Readiness probe failed: Get http://10.164.7.104:6070/healthz: read tcp 10.164.7.1:33362->10.164.7.104:6070: read: connection reset by peer
W 2020-08-28T13:44:30Z Readiness probe failed: Get http://10.164.7.104:6070/healthz: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
I 2020-08-28T13:44:24Z Stopping container zoekt-indexserver
I 2020-08-28T13:44:24Z Stopping container jaeger-agent
I 2020-08-28T13:44:24Z Stopping container zoekt-webserver
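The ordering in the events above can be checked mechanically. The following is a minimal sketch, assuming the `severity timestamp message` line layout seen in this log; the helper names are hypothetical and the sample contains three lines copied from the events above.

```python
from datetime import datetime

# Three events copied from the log above (newest first, as in the issue).
EVENTS = """\
I 2020-08-28T13:44:47Z Started container zoekt-webserver
W 2020-08-28T13:44:30Z Readiness probe failed: Get http://10.164.7.104:6070/healthz: read tcp 10.164.7.1:33362->10.164.7.104:6070: read: connection reset by peer
I 2020-08-28T13:44:24Z Stopping container zoekt-webserver
"""


def parse_event(line):
    """Split one event line into (timestamp, severity, message)."""
    severity, stamp, message = line.split(" ", 2)
    when = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
    return when, severity, message


def probe_failures_before_start(text, container):
    """Return readiness probe failures logged before the container started."""
    events = sorted(parse_event(line) for line in text.splitlines() if line.strip())
    starts = [when for when, _, msg in events if msg == f"Started container {container}"]
    if not starts:
        return []
    first_start = min(starts)
    return [
        (when, msg)
        for when, _, msg in events
        if msg.startswith("Readiness probe failed") and when < first_start
    ]


if __name__ == "__main__":
    for when, msg in probe_failures_before_start(EVENTS, "zoekt-webserver"):
        print(when.isoformat(), msg)
```

Any failure this returns was recorded before the new container had started, so it cannot be caused by that container being slow to become ready.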
Upstream bug: https://github.com/kubernetes/kubernetes/issues/52817