Kubernetes probes report failures when doing deployments

Created by: pecigonzalo

When doing a deployment, some pods report readiness probe failures, for example: "Pod indexed-search-4 in prod: Readiness probe failed: Get http://10.164.7.104:6070/healthz: read tcp 10.164.7.1:33362->10.164.7.104:6070: read: connection reset by peer". At first we attributed this to the initial readiness probes, so we raised the probe limit in https://github.com/sourcegraph/deploy-sourcegraph-dot-com/pull/3289, since our nodes take around 60s to start serving traffic. This greatly reduced the number of failure events, but we still get a single event per pod.

While investigating further, we noticed that the "Readiness probe failed" events are logged before the "Started container" events, and we found a Kubernetes bug that matches the behavior we are experiencing. The events below are listed newest first:

I 2020-08-28T13:44:47Z Started container zoekt-webserver 
I 2020-08-28T13:44:47Z Created container zoekt-webserver 
I 2020-08-28T13:44:47Z Container image "index.docker.io/sourcegraph/indexed-searcher:insiders@sha256:d2e87635cf48c4c5d744962540751022013359bc00a9fb8e1ec2464cc6a0a2b8" already present on machine 
I 2020-08-28T13:44:39Z Successfully assigned prod/indexed-search-4 to gke-cloud-cloud-node-pool-8dac0f63-l3wk 
W 2020-08-28T13:44:30Z Readiness probe failed: Get http://10.164.7.104:6070/healthz: read tcp 10.164.7.1:33362->10.164.7.104:6070: read: connection reset by peer 
W 2020-08-28T13:44:30Z Readiness probe failed: Get http://10.164.7.104:6070/healthz: net/http: request canceled (Client.Timeout exceeded while awaiting headers) 
I 2020-08-28T13:44:24Z Stopping container zoekt-indexserver 
I 2020-08-28T13:44:24Z Stopping container jaeger-agent 
I 2020-08-28T13:44:24Z Stopping container zoekt-webserver 
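
To make the ordering easier to check, here is a minimal Python sketch that sorts the events above oldest first and flags readiness probe failures that fall between a "Stopping container" event and the next "Started container" event. The function names (`parse_events`, `stale_probe_failures`) and the idea of treating that window as "stale" are assumptions made for illustration; they are not part of Kubernetes or of any tooling mentioned in this issue, and the timestamps only have one-second resolution.

```python
from datetime import datetime

# Event lines copied from the listing above (newest first).
EVENTS = """\
I 2020-08-28T13:44:47Z Started container zoekt-webserver
I 2020-08-28T13:44:47Z Created container zoekt-webserver
I 2020-08-28T13:44:47Z Container image "index.docker.io/sourcegraph/indexed-searcher:insiders@sha256:d2e87635cf48c4c5d744962540751022013359bc00a9fb8e1ec2464cc6a0a2b8" already present on machine
I 2020-08-28T13:44:39Z Successfully assigned prod/indexed-search-4 to gke-cloud-cloud-node-pool-8dac0f63-l3wk
W 2020-08-28T13:44:30Z Readiness probe failed: Get http://10.164.7.104:6070/healthz: read tcp 10.164.7.1:33362->10.164.7.104:6070: read: connection reset by peer
W 2020-08-28T13:44:30Z Readiness probe failed: Get http://10.164.7.104:6070/healthz: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
I 2020-08-28T13:44:24Z Stopping container zoekt-indexserver
I 2020-08-28T13:44:24Z Stopping container jaeger-agent
I 2020-08-28T13:44:24Z Stopping container zoekt-webserver
"""


def parse_events(text):
    """Return (timestamp, severity, message) tuples sorted oldest first."""
    events = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line is "<severity> <timestamp> <message>".
        severity, stamp, message = line.split(" ", 2)
        ts = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
        events.append((ts, severity, message))
    return sorted(events, key=lambda e: e[0])


def stale_probe_failures(events):
    """Hypothetical heuristic: probe failures seen after a container was
    told to stop and before a container is reported as started again."""
    stale = []
    stopping = False
    for ts, _severity, message in events:
        if message.startswith("Stopping container"):
            stopping = True
        elif message.startswith("Started container"):
            stopping = False
        elif stopping and message.startswith("Readiness probe failed"):
            stale.append((ts, message))
    return stale


if __name__ == "__main__":
    for ts, message in stale_probe_failures(parse_events(EVENTS)):
        print(ts.isoformat(), message)
```

Run against the events above, the sketch flags both 13:44:30 probe failures, since they fall after the 13:44:24 "Stopping container" events and before the 13:44:47 "Started container" event.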

Upstream bug: https://github.com/kubernetes/kubernetes/issues/52817