
insights: retry queries that encounter `shard-timeout` events

Warren Gifford requested to merge insights/shard-timeout-skipped-to-alert into main

Created by: leonore

closes #36226 (closed)

Also adds logging of skipped reasons in the JIT path, which previously had none.

Test plan

Artificially added a shard timeout event in search's progress handler (couldn’t figure out how to hit it locally).

	skipped := []Skipped{
		{
			Reason:  ShardTimeout,
			Title:   "shard-timeout-leo",
			Message: "i'm fake!",
		},
	}

Observed it from search: Screenshot 2022-05-31 at 14 27 59

Observed it from JIT insight: Screenshot 2022-05-31 at 14 32 40

I then created a backend insight and saw the error in the insights_query_runner_job queue. The job eventually succeeded on retry, once the artificial event was removed and the retry period had passed. Screenshot 2022-05-31 at 15 35 40

Once this lands, there will be a post-merge test on k8s.sgdev to see whether backfill works better.
