Skip to content

Use TTL for HTTP cache to stop Redis running OOM

Warren Gifford requested to merge http_cache_ttl into master

Created by: mrnugget

(See this Slack message to get the full context)

This changes the httputil.Cache that's used by repo-updater when doing HTTP requests to set an expiration date on the cache entries it adds to Redis.

Why? Read on...

We discovered that on Sourcegraph.com the redis-cache POD would crash because it ran out of memory.

Redis' INFO command reported that only 212 out of 650k keys have an expiration date:

db0:keys=656993,expires=212,avg_ttl=2489492

That's interesting for a Redis instance that's used for caching purposed. And as it turns out, there are a few places in the codebase that use our internal rcache package to use Redis as a cache. Most of them use rcache.NewWithTTL which sets a TTL on every cache entry.

But the httputil package uses rcache.New — without a TTL. Everything that's added through that caching layer is persisted forever in Redis if not deleted manually.

Where is it used? repo-updater uses it in its standard HTTP client factory:

https://sourcegraph.sgdev.org/sourcegraph/sourcegraph@9c5ba9f/-/blob/cmd/repo-updater/repos/util.go#L37

The resulting client is then used when talking to external services (GitHub, Amazon CodeCommit, etc.).

That explains where the majority of the keys in Sourcegraph.com instance comes from. They all have the v1:http: prefix used by the httputil.Cache.

That also means that the memory usage of Redis is basically unbounded. Granted, there are a few places where the used httpcache deletes something from the cache:

https://sourcegraph.com/search?q=repo:%5Egithub%5C.com/gregjones/httpcache%24+t.Cache.Delete

But that depends on the actual cache headers sent in the HTTP responses that are being cached and it only happens for requests that have been done twice. So one-off requests stay in Redis forever. Others potentially forever.


Two questions to reviewers:

  • Could this explain why customers were seeing memory usage grow unbounded on their instances?
  • Why did this cache not use a TTL even though the comment above it mentions a TTL of a week?

(cc @nicksnyder @beyang since they might have historical context, since the commit that added this got lost)

Merge request reports

Loading