Redis AOF file grows unbounded due to frequent container restarts
Created by: nicksnyder
Issue description written by @tsenart and copied here from another private issue.
Background
$CUSTOMER had a P0 incident because the Redis AOF file was not getting compacted as it should. The cause appears to be frequent container restarts.
The default Redis configuration for when the AOF file should get compacted is:
# Automatic rewrite of the append only file.
# Redis is able to automatically rewrite the log file implicitly calling
# BGREWRITEAOF when the AOF log size grows by the specified percentage.
#
# This is how it works: Redis remembers the size of the AOF file after the
# latest rewrite (if no rewrite has happened since the restart, the size of
# the AOF at startup is used).
#
# This base size is compared to the current size. If the current size is
# bigger than the specified percentage, the rewrite is triggered. Also
# you need to specify a minimal size for the AOF file to be rewritten, this
# is useful to avoid rewriting the AOF file even if the percentage increase
# is reached but it is still pretty small.
#
# Specify a percentage of zero in order to disable the automatic AOF
# rewrite feature.
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
The clause "if no rewrite has happened since the restart, the size of the AOF at startup is used" is of particular concern and would explain what we're seeing. If the Docker container restarts frequently enough, Redis never gets the chance to grow the AOF by 100% over its base size, so no rewrite is triggered. The next time it starts up, it takes the AOF file size at that point as the new base, which it then needs to double again. This can go on ad nauseam, theoretically.
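To make the failure mode concrete, here is a rough simulation of the trigger as described in the config comment above. This is a simplified model, not the Redis source: the function names, the sizes, and the assumption that every run writes the same amount before the container restarts are all made up for illustration.

```python
MB = 1024 * 1024


def should_rewrite(current_size, base_size, percentage=100, min_size=64 * MB):
    """Simplified model of the auto-aof-rewrite check (not the Redis source)."""
    if percentage == 0 or current_size <= min_size:
        return False
    base = base_size if base_size else 1
    growth = current_size * 100 // base - 100
    return growth >= percentage


def simulate(runs, written_per_run, start_size, compacted_size, restart_each_run):
    """Return the AOF size after `runs` periods of writes.

    restart_each_run=True models a container that restarts after every
    period, so the base size is reset to the AOF size found at startup.
    """
    size = start_size
    base = start_size
    for _ in range(runs):
        if restart_each_run:
            base = size  # base size is re-read from the file at startup
        size += written_per_run
        if should_rewrite(size, base):
            size = compacted_size  # hypothetical size after compaction
            base = size
    return size


if __name__ == "__main__":
    for restarting in (True, False):
        final = simulate(10, 50 * MB, 100 * MB, 80 * MB, restarting)
        print(f"restart_each_run={restarting}: final AOF size {final // MB} MB")
```

With these made-up numbers (100 MB AOF at first start, 50 MB written per run), the restarting case ends at 600 MB without ever rewriting, while the long-running case keeps compacting back to 80 MB.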
Remediation
$CUSTOMER is setting up monitoring of disk space usage and has a working playbook for manually triggering AOF file compaction.
Resolution
As far as I understand it, this seems like a design issue in Redis and it should only affect our Docker image deployment because of all the restarts that other processes can trigger.
Action Items:
- Create a work-around on our side: instead of relying on automatic background compaction, we can trigger it every time Redis comes up. This can be done via the sourcegraph-frontend, or through a redis run script in sourcegraph/server.
- Open an issue (or find one) in Redis upstream.
- Figure out why the Docker container is restarting so often. See the Relevant Logs section for a possible hint.
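A minimal sketch of what the first work-around could look like, assuming Redis listens on 127.0.0.1:6379 without authentication. It speaks the Redis wire protocol directly over a socket so it needs no client library. The helper names, retry count, and delay are hypothetical, and this is not how sourcegraph-frontend or the sourcegraph/server run script actually does it.

```python
import socket
import time


def encode_command(*args: str) -> bytes:
    """Encode a command as a RESP array of bulk strings."""
    parts = [b"*%d\r\n" % len(args)]
    for arg in args:
        data = arg.encode()
        parts.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(parts)


def send_command(host: str, port: int, *args: str, timeout: float = 5.0) -> bytes:
    """Send one command and return the first chunk of the reply."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(encode_command(*args))
        return conn.recv(4096)


def rewrite_aof_on_startup(host="127.0.0.1", port=6379, attempts=30, delay=1.0) -> bool:
    """Ask Redis to compact the AOF once it accepts the command.

    Retries while Redis is not reachable yet or answers with an error
    reply (for example while it is still loading the dataset).
    """
    for _ in range(attempts):
        try:
            reply = send_command(host, port, "BGREWRITEAOF")
        except OSError:
            reply = b""  # not listening yet, or connection dropped
        if reply.startswith(b"+"):
            return True  # simple-string reply: rewrite started or scheduled
        time.sleep(delay)
    return False


if __name__ == "__main__":
    ok = rewrite_aof_on_startup()
    print("AOF rewrite requested" if ok else "could not request AOF rewrite")
```

Running this once after every Redis start means compaction no longer depends on the process staying up long enough for the AOF to double in size.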
Incident Log
https://sourcegraph.slack.com/archives/GDS3PJ8GK/p1554732996001200
Relevant Logs
Similarly to what $CUSTOMER is seeing, our Dogfood docker deployment has seen occurrences where the container crashes for no apparent reason, followed by Redis errors on startup.