Removing intermediate caching on frontend for raw endpoint
Created by: keegancsmith
I added some instrumentation to the raw endpoint on Sourcegraph.com to see how it is used. Over the last 3 days, effectively only the archive endpoint was used. Additionally, only the root path seems to be requested.
I then mined some recent logs to see how often the same repository is fetched. It turns out the long tail of repositories is fetched only once, and some repos were fetched up to 4 times.
All this evidence together tells me that storing the archive on disk in the frontend is not worth the cost. We can stream the archive directly from gitserver instead. This does mean we won't set the Content-Length HTTP header, since the size of the archive is not known before streaming starts.
Considerations:
- Does removing Content-Length cause issues?
- Is Sourcegraph.com representative of extension use? Do our customers use extensions which much more frequently access things other than the archive?
Pros:
- Removes the main use of the frontend volume. (Need to confirm whether it's the only user.) If we can remove the volume, it simplifies deployment. (The volume has been a source of issues at multiple customers.)
- Faster responses.
Requests per second over 5min for the last 3 days:
Requests are dominated by rootarchive. Here is a graph of non-archive requests. A rate isn't that useful here, so it shows the number of requests over 5min for the last 3 days. Notice we see bursts of up to 100 requests over 5min for files that 404:
Below is a histogram. For each repo we count how many times it was requested, then count how many repos share each request count. So 126 repos are requested only once, 13 repos are requested 2 times, etc.
$ kubectl logs -lapp=sourcegraph-frontend -c frontend --tail=5000 > frontend.log
$ grep 'raw endpoint sending archive' frontend.log | grep -o 'repo=[^ ]*' | sort | uniq -c | awk '{ print $1 }' | sort | uniq -c
126 1
13 2
7 3
4 4
Here are the PromQL queries used for the two graphs above (request rate by type, and non-archive request counts):
sum by (type) (rate(src_http_raw_endpoint_total[5m]))
sum by (type) (delta(src_http_raw_endpoint_total{type!~".*archive"}[5m]))