status-indicator: Show actual cloning state
Created by: mrnugget
This changes the status indicator from reporting whether the repositories are updating to including their actual cloning status.
This is an experiment, since the cloning status is computed with every request, every 5 seconds, by loading the complete list of repository names from the database and asking the gitserver shards whether they are cloned. Each gitserver instance then has to do a maximum of two `os.Stat` calls per repository name.
The big upside is that this reports the actual "is this repository cloned" status. A tiny asterisk: just because a repository is currently not cloned doesn't mean it's cloning. But that could be fixed by requesting a new update for each uncloned repository we come across.
Questions/TODOs:

- the metrics for the new `ListAllRepoNames` don't show up in `repo-updater`'s `/metrics` endpoint
- the code in the `gitserver.Client` method `AreReposCloned` is duplicated from `RepoInfo` since it needs to use the same sharding logic. This should probably be refactored
- No tests
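The shared sharding logic mentioned above boils down to "hash the repo name, pick a shard". A minimal sketch, where the hashing scheme and function name are assumptions rather than the exact `gitserver.Client` code:

```go
package main

import (
	"crypto/md5"
	"encoding/binary"
)

// addrForRepo deterministically maps a repository name to one gitserver
// address. Every client method that talks to a shard must use the same
// mapping, which is why AreReposCloned currently duplicates it from RepoInfo.
func addrForRepo(name string, addrs []string) string {
	sum := md5.Sum([]byte(name))
	return addrs[binary.BigEndian.Uint64(sum[:8])%uint64(len(addrs))]
}
```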
Created by: tsenart
@mrnugget: Regarding the metrics, you need to register them: https://github.com/sourcegraph/sourcegraph/blob/4b99c384c314b34640c9e8f597055a09d815dd85/cmd/repo-updater/main.go#L70
Created by: mrnugget
I ran a few tests on my MacBook to get ballpark figures for how fast/slow it is to ask gitserver about the cloned status for multiple thousands of repositories.
First I created dummy request data that makes `gitserver` check each repository name with the maximum of two `os.Stat` calls, since none of the repositories exist on my local disk:

```shell
ruby -e 'require "json"; puts JSON.dump({"Repos" => (0..30_000).map {|i| "github.com/testing/repo-#{i}"}})' | jq . > ~/tmp/30_000_repos_are_cloned.json
```
Then I sent that to the local `gitserver` instance (no sharding!):

```shell
time curl -s 'http://localhost:3178/are-repos-cloned' -X POST -H 'Content-Type: application/json' --data @/Users/thorstenball/tmp/30_000_repos_are_cloned.json >/dev/null
```

Here are the results for 10+ runs:

```
0.01s user 0.01s system 3% cpu 0.462 total
0.01s user 0.01s system 5% cpu 0.329 total
0.01s user 0.01s system 5% cpu 0.329 total
0.01s user 0.01s system 5% cpu 0.327 total
0.01s user 0.01s system 6% cpu 0.340 total
0.01s user 0.01s system 5% cpu 0.354 total
0.01s user 0.00s system 4% cpu 0.325 total
0.01s user 0.01s system 4% cpu 0.334 total
0.01s user 0.01s system 4% cpu 0.331 total
0.01s user 0.01s system 6% cpu 0.311 total
0.01s user 0.01s system 5% cpu 0.311 total
0.01s user 0.01s system 5% cpu 0.299 total
0.01s user 0.01s system 4% cpu 0.303 total
0.01s user 0.01s system 6% cpu 0.312 total
```
That doesn't seem too bad, especially since sharding could reduce this (not sure though, since there's the added overhead of additional requests).
On top of that, `repo-updater` needs to load the names from the database.

Created by: codecov[bot]
Codecov Report
Merging #4591 into master will decrease coverage by 0.01%. The diff coverage is 40.86%.

```
@@            Coverage Diff             @@
##           master    #4591      +/-   ##
==========================================
- Coverage   47.45%   47.43%   -0.02%
==========================================
  Files         722      722
  Lines       43828    43917      +89
  Branches     1752     1752
==========================================
+ Hits        20797    20833      +36
- Misses      21056    21106      +50
- Partials     1975     1978       +3
```

| Impacted Files | Coverage Δ |
| --- | --- |
| cmd/repo-updater/repos/store.go | 83.28% <0%> (-3.31%) |
| web/src/nav/StatusMessagesNavItem.tsx | 86.95% <100%> (ø) |
| cmd/repo-updater/repos/testing.go | 82.5% <12.5%> (-2.2%) |
| cmd/gitserver/server/server.go | 45.4% <6.66%> (-0.86%) |
| cmd/repo-updater/repoupdater/server.go | 63.04% <62.5%> (-0.42%) |
| cmd/repo-updater/repos/observability.go | 92.4% <62.5%> (-3.23%) |
| ...d/frontend/internal/authz/bitbucketserver/store.go | 76.22% <0%> (+1.39%) |
Created by: mrnugget
On `k8s.sgdev.org` it takes ~6 seconds for one status message. Example trace: https://app.lightstep.com/sourcegraph-dev/trace?span_guid=39b1467c99d8683b&at_micros=1560946392318752#span-39b1467c99d8683b
I guess that could/should be faster.
The problem, though, is that apparently 20 repositories are never cloned:
Created by: mrnugget
The problem is that we're saying "Currently cloning" in the message, even though these repositories are only enqueued for future cloning/updating.
I could open the repository pages of 18 of the 20 repositories and they were cloned. The other two simply could not be cloned due to errors like this:

```
error cloning repo" repo=github.com/gorilla/websocket err="error cloning repo: repo github.com/gorilla/websocket (ssh://git@phabricator.sgdev.org/diffusion/5/web.git) not cloneable: context deadline exceeded (output follows)\n\nssh: connect to host phabricator.sgdev.org port 22: Network unreachable\r\n
```
I then deleted the "broken" Phabricator external service and now we're down to "all repositories up to date".
Created by: mrnugget
I added a small in-memory cache for the "number of not-cloned repos" to repo-updater with a TTL of 30 seconds.
From my side, this is ready to review and also ready to merge.
But please note: the behavior of the number changed slightly, as noted in the original PR description. When it now says "X repositories enqueued for cloning..." that doesn't mean that they are currently cloning, but that they will be cloned at some point. Or in other words: the number matches the repositories in the repositories-page when the "not cloned" filter is activated, not the "Cloning" filter.
@christinaforney Note the textual changes in the diff. Since this now computes the actual number of repositories that have not been cloned yet, I changed the text to reflect that.
Created by: mrnugget
@tsenart @keegancsmith Little thought I had waking up: we should probably bust this new cache when external services are synced.
And another thought I had that I want to investigate when I'm back: since we only want the number of not-cloned repos, how fast would it be to do a `SELECT COUNT(*) FROM repo;` on `repo-updater` and then either (a) use the existing `list?cloned` endpoint on the `gitserver` replicas (which walks through the directory to build a list of repos) or (b) use a new `number-of-cloned-repos` endpoint that also walks through the directory but only keeps count and doesn't return a huge list. My thinking is that walking through the file system might be slow, but we'd have a lot less pressure on the GC when we stop sending around lists of tens of thousands of repository names.

Created by: keegancsmith
> @tsenart @keegancsmith Little thought I had waking up: we should probably bust this new cache when external services are synced.
The cache expires in 30s. That should be short enough that it's fine?
> And another thought I had that I want to investigate when I'm back: since we only want the number of not-cloned repos, how fast would it be to do a `SELECT COUNT(*) FROM repo;` on `repo-updater` and then either (a) use the existing `list?cloned` endpoint on the `gitserver` replicas (which walks through the directory to build a list of repos) or (b) use a new `number-of-cloned-repos` endpoint that also walks through the directory but only keeps count and doesn't return a huge list. My thinking is that walking through the file system might be slow, but we'd have a lot less pressure on the GC when we stop sending around lists of tens of thousands of repository names.

Yup, that makes sense to me. But that could also be a follow-up PR since this works! :)