
status-indicator: Show actual cloning state

Created by: mrnugget

This changes the status indicator from reporting whether the repositories are updating to including their actual cloning status.

This is an experiment, since the cloning status is computed on every request, every 5 seconds, by loading the complete list of repository names from the database and asking the gitserver shards whether they are cloned. Each gitserver instance then has to do a maximum of 2 os.Stat calls per repository name.

The big upside is that this reports the actual "is this repository cloned" status. A tiny asterisk: just because a repository is currently not cloned doesn't mean it's cloning. But that could be fixed by requesting a new update for each uncloned repository we come across.

Questions/TODOs:

  • the metrics for the new ListAllRepoNames don't show up in repo-updater's /metrics endpoint
  • the code in the gitserver.Client method AreReposCloned is duplicated from RepoInfo since it needs to use the same sharding logic. This should probably be refactored
  • No tests

Activity
  • Created by: tsenart

    @mrnugget: Regarding the metrics, you need to register them: https://github.com/sourcegraph/sourcegraph/blob/4b99c384c314b34640c9e8f597055a09d815dd85/cmd/repo-updater/main.go#L70

  • Created by: mrnugget

    I ran a few tests on my MacBook to get ballpark figures for how fast/slow it is to ask gitserver about the cloned status for multiple thousands of repositories.

    First I created dummy request data that makes gitserver check each repository name with the maximum of two os.Stat calls, since none of the repositories exist on my local disk:

    ruby -e 'require "json"; puts JSON.dump({"Repos" => (0..30_000).map {|i| "github.com/testing/repo-#{i}"}})' | jq . > ~/tmp/30_000_repos_are_cloned.json

    Then I sent that to the local gitserver instance (no sharding!):

    time curl -s 'http://localhost:3178/are-repos-cloned' -X POST -H 'Content-Type: application/json' --data @/Users/thorstenball/tmp/30_000_repos_are_cloned.json >/dev/null

    Here are the results from 14 runs:

    0.01s user 0.01s system 3% cpu 0.462 total
    0.01s user 0.01s system 5% cpu 0.329 total
    0.01s user 0.01s system 5% cpu 0.329 total
    0.01s user 0.01s system 5% cpu 0.327 total
    0.01s user 0.01s system 6% cpu 0.340 total
    0.01s user 0.01s system 5% cpu 0.354 total
    0.01s user 0.00s system 4% cpu 0.325 total
    0.01s user 0.01s system 4% cpu 0.334 total
    0.01s user 0.01s system 4% cpu 0.331 total
    0.01s user 0.01s system 6% cpu 0.311 total
    0.01s user 0.01s system 5% cpu 0.311 total
    0.01s user 0.01s system 5% cpu 0.299 total
    0.01s user 0.01s system 4% cpu 0.303 total
    0.01s user 0.01s system 6% cpu 0.312 total

    That doesn't seem too bad, especially since sharding could reduce this (though I'm not sure, since sharding adds the overhead of additional requests).

    On top of that, repo-updater also needs to load the names from the database.

  • Created by: codecov[bot]

    Codecov Report

    Merging #4591 into master will decrease coverage by 0.01%. The diff coverage is 40.86%.

    @@            Coverage Diff             @@
    ##           master    #4591      +/-   ##
    ==========================================
    - Coverage   47.45%   47.43%   -0.02%     
    ==========================================
      Files         722      722              
      Lines       43828    43917      +89     
      Branches     1752     1752              
    ==========================================
    + Hits        20797    20833      +36     
    - Misses      21056    21106      +50     
    - Partials     1975     1978       +3
    Impacted Files                                       | Coverage Δ
    cmd/repo-updater/repos/store.go                      | 83.28% <0%> (-3.31%) ↓
    web/src/nav/StatusMessagesNavItem.tsx                | 86.95% <100%> (ø)
    cmd/repo-updater/repos/testing.go                    | 82.5% <12.5%> (-2.2%) ↓
    cmd/gitserver/server/server.go                       | 45.4% <6.66%> (-0.86%) ↓
    cmd/repo-updater/repoupdater/server.go               | 63.04% <62.5%> (-0.42%) ↓
    cmd/repo-updater/repos/observability.go              | 92.4% <62.5%> (-3.23%) ↓
    ...d/frontend/internal/authz/bitbucketserver/store.go | 76.22% <0%> (+1.39%) ↑
  • Created by: mrnugget

    On k8s.sgdev.org it takes ~6 seconds for one status message:

    (screenshot: Screen Shot 2019-06-19 at 14 12 52)

    Example trace: https://app.lightstep.com/sourcegraph-dev/trace?span_guid=39b1467c99d8683b&at_micros=1560946392318752#span-39b1467c99d8683b

    I guess that could/should be faster.

    The problem, though, is that 20 repositories apparently never finish cloning:

    (screenshot: Screen Shot 2019-06-19 at 14 13 21)
  • Created by: mrnugget

    The problem is that we're saying "Currently cloning" in the message, even though these repositories are only enqueued for future cloning/updating.

    I opened the repository pages of 18 of the 20 repositories and they were cloned. The two others simply could not be cloned, due to errors like this:

    error cloning repo" repo=github.com/gorilla/websocket err="error cloning repo: repo github.com/gorilla/websocket (ssh://git@phabricator.sgdev.org/diffusion/5/web.git) not cloneable: context deadline exceeded (output follows)\n\nssh: connect to host phabricator.sgdev.org port 22: Network unreachable\r\n

    I then deleted the "broken" Phabricator external service and now we're down to "all repositories up to date".

  • Created by: tsenart

    I agreed with @mrnugget to bring this PR to completion tomorrow since he's going to be off.

  • Created by: mrnugget

    I added a small in-memory cache for the "number of not-cloned repos" to repo-updater with a TTL of 30 seconds.

    From my side, this is ready to review and also ready to merge.

    But please note: the behavior of the number changed slightly, as noted in the original PR description. When it now says "X repositories enqueued for cloning..." that doesn't mean they are currently cloning, but that they will be cloned at some point. In other words: the number matches the repositories shown on the repositories page with the "not cloned" filter activated, not the "Cloning" filter.


    @christinaforney Note the textual changes in the diff. Since this now computes the actual number of repositories that have not been cloned yet, I changed the text to reflect that.

  • Created by: christinaforney

    @mrnugget - I like your wording change! I think it's important to be accurate in our descriptions as to what is going on, and this does that.

  • Created by: mrnugget

    @tsenart @keegancsmith Little thought I had waking up: we should probably bust this new cache when external services are synced.


    And another thought I had that I want to investigate when I'm back: since we only want the number of not-cloned repos, how fast would it be to do a SELECT COUNT(*) FROM repo; on repo-updater and then either (a) use the existing list?cloned endpoint on the gitserver replicas (which walks through the directory to build a list of repos) or (b) add a new number-of-cloned-repos endpoint that also walks through the directory but only keeps a count and doesn't return a huge list. My thinking is that walking through the file system might be slow, but we'd put a lot less pressure on the GC when we stop sending around lists of tens of thousands of repository names.

  • Created by: tsenart

    @mrnugget: Since you have a few ideas you'd like to pursue further in this change set, and I have to focus on load testing Bitbucket Server ACLs today, would you feel comfortable picking this back up next week when you're back?

  • Created by: keegancsmith

    > @tsenart @keegancsmith Little thought I had waking up: we should probably bust this new cache when external services are synced.

    The cache expires in 30s. That should be short enough that its fine?

    > And another thought I had that I want to investigate when I'm back: since we only want the number of not-cloned repos, how fast would it be to do a SELECT COUNT(*) FROM repo; on repo-updater and then either (a) use the existing list?cloned endpoint on the gitserver replicas (which walks through the directory to build a list of repos) or (b) use a new number-of-cloned-repos endpoint that also walks through the directory but only keeps count and doesn't return a huge list. My thinking is that walking through the file system might be slow, but we'd have a lot less pressure on the GC when we stop sending around lists of tens of thousands of repository names.

    Yup that makes sense to me. But that could also be a follow-up PR since this works! :)

  • Created by: mrnugget

    Thanks for the changes @keegancsmith and bringing this over the finishing line :)
