WIP: experiment with custom git archive command #424

Draft
wants to merge 29 commits into
base: main

Conversation

keegancsmith
Member

The goal is to assess the viability of replacing git archive with a format and command optimized to send only what Sourcegraph cares about.

@keegancsmith
Member Author

For the last few days I've been experimenting with an alternative to git archive which is aware of Sourcegraph's ignore policies (.sourcegraph/ignore and large file limits). Additionally, I wanted to make it aware of diffing trees, so that we could end up with a fast way to get just what has changed. This has been implemented with go-git. The WIP code is at zoekt#424.
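A minimal sketch of the kind of per-file filter described above, for illustration only. The `entry` and `filter` types, their field names, and the prefix-based matching are assumptions; the actual .sourcegraph/ignore semantics and the code in the PR may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// entry is a hypothetical stand-in for a blob found while walking a tree.
type entry struct {
	Path string
	Size int64
}

// filter holds the two policies mentioned above. Prefix matching is an
// assumption made for this sketch, not the real ignore-file semantics.
type filter struct {
	IgnorePrefixes []string // paths under these prefixes are skipped
	MaxFileSize    int64    // blobs larger than this are skipped; 0 disables the limit
}

// keep reports whether the entry should be written to the output stream.
func (f filter) keep(e entry) bool {
	if f.MaxFileSize > 0 && e.Size > f.MaxFileSize {
		return false
	}
	for _, p := range f.IgnorePrefixes {
		if strings.HasPrefix(e.Path, p) {
			return false
		}
	}
	return true
}

func main() {
	f := filter{IgnorePrefixes: []string{"vendor/"}, MaxFileSize: 1 << 20}
	entries := []entry{
		{Path: "cmd/main.go", Size: 512},
		{Path: "vendor/lib/x.go", Size: 128},
		{Path: "assets/big.bin", Size: 5 << 20},
	}
	for _, e := range entries {
		fmt.Println(e.Path, f.keep(e))
	}
}
```

Applying the filter while walking the tree, rather than after producing the archive, is what lets the command avoid reading excluded blobs at all.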

go-git just seems to be slow. It is over twice as slow, even though it ends up needing to unmarshal far fewer objects. This approach is likely worth exploring further though, since I suspect the runtime would scale with the size of the output if we were as fast as git.

See this table for comparison

repo         output(git)  output(sg)  time(git)  time(sg)
megarepo     3.47GB       2.83GB      52s        132s
sourcegraph  145MB        96MB        1.1s       2s

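Note: megarepo was recorded on the git-combine pod; sourcegraph was recorded on my macbook. The megarepo times are user CPU time (wall clock was 1m 8.88s for git archive and 2m 26.00s for git-sg), while the sourcegraph times are wall clock.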

Profiling turned up some surprising things. For example, 12% of the time is spent in Packfile.Close. This tells me there is likely no state keeping packfiles open, which means we are likely paying a huge cost per object just opening and looking inside packfiles. Hopefully I can adjust my usage of the API. Alternatively, I could introduce more state into go-git for performance.

Another approach I was considering was writing this command in Rust. In the past I wrote a small program using Rust's bindings to libgit2 and it was pleasant.

GIT_DIR=$PWD /usr/bin/time -v git archive HEAD 2> git-archive.time | wc -c > git-archive.size
GIT_DIR=$PWD /usr/bin/time -v git-sg 2> git-sg.time | wc -c > git-sg.size
3 473 141 760
  Command being timed: "git archive HEAD"
  User time (seconds): 52.32
  System time (seconds): 9.97
  Percent of CPU this job got: 90%
  Elapsed (wall clock) time (h:mm:ss or m:ss): 1m 8.88s
  Average shared text size (kbytes): 0
  Average unshared data size (kbytes): 0
  Average stack size (kbytes): 0
  Average total size (kbytes): 0
  Maximum resident set size (kbytes): 8755968
  Average resident set size (kbytes): 0
  Major (requiring I/O) page faults: 0
  Minor (reclaiming a frame) page faults: 867731
  Voluntary context switches: 301216
  Involuntary context switches: 88
  Swaps: 0
  File system inputs: 0
  File system outputs: 8
  Socket messages sent: 0
  Socket messages received: 0
  Signals delivered: 0
  Page size (bytes): 4096
  Exit status: 0

2 839 758 848
  Command being timed: "../git-sg"
  User time (seconds): 132.83
  System time (seconds): 11.88
  Percent of CPU this job got: 99%
  Elapsed (wall clock) time (h:mm:ss or m:ss): 2m 26.00s
  Average shared text size (kbytes): 0
  Average unshared data size (kbytes): 0
  Average stack size (kbytes): 0
  Average total size (kbytes): 0
  Maximum resident set size (kbytes): 16032144
  Average resident set size (kbytes): 0
  Major (requiring I/O) page faults: 0
  Minor (reclaiming a frame) page faults: 671416
  Voluntary context switches: 194976
  Involuntary context switches: 272
  Swaps: 0
  File system inputs: 864
  File system outputs: 0
  Socket messages sent: 0
  Socket messages received: 0
  Signals delivered: 0
  Page size (bytes): 4096
  Exit status: 0

Sourcegraph repo on the mac

/usr/bin/time -l git sg | wc -c
      2.09 real         1.41 user         0.57 sys
         333512704  maximum resident set size
                 0  average shared memory size
                 0  average unshared data size
                 0  average unshared stack size
             82341  page reclaims
                 0  page faults
                 0  swaps
                 0  block input operations
                 0  block output operations
                 0  messages sent
                 0  messages received
               181  signals received
              5951  voluntary context switches
              3009  involuntary context switches
          26217999  instructions retired
          22263437  cycles elapsed
            733184  peak memory footprint
96 385 536
/usr/bin/time -l git archive HEAD | wc -c
      1.12 real         0.54 user         0.13 sys
         200888320  maximum resident set size
                 0  average shared memory size
                 0  average unshared data size
                 0  average unshared stack size
             43711  page reclaims
              5613  page faults
                 0  swaps
                 0  block input operations
                 0  block output operations
                 0  messages sent
                 0  messages received
                 0  signals received
              9579  voluntary context switches
               327  involuntary context switches
        3305843152  instructions retired
        2552117752  cycles elapsed
          96268288  peak memory footprint
145 233 920

@keegancsmith
Member Author

Update from yesterday I forgot to post:

I spent a lot of time writing a fun integration with git-cat-file. The code is quite nice and performant, but it still doesn't beat git archive, even though archive sends 1.5x more data (145 MB vs 96 MB). This is on sourcegraph/sourcegraph.

Hyperfine results:

$ hyperfine -w 1 'git archive --worktree-attributes --format=tar HEAD' 'git sg' 'GIT_SG_FILTER=1 git sg' 'GIT_SG_CATFILE=1 git sg'
Benchmark 1: git archive --worktree-attributes --format=tar HEAD
  Time (mean ± σ):     338.5 ms ±   3.3 ms    [User: 310.0 ms, System: 28.4 ms]
  Range (min … max):   335.3 ms … 344.9 ms    10 runs

Benchmark 2: git sg
  Time (mean ± σ):     905.7 ms ±  16.0 ms    [User: 837.3 ms, System: 95.6 ms]
  Range (min … max):   878.1 ms … 926.4 ms    10 runs

Benchmark 3: GIT_SG_FILTER=1 git sg
  Time (mean ± σ):     377.8 ms ±   6.4 ms    [User: 388.1 ms, System: 95.1 ms]
  Range (min … max):   367.6 ms … 388.2 ms    10 runs

Benchmark 4: GIT_SG_CATFILE=1 git sg
  Time (mean ± σ):     451.8 ms ±  10.5 ms    [User: 372.7 ms, System: 155.8 ms]
  Range (min … max):   441.7 ms … 478.6 ms    10 runs

Summary
  'git archive --worktree-attributes --format=tar HEAD' ran
    1.12 ± 0.02 times faster than 'GIT_SG_FILTER=1 git sg'
    1.33 ± 0.03 times faster than 'GIT_SG_CATFILE=1 git sg'
    2.68 ± 0.05 times faster than 'git sg'

Looking at CPU profiles for cat-file, we spend as much time running Info as Contents. To me this is a sign that the overhead of RPC / Info is not worth it. We could look into a queue-like design that sends multiple blob/info requests out before reading, but that seems complicated, and based on the perf numbers I doubt it would be faster than archive.

Final attempt in this experiment: combine git ls-tree -r -l (to get object sizes) with git cat-file for contents only. This does mean we will explore trees which are excluded. That is fine for now, but it is an overhead when thinking about a future with sub-repo perms. Additionally, it won't affect the repo I am testing against, since it has no ignore rules (only size filters).

@keegancsmith
Member Author

Using ls-tree is pretty much the same speed as git archive on the sourcegraph repo. We only skip 7 files in that repo, which means it's hard to beat the speed of git archive.

There is opportunity to make it faster:

  • async send and read of contents from git-cat-file
  • minor: directly use gitCatFileBatchReader and use git cat-file --batch

I did some profiling, and this solution barely generates any garbage, so it is very efficient. This means I'll export the code and integrate it directly into gitserver to try to create an end-to-end demo.

A note on buffering: when testing with hyperfine, adding output buffering slowed it down slightly. I wonder, though, whether the buffer will matter more in practice, since the output will go over the network rather than to /dev/null.

$ hyperfine -w 1 'git archive --worktree-attributes --format=tar HEAD' 'git sg' 'GIT_SG_FILTER=1 git sg' 'GIT_SG_CATFILE=1 git sg' 'GIT_SG_LSTREE=1 git sg'
Benchmark 1: git archive --worktree-attributes --format=tar HEAD
  Time (mean ± σ):     348.1 ms ±   3.8 ms    [User: 319.4 ms, System: 28.1 ms]
  Range (min … max):   342.0 ms … 353.2 ms    10 runs

Benchmark 2: git sg
  Time (mean ± σ):     921.3 ms ±  12.0 ms    [User: 862.0 ms, System: 91.2 ms]
  Range (min … max):   899.9 ms … 937.4 ms    10 runs

Benchmark 3: GIT_SG_FILTER=1 git sg
  Time (mean ± σ):     385.1 ms ±   7.8 ms    [User: 395.5 ms, System: 93.1 ms]
  Range (min … max):   373.8 ms … 402.2 ms    10 runs

Benchmark 4: GIT_SG_CATFILE=1 git sg
  Time (mean ± σ):     451.4 ms ±   8.3 ms    [User: 383.2 ms, System: 145.2 ms]
  Range (min … max):   439.2 ms … 463.0 ms    10 runs

Benchmark 5: GIT_SG_LSTREE=1 git sg
  Time (mean ± σ):     358.3 ms ±   4.2 ms    [User: 359.0 ms, System: 113.7 ms]
  Range (min … max):   352.6 ms … 367.2 ms    10 runs

Summary
  'git archive --worktree-attributes --format=tar HEAD' ran
    1.03 ± 0.02 times faster than 'GIT_SG_LSTREE=1 git sg'
    1.11 ± 0.03 times faster than 'GIT_SG_FILTER=1 git sg'
    1.30 ± 0.03 times faster than 'GIT_SG_CATFILE=1 git sg'
    2.65 ± 0.05 times faster than 'git sg'

The profile output was really hard to read due to the arbitrary nesting of
calls to writeTree. This introduces a manual stack, but slightly adjusts
the order of output. I'd prefer the normal DFS order to match git
archive, but at least for profiling this is good for now.

This is significantly faster than using go-git.

This will avoid allocations when using it.

We only read the entries field, so this makes it easier to use a
different impl.

And it's pretty much the same speed as git archive on the sourcegraph repo.
We only skip 7 files in that repo, which means it's hard to beat the
speed of git archive.

This is our slowest implementation so far! I believe this is because
gitobj has no caching between parsing packfiles, so it pays the cost on
each object retrieval.