WIP: experiment with custom git archive command #424

keegancsmith · 2022-09-19T13:28:27Z

The goal is to assess the viability of replacing git archive with a format and command optimized to send only what Sourcegraph cares about.

keegancsmith · 2022-09-19T13:33:51Z

For the last few days I've been experimenting with an alternative to git archive that is aware of Sourcegraph's ignore policies (.sourcegraph/ignore and large file limits). I also wanted to make it aware of tree diffs, so that we could end up with a fast way to fetch only what has changed. This has been implemented with go-git. The WIP code is at zoekt#424.

go-git just seems to be slow. It is over twice as slow as git archive, even though it ends up needing to unmarshal far fewer objects. This approach is likely still worth exploring further, since I suspect the runtime would scale with the size of the output if we were as fast as git.

See this table for comparison

repo	output(git)	output(sg)	time(git)	time(sg)
megarepo	3.47GB	2.83GB	52s	132s
sourcegraph	145MB	96MB	1.1s	2s

Note: megarepo was recorded on the git-combine pod. sourcegraph was recorded on my macbook.

Profiling turned up some surprises. For example, 12% of the time is spent in Packfile.Close. This suggests that no state keeps packfiles open between lookups, which means we are likely paying a large per-object cost just opening and looking inside packfiles. Hopefully I can adjust my usage of the API. Alternatively, I could introduce more state into go-git for performance.

The next approach I was considering was writing this command in Rust. In the past I wrote a small program using Rust's bindings for libgit2 and it was pleasant.

GIT_DIR=$PWD /usr/bin/time -v git archive HEAD 2> git-archive.time | wc -c > git-archive.size
GIT_DIR=$PWD /usr/bin/time -v git-sg 2> git-sg.time | wc -c > git-sg.size

3 473 141 760
  Command being timed: "git archive HEAD"
  User time (seconds): 52.32
  System time (seconds): 9.97
  Percent of CPU this job got: 90%
  Elapsed (wall clock) time (h:mm:ss or m:ss): 1m 8.88s
  Average shared text size (kbytes): 0
  Average unshared data size (kbytes): 0
  Average stack size (kbytes): 0
  Average total size (kbytes): 0
  Maximum resident set size (kbytes): 8755968
  Average resident set size (kbytes): 0
  Major (requiring I/O) page faults: 0
  Minor (reclaiming a frame) page faults: 867731
  Voluntary context switches: 301216
  Involuntary context switches: 88
  Swaps: 0
  File system inputs: 0
  File system outputs: 8
  Socket messages sent: 0
  Socket messages received: 0
  Signals delivered: 0
  Page size (bytes): 4096
  Exit status: 0

2 839 758 848
  Command being timed: "../git-sg"
  User time (seconds): 132.83
  System time (seconds): 11.88
  Percent of CPU this job got: 99%
  Elapsed (wall clock) time (h:mm:ss or m:ss): 2m 26.00s
  Average shared text size (kbytes): 0
  Average unshared data size (kbytes): 0
  Average stack size (kbytes): 0
  Average total size (kbytes): 0
  Maximum resident set size (kbytes): 16032144
  Average resident set size (kbytes): 0
  Major (requiring I/O) page faults: 0
  Minor (reclaiming a frame) page faults: 671416
  Voluntary context switches: 194976
  Involuntary context switches: 272
  Swaps: 0
  File system inputs: 864
  File system outputs: 0
  Socket messages sent: 0
  Socket messages received: 0
  Signals delivered: 0
  Page size (bytes): 4096
  Exit status: 0

Sourcegraph repo on the mac

/usr/bin/time -l git sg | wc -c
      2.09 real         1.41 user         0.57 sys
         333512704  maximum resident set size
                 0  average shared memory size
                 0  average unshared data size
                 0  average unshared stack size
             82341  page reclaims
                 0  page faults
                 0  swaps
                 0  block input operations
                 0  block output operations
                 0  messages sent
                 0  messages received
               181  signals received
              5951  voluntary context switches
              3009  involuntary context switches
          26217999  instructions retired
          22263437  cycles elapsed
            733184  peak memory footprint
96 385 536
/usr/bin/time -l git archive HEAD | wc -c
      1.12 real         0.54 user         0.13 sys
         200888320  maximum resident set size
                 0  average shared memory size
                 0  average unshared data size
                 0  average unshared stack size
             43711  page reclaims
              5613  page faults
                 0  swaps
                 0  block input operations
                 0  block output operations
                 0  messages sent
                 0  messages received
                 0  signals received
              9579  voluntary context switches
               327  involuntary context switches
        3305843152  instructions retired
        2552117752  cycles elapsed
          96268288  peak memory footprint
145 233 920

keegancsmith · 2022-09-23T07:49:17Z

Update from yesterday I forgot to post:

I spent a lot of time writing some fun integration with git-cat-file. The code is quite nice and performant, but it still doesn't beat git archive, even though archive sends 1.5x more data (145 MB vs 96 MB). This is on sourcegraph/sourcegraph.

Hyperfine results:

$ hyperfine -w 1 'git archive --worktree-attributes --format=tar HEAD' 'git sg' 'GIT_SG_FILTER=1 git sg' 'GIT_SG_CATFILE=1 git sg'
Benchmark 1: git archive --worktree-attributes --format=tar HEAD
  Time (mean ± σ):     338.5 ms ±   3.3 ms    [User: 310.0 ms, System: 28.4 ms]
  Range (min … max):   335.3 ms … 344.9 ms    10 runs

Benchmark 2: git sg
  Time (mean ± σ):     905.7 ms ±  16.0 ms    [User: 837.3 ms, System: 95.6 ms]
  Range (min … max):   878.1 ms … 926.4 ms    10 runs

Benchmark 3: GIT_SG_FILTER=1 git sg
  Time (mean ± σ):     377.8 ms ±   6.4 ms    [User: 388.1 ms, System: 95.1 ms]
  Range (min … max):   367.6 ms … 388.2 ms    10 runs

Benchmark 4: GIT_SG_CATFILE=1 git sg
  Time (mean ± σ):     451.8 ms ±  10.5 ms    [User: 372.7 ms, System: 155.8 ms]
  Range (min … max):   441.7 ms … 478.6 ms    10 runs

Summary
  'git archive --worktree-attributes --format=tar HEAD' ran
    1.12 ± 0.02 times faster than 'GIT_SG_FILTER=1 git sg'
    1.33 ± 0.03 times faster than 'GIT_SG_CATFILE=1 git sg'
    2.68 ± 0.05 times faster than 'git sg'

Looking at CPU profiles for cat-file, we spend as much time running Info as Contents. To me this is a sign that the overhead of RPC / Info is not worth it. We could look into a queue-like design that sends multiple blob/info requests out before reading, but that seems complicated, and based on the perf I doubt it will be faster than archive.

Final attempt in this experiment: mix git ls-tree -r -l (to get object sizes) with git cat-file for contents only. This does mean we will explore trees which are excluded. This is fine for now, but is an overhead when thinking about a future with sub-repo perms. Additionally, it won't affect the repo I am testing against, since it has no ignore rules (only size filters).

keegancsmith · 2022-09-23T07:53:08Z

Using ls-tree is pretty much the same speed as git archive on the sourcegraph repo. We only skip 7 files in that repo, which means it's hard to beat the speed of git archive.

There is opportunity to make it faster:

async send and read of contents from git-cat-file
minor: directly use gitCatFileBatchReader and use git cat-file --batch

I did some profiling, and this solution barely generates any garbage, so it is very efficient. This means I'll export the code and integrate it directly into gitserver to try to create an end-to-end demo.

A note on buffering. When testing with hyperfine, adding output buffering slowed it down slightly. I wonder, though, if in practice the buffer will matter more because the output goes over the network rather than to /dev/null.
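
To make the trade-off concrete, here is a small sketch of optional output buffering. Per the commit list, the PR gates buffering on GIT_SG_BUFFER; here it is a plain boolean for simplicity. The names countingWriter, newOutput and writeChunks, and the 64 KiB buffer size, are assumptions for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
)

// countingWriter records how many Write calls reach the underlying sink,
// standing in for write syscalls on a file or socket.
type countingWriter struct{ calls, bytes int }

func (c *countingWriter) Write(p []byte) (int, error) {
	c.calls++
	c.bytes += len(p)
	return len(p), nil
}

// newOutput optionally wraps w in a bufio.Writer. The returned flush
// function must be called before exiting, or buffered data is lost.
func newOutput(w io.Writer, buffered bool) (io.Writer, func() error) {
	if !buffered {
		return w, func() error { return nil }
	}
	bw := bufio.NewWriterSize(w, 64<<10)
	return bw, bw.Flush
}

// writeChunks emits n small 10-byte writes, like many small tar records.
func writeChunks(w io.Writer, n int) error {
	for i := 0; i < n; i++ {
		if _, err := io.WriteString(w, "0123456789"); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	for _, buffered := range []bool{false, true} {
		sink := &countingWriter{}
		out, flush := newOutput(sink, buffered)
		if err := writeChunks(out, 1000); err != nil {
			panic(err)
		}
		if err := flush(); err != nil {
			panic(err)
		}
		fmt.Printf("buffered=%v writes=%d bytes=%d\n", buffered, sink.calls, sink.bytes)
	}
}
```

Buffering trades an extra copy for fewer writes to the sink, which costs little against /dev/null but may matter when each write goes to a network socket.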

$ hyperfine -w 1 'git archive --worktree-attributes --format=tar HEAD' 'git sg' 'GIT_SG_FILTER=1 git sg' 'GIT_SG_CATFILE=1 git sg' 'GIT_SG_LSTREE=1 git sg'
Benchmark 1: git archive --worktree-attributes --format=tar HEAD
  Time (mean ± σ):     348.1 ms ±   3.8 ms    [User: 319.4 ms, System: 28.1 ms]
  Range (min … max):   342.0 ms … 353.2 ms    10 runs

Benchmark 2: git sg
  Time (mean ± σ):     921.3 ms ±  12.0 ms    [User: 862.0 ms, System: 91.2 ms]
  Range (min … max):   899.9 ms … 937.4 ms    10 runs

Benchmark 3: GIT_SG_FILTER=1 git sg
  Time (mean ± σ):     385.1 ms ±   7.8 ms    [User: 395.5 ms, System: 93.1 ms]
  Range (min … max):   373.8 ms … 402.2 ms    10 runs

Benchmark 4: GIT_SG_CATFILE=1 git sg
  Time (mean ± σ):     451.4 ms ±   8.3 ms    [User: 383.2 ms, System: 145.2 ms]
  Range (min … max):   439.2 ms … 463.0 ms    10 runs

Benchmark 5: GIT_SG_LSTREE=1 git sg
  Time (mean ± σ):     358.3 ms ±   4.2 ms    [User: 359.0 ms, System: 113.7 ms]
  Range (min … max):   352.6 ms … 367.2 ms    10 runs

Summary
  'git archive --worktree-attributes --format=tar HEAD' ran
    1.03 ± 0.02 times faster than 'GIT_SG_LSTREE=1 git sg'
    1.11 ± 0.03 times faster than 'GIT_SG_FILTER=1 git sg'
    1.30 ± 0.03 times faster than 'GIT_SG_CATFILE=1 git sg'
    2.65 ± 0.05 times faster than 'git sg'

The profile output was really hard to read due to the arbitrary nesting of calls to writeTree. This introduces a manual stack, but slightly adjusts the order of output. I'd prefer the normal DFS order to match git archive, but at least for profiling this is good for now.

keegancsmith added 27 commits October 10, 2022 09:08

wip

a926a1a

fix test

e7e65e6

same output as git archive for "tar t"

ccf69b3

capture state in archiveWriter struct for better readability

950b26a

set mode

89c6970

do not do dotgit detection since it breaks bare repos

db2d110

cpu_profile flag

91ffa5b

try out keepdescriptors

04081cc

add memprofile

6500eb1

optionally buffer output if GIT_SG_BUFFER is set

753864c

add experimental GIT_SG_FILTER which just filters git archive

cebf65b

This is significantly faster than using go-git.

getting started on git-cat-file integration

83325cf

lunch time

add contents method for git-cat-file

15902a2

handle missing refs

0027f41

factor out common logic in cat-file

c95ac4f

add hash native API to catfile

5daae67

This will avoid allocations when using it.

make archive writer based on tree entries instead of object.TreeEntry

801d2a4

We only read the entries field, so this makes it easier to use a different impl.

move archive code into own file

f7a1f9c

wip interface to allow swapping out backend for archive writer

a4cdae0

implement TreeEntries for cat-file

ae81031

test all modes

d707154

wip lstree

d88c93f

refactor catfile to separate out gitCatFileBatchReader

d29b6ba

ls-tree just writing an archive of names

01adb4e

ls-tree implemented

d45e927

And it's pretty much the same speed as git archive on the sourcegraph repo. We only skip 7 files in that repo, which means it's hard to beat the speed of git archive.

skip TestDo on CI if missing .git

5040bd7

check for .git in all tests

b07c069

keegancsmith force-pushed the k/git-sg branch from 00a6a29 to b07c069 Compare October 10, 2022 07:08

implement archiver via git-lfs/gitobj

615d1d9

This is our slowest implementation so far! I believe this is because gitobj has no caching between parsing packfiles so it pays the cost on each object retrieval.

keegancsmith mentioned this pull request Nov 14, 2023

☂️ Search: improve Zoekt indexing sourcegraph/sourcegraph-public-snapshot#58133

Closed

14 tasks
