Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use single map for collecting files across branches #839

Merged
merged 2 commits into from
Oct 1, 2024

Conversation

jtibshirani
Copy link
Member

@jtibshirani jtibshirani commented Sep 30, 2024

When looking at large profiles for inuse_space on dot com, I noticed the filename maps in prepareNormalBuild taking a bunch of memory. This PR avoids allocating a separate map per branch, instead having RepoWalker collect all the entries in a single instance variable.

This is one last piece of "low hanging fruit" related to index memory. After this, we need to take a holistic look and consider a true rewrite.

This helps allocations a bit on sgtest/megarepo. Note that I haven't been able to locally repro the type of profile we're seeing on dot com.

Baseline vs. PR

BenchmarkPrepareNormalBuild-10    	       1	1561963752 B/op	 4249175 allocs/op
BenchmarkPrepareNormalBuild-10    	       1	1472635928 B/op	 4238388 allocs/op

Follow up to #838.

@cla-bot cla-bot bot added the cla-signed label Sep 30, 2024
@jtibshirani
Copy link
Member Author

Here is the memory profile from dot com:

Showing top 10 nodes out of 49
      flat  flat%   sum%        cum   cum%
    2.05GB 43.04% 43.04%     2.54GB 53.20%  github.com/sourcegraph/zoekt/gitindex.prepareNormalBuild
    0.88GB 18.49% 61.52%     0.88GB 18.49%  bytes.growSlice
    0.81GB 16.95% 78.47%     4.60GB 96.53%  github.com/sourcegraph/zoekt/gitindex.indexGitRepo
    0.25GB  5.22% 83.70%     0.25GB  5.22%  github.com/go-git/go-git/v5/plumbing/format/idxfile.(*MemoryIndex).genOffsetHash
    0.24GB  4.94% 88.63%     0.24GB  4.94%  github.com/go-git/go-git/v5/plumbing/object.simpleJoin (inline)
    0.13GB  2.80% 91.43%     0.13GB  2.80%  github.com/sourcegraph/zoekt.(*postingsBuilder).newSearchableString
    0.12GB  2.51% 93.95%     0.12GB  2.53%  github.com/sourcegraph/zoekt/build.(*Builder).Add
    0.11GB  2.28% 96.23%     0.11GB  2.28%  github.com/go-git/go-git/v5/plumbing.(*MemoryObject).Write
    0.10GB  2.09% 98.32%     0.10GB  2.09%  github.com/go-git/go-git/v5/plumbing/format/idxfile.readObjectNames
    0.02GB  0.46% 98.78%     0.03GB  0.66%  github.com/go-git/go-git/v5/plumbing/cache.(*ObjectLRU).Put

And the (truncated) output of list prepareNormalBuild:

         .          .    851:	for _, b := range branches {
         .          .    852:		commit, err := getCommit(repository, options.BranchPrefix, b)
         .          .    853:		if err != nil {
         .          .    854:			if options.AllowMissingBranch && err.Error() == "reference not found" {
         .          .    855:				continue
         .          .    856:			}
         .          .    857:
         .          .    858:			return nil, nil, nil, fmt.Errorf("getCommit: %w", err)
         .          .    859:		}
         .          .    860:
         .          .    861:		tree, err := commit.Tree()
         .          .    862:		if err != nil {
         .          .    863:			return nil, nil, nil, fmt.Errorf("commit.Tree: %w", err)
         .          .    864:		}
         .          .    865:
         .          .    866:		ig, err := newIgnoreMatcher(tree)
         .          .    867:		if err != nil {
         .          .    868:			return nil, nil, nil, fmt.Errorf("newIgnoreMatcher: %w", err)
         .          .    869:		}
         .          .    870:
         .   496.02MB    871:		files, subVersions, err := TreeToFiles(repository, tree, options.BuildOptions.RepositoryDescription.URL, repoCache)
         .          .    872:		if err != nil {
         .          .    873:			return nil, nil, nil, fmt.Errorf("TreeToFiles: %w", err)
         .          .    874:		}
         .          .    875:		for k, v := range files {
         .          .    876:			if ig.Match(k.Path) {
         .          .    877:				continue
         .          .    878:			}
  967.51MB   967.51MB    879:			repos[k] = v
    1.11GB     1.11GB    880:			branchMap[k] = append(branchMap[k], b)
         .          .    881:		}
         .          .    882:
         .          .    883:		branchVersions[b] = subVersions
         .          .    884:	}

Copy link
Member

@stefanhengl stefanhengl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me! I am surprised the impact isn't bigger?

Baseline vs. PR

BenchmarkPrepareNormalBuild-10    	       1	1561963752 B/op	 4249175 allocs/op
BenchmarkPrepareNormalBuild-10    	       1	1472635928 B/op	 4238388 allocs/op

@@ -359,12 +360,13 @@ func iterateManifest(mf *manifest.Manifest,
return nil, nil, err
}

files, versions, err := gitindex.TreeToFiles(topRepo, tree, projURL.String(), cache)
rw := gitindex.NewRepoWalker(topRepo, projURL.String(), cache)
versions, err := rw.CollectFiles(tree, rev, &ignore.Matcher{})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to name this subVersions, like in gitindex/index.go?

repos[k] = existing
} else {
repos[k] = BlobIndexInfo{Repo: v, Branches: []string{b}}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice optimization getting rid of this loop!

@jtibshirani
Copy link
Member Author

I am surprised the impact isn't bigger?

I was also surprised. When I looked at the local sgtest/megarepo profiles, the bulk of the allocations (and peak memory usage) come from go-git. However, on dot com we see some profiles where these maps in prepareNormalBuild really dominate memory. Unfortunately, I haven't been able to reproduce that type of profile in local benchmarks -- maybe we'd need a repo with a huge number of distinct paths, each of which is deeply nested?

@jtibshirani jtibshirani merged commit 6501360 into main Oct 1, 2024
9 checks passed
@jtibshirani jtibshirani deleted the jtibs/prepare-build branch October 1, 2024 16:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants