Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid fetching files over size limit #814

Merged
merged 1 commit into from
Sep 5, 2024
Merged

Avoid fetching files over size limit #814

merged 1 commit into from
Sep 5, 2024

Conversation

jtibshirani
Copy link
Member

@jtibshirani jtibshirani commented Sep 4, 2024

We never index files over 1MB, unless the "LargeFiles" allowlist is set. So in
most cases, we can avoid fetching them at all.

This PR updates the git fetch to filter out files over 1MB when possible, and
exclude tags. It also refactors the very long gitIndex method.

@cla-bot cla-bot bot added the cla-signed label Sep 4, 2024
@jtibshirani
Copy link
Member Author

jtibshirani commented Sep 4, 2024

To get a sense of the improvement, here's the difference in just running time git fetch... or the chromium repo on my laptop. Note: these timings are directionally correct, but not "tight" because of network variability.

git fetch --depth=1 [email protected]:chromium/chromium.git HEAD  33.86s user 5.62s system 31% cpu 2:06.73 total
git fetch --depth=1 --filter=blob:limit=1m  HEAD  26.72s user 4.14s system 31% cpu 1:38.35 total

I anticipate this will make a positive difference to indexing latency (but not huge). I like that it avoids pathological behavior for repos that have useless enormous files. And after looking at a bunch of traces on S2, it is clear that fetching significantly contributes to indexing latency, especially for "medium size" repos. Example:

2024/09/04 20:06:21.467710 | 169.461023 | github.com/llvm/llvm-project
-- | -- | --
20:06:21.467711 | .     1 | ... branches: [HEAD@d21e731c42d6b967e29dbe2edc16c1b86885df0d]
20:06:21.469175 | .  1465 | ... [git -c init.defaultBranch=nonExistentBranchBB0FOFCH32 init --bare /data/index/.indexserver.tmp/github.com%2Fllvm%2Fllvm-project.git]
20:06:21.474080 | .  4904 | ... success
20:06:21.474109 | .    29 | ... [git -C /data/index/.indexserver.tmp/github.com%2Fllvm%2Fllvm-project.git -c protocol.version=2 -c http.extraHeader=X-Sourcegraph-Actor-UID: internal fetch --depth=1 http://sourcegraph-frontend-internal:80/.internal/git/github.com/llvm/llvm-project d21e731c42d6b967e29dbe2edc16c1b86885df0d]
20:07:25.651004 | 64.176894 | ... success
20:07:25.651084 | .    80 | ... [git -C /data/index/.indexserver.tmp/github.com%2Fllvm%2Fllvm-project.git update-ref HEAD d21e731c42d6b967e29dbe2edc16c1b86885df0d]
20:07:25.653199 | .  2115 | ... success
20:07:25.653239 | .    40 | ... [git -C /data/index/.indexserver.tmp/github.com%2Fllvm%2Fllvm-project.git config zoekt.archived 0]
20:07:25.655049 | .  1810 | ... success
20:07:25.655089 | .    40 | ... [git -C /data/index/.indexserver.tmp/github.com%2Fllvm%2Fllvm-project.git config zoekt.fork 0]
20:07:25.657153 | .  2065 | ... success
20:07:25.657194 | .    41 | ... [git -C /data/index/.indexserver.tmp/github.com%2Fllvm%2Fllvm-project.git config zoekt.name github.com/llvm/llvm-project]
20:07:25.659307 | .  2113 | ... success
20:07:25.659332 | .    24 | ... [git -C /data/index/.indexserver.tmp/github.com%2Fllvm%2Fllvm-project.git config zoekt.priority 27769]
20:07:25.660962 | .  1630 | ... success
20:07:25.660986 | .    24 | ... [git -C /data/index/.indexserver.tmp/github.com%2Fllvm%2Fllvm-project.git config zoekt.public 1]
20:07:25.662670 | .  1685 | ... success
20:07:25.662694 | .    23 | ... [git -C /data/index/.indexserver.tmp/github.com%2Fllvm%2Fllvm-project.git config zoekt.repoid 136896]
20:07:25.664282 | .  1588 | ... success
20:07:25.664311 | .    29 | ... [zoekt-git-index -submodules=false -incremental -branches HEAD -language_map zig:scip,javascript:scip,python:scip,rust:scip,kotlin:scip,magik:scip,typescript:scip,hack:scip,scala:scip,go:scip,c_sharp:scip,ruby:scip -file_limit 1048576 -parallelism 4 -index /data/index -require_ctags -large_file **/fixtures.json -shard_merging /data/index/.indexserver.tmp/github.com%2Fllvm%2Fllvm-project.git]
20:09:10.871375 | 105.207064 | ... success
20:09:10.928730 | . 57355 | ... state: success

}

// If there are no exceptions to MaxFileSize (1MB), we can avoid fetching these large files.
if len(o.LargeFiles) == 0 {
fetchArgs = append(fetchArgs, "--filter=blob:limit=1m")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the interesting part of the change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very cool! In the past I explored wanting to add a smarter way to fetch git bundles to handle things like the ignore rule. But given for most places this isn't set this is a great optimization.

@@ -887,18 +887,19 @@ func createDocument(key fileKey,
opts build.Options,
) (zoekt.Document, error) {
blob, err := repos[key].Repo.BlobObject(key.ID)

// We filter out large documents when fetching the repo. So if an object is too large, it will not be found.
if errors.Is(err, plumbing.ErrObjectNotFound) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it's just the blobs that are excluded, we still traverse all files in the commit tree and report "skipped docs" correctly. I tested this locally by indexing golang/go with and without this change, and checking that the number of docs matching "NOT-INDEXED: file size exceeds maximum" was identical.

@jtibshirani jtibshirani requested a review from a team September 4, 2024 21:44
}

// If there are no exceptions to MaxFileSize (1MB), we can avoid fetching these large files.
if len(o.LargeFiles) == 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does SG populate anything here by default?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No it does not. It requires a customer to modify the site config.

@jtibshirani
Copy link
Member Author

Another example, this one from dot com where indexing is pretty fast but fetch takes a long time:


2024/09/04 19:52:55.320595 | 442.057663 | github.com/vespa-engine/vespa
-- | -- | --
19:52:55.320596 | .     1 | ... branches: [HEAD@5586eb5cafa0aa0e718c2c1f3d5ba009caa0231f]
19:52:55.321877 | .  1281 | ... [git -c init.defaultBranch=nonExistentBranchBB0FOFCH32 init --bare /data/index/.indexserver.tmp/github.com%2Fvespa-engine%2Fvespa.git]
19:52:55.326443 | .  4566 | ... success
19:52:55.326481 | .    38 | ... [git -C /data/index/.indexserver.tmp/github.com%2Fvespa-engine%2Fvespa.git -c protocol.version=2 -c http.extraHeader=X-Sourcegraph-Actor-UID: internal fetch --depth=1 http://sourcegraph-frontend-internal:80/.internal/git/github.com/vespa-engine/vespa 5586eb5cafa0aa0e718c2c1f3d5ba009caa0231f]
19:59:42.565318 | 407.238837 | ... success
19:59:42.565386 | .    68 | ... [git -C /data/index/.indexserver.tmp/github.com%2Fvespa-engine%2Fvespa.git update-ref HEAD 5586eb5cafa0aa0e718c2c1f3d5ba009caa0231f]
19:59:42.567955 | .  2569 | ... success
19:59:42.567999 | .    44 | ... [git -C /data/index/.indexserver.tmp/github.com%2Fvespa-engine%2Fvespa.git config zoekt.archived 0]
19:59:42.570859 | .  2860 | ... success
19:59:42.570895 | .    37 | ... [git -C /data/index/.indexserver.tmp/github.com%2Fvespa-engine%2Fvespa.git config zoekt.fork 0]
19:59:42.573350 | .  2455 | ... success
19:59:42.573383 | .    33 | ... [git -C /data/index/.indexserver.tmp/github.com%2Fvespa-engine%2Fvespa.git config zoekt.name github.com/vespa-engine/vespa]
19:59:42.576020 | .  2637 | ... success
19:59:42.576052 | .    32 | ... [git -C /data/index/.indexserver.tmp/github.com%2Fvespa-engine%2Fvespa.git config zoekt.priority 5559]
19:59:42.578298 | .  2245 | ... success
19:59:42.578350 | .    52 | ... [git -C /data/index/.indexserver.tmp/github.com%2Fvespa-engine%2Fvespa.git config zoekt.public 1]
19:59:42.580913 | .  2563 | ... success
19:59:42.580971 | .    58 | ... [git -C /data/index/.indexserver.tmp/github.com%2Fvespa-engine%2Fvespa.git config zoekt.repoid 1377362]
19:59:42.583445 | .  2474 | ... success
19:59:42.583508 | .    63 | ... [zoekt-git-index -submodules=false -incremental -branches HEAD -language_map zig:scip,magik:scip,scala:scip,go:scip,typescript:scip,hack:scip,rust:scip,python:scip,c_sharp:scip,ruby:scip,kotlin:scip,javascript:scip -file_limit 1048576 -parallelism 4 -index /data/index -require_ctags -large_file pkgs/top-level/all-packages.nix -large_file extensions/**/*.d.ts -large_file src/vs/**/*.d.ts -large_file src/typings/**/*.d.ts -large_file src/vscode-dts/**/*.d.ts -large_file lib/**/*.d.ts -large_file src/lib/**/*.d.ts -shard_merging /data/index/.indexserver.tmp/github.com%2Fvespa-engine%2Fvespa.git]
20:00:17.326306 | 34.742798 | ... success
20:00:17.378256 | . 51950 | ... state: success

Copy link
Member

@keegancsmith keegancsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool! Althought I think we do have large files configured on dotcom? Looking at your one trace example from dotcom I see -large_file flags:

19:59:42.583508 | .    63 | ... [zoekt-git-index -submodules=false -incremental -branches HEAD -language_map zig:scip,magik:scip,scala:scip,go:scip,typescript:scip,hack:scip,rust:scip,python:scip,c_sharp:scip,ruby:scip,kotlin:scip,javascript:scip -file_limit 1048576 -parallelism 4 -index /data/index -require_ctags -large_file pkgs/top-level/all-packages.nix -large_file extensions/**/*.d.ts -large_file src/vs/**/*.d.ts -large_file src/typings/**/*.d.ts -large_file src/vscode-dts/**/*.d.ts -large_file lib/**/*.d.ts -large_file src/lib/**/*.d.ts -shard_merging /data/index/.indexserver.tmp/github.com%2Fvespa-engine%2Fvespa.git]

@@ -234,9 +251,16 @@ func gitIndex(c gitIndexConfig, o *indexArgs, sourcegraph Sourcegraph, l sglog.L
"-C", gitDir,
"-c", "protocol.version=2",
"-c", "http.extraHeader=X-Sourcegraph-Actor-UID: internal",
"fetch", "--depth=1", o.CloneURL,
"fetch", "--depth=1", "--no-tags",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

}

// If there are no exceptions to MaxFileSize (1MB), we can avoid fetching these large files.
if len(o.LargeFiles) == 0 {
fetchArgs = append(fetchArgs, "--filter=blob:limit=1m")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very cool! In the past I explored wanting to add a smarter way to fetch git bundles to handle things like the ignore rule. But given for most places this isn't set this is a great optimization.

@jtibshirani
Copy link
Member Author

Althought I think we do have large files configured on dotcom? Looking at your one trace example from dotcom I see -large_file flags:

Good point 🤔 Here's what I'll do: merge this as it's simple and safe, then temporarily remove large_file on dot com to test the effect. If it's a nice improvement, then loop back and figure out how to generalize this to work with large_file too.

@jtibshirani jtibshirani merged commit c284149 into main Sep 5, 2024
9 checks passed
@jtibshirani jtibshirani deleted the jtibs/fetch branch September 5, 2024 16:53
@keegancsmith
Copy link
Member

Althought I think we do have large files configured on dotcom? Looking at your one trace example from dotcom I see -large_file flags:

Good point 🤔 Here's what I'll do: merge this as it's simple and safe, then temporarily remove large_file on dot com to test the effect. If it's a nice improvement, then loop back and figure out how to generalize this to work with large_file too.

check in with code intel. Those excludes look very much like things which may help cross-repo code intel? Or maybe that is something from an old way we did things.

@jtibshirani
Copy link
Member Author

I blamed the dot com config, and these were added recently by Sourcegraphers wanting to search certain file types (for example https://github.com/sourcegraph/deploy-sourcegraph-cloud/pull/17613).

@keegancsmith
Copy link
Member

aah cool. We should probably look into a way to add those overrides just for those specific repos. But worth temporarily regressing to see the impact.

@jtibshirani
Copy link
Member Author

This helped reduce fetch times a bit. Here are the fetch durations (50th, 90th, 95th, and 99th) on dot com before and after the change.

Screenshot 2024-09-09 at 12 06 06 PM
Screenshot 2024-09-09 at 12 06 11 PM

However, there are still periods with really high variance that indicate something else is affecting fetch times. I also didn't see a big impact on overall indexing duration. So I'm not going to pursue this optimization further right now.

Screenshot 2024-09-09 at 1 49 01 PM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants