Avoid fetching files over size limit #814
Conversation
To get a sense of the improvement, here's the difference in just running
I anticipate this will make a positive difference to indexing latency (but not huge). I like that it avoids pathological behavior for repos that have useless enormous files. And after looking at a bunch of traces on S2, it is clear that fetching significantly contributes to indexing latency, especially for "medium size" repos. Example:
```diff
 	}
 
+	// If there are no exceptions to MaxFileSize (1MB), we can avoid fetching these large files.
+	if len(o.LargeFiles) == 0 {
+		fetchArgs = append(fetchArgs, "--filter=blob:limit=1m")
```
Here's the interesting part of the change.
Very cool! In the past I explored adding a smarter way to fetch git bundles to handle things like the ignore rule. But given that this isn't set in most places, this is a great optimization.
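To illustrate the idea being discussed, here is a minimal sketch of the optimization, using a hypothetical `buildFetchArgs` helper (simplified from the diff; not the actual function in the PR): the partial-clone blob filter is added only when no `LargeFiles` exceptions are configured.

```go
package main

import "fmt"

// buildFetchArgs is a hypothetical helper mirroring the change under review:
// when no LargeFiles exceptions are configured, add git's partial-clone
// filter so blobs over the 1MB MaxFileSize limit are never fetched.
func buildFetchArgs(cloneURL string, largeFiles []string) []string {
	args := []string{"fetch", "--depth=1", "--no-tags"}
	if len(largeFiles) == 0 {
		// Skip fetching blobs larger than 1MB; they would be rejected
		// by the indexer's size limit anyway.
		args = append(args, "--filter=blob:limit=1m")
	}
	return append(args, cloneURL)
}

func main() {
	// With no LargeFiles exceptions, the filter is included.
	fmt.Println(buildFetchArgs("https://example.com/repo.git", nil))
	// → [fetch --depth=1 --no-tags --filter=blob:limit=1m https://example.com/repo.git]

	// With an exception pattern configured, everything must be fetched.
	fmt.Println(buildFetchArgs("https://example.com/repo.git", []string{"**/*.d.ts"}))
	// → [fetch --depth=1 --no-tags https://example.com/repo.git]
}
```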
```diff
@@ -887,18 +887,19 @@ func createDocument(key fileKey,
 	opts build.Options,
 ) (zoekt.Document, error) {
 	blob, err := repos[key].Repo.BlobObject(key.ID)
 
+	// We filter out large documents when fetching the repo. So if an object is too large, it will not be found.
 	if errors.Is(err, plumbing.ErrObjectNotFound) {
```
Because it's just the blobs that are excluded, we still traverse all files in the commit tree and report "skipped docs" correctly. I tested this locally by indexing golang/go with and without this change, and checking that the number of docs matching "NOT-INDEXED: file size exceeds maximum" was identical.
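The behavior described above can be sketched as follows. This is a simplified, hypothetical stand-in for the `createDocument` path (the helper name, the local error value, and the placeholder string handling are illustrative, not the PR's actual code): because oversized blobs were filtered out at fetch time, a missing object means the file exceeded the size limit, so it is recorded as a skipped doc rather than treated as a failure.

```go
package main

import (
	"errors"
	"fmt"
)

// errObjectNotFound stands in for go-git's plumbing.ErrObjectNotFound.
var errObjectNotFound = errors.New("object not found")

// loadBlob is a hypothetical helper: the commit tree is still traversed, so
// every file name is seen, but blobs over the limit were never fetched. A
// not-found object therefore becomes a "skipped doc" placeholder instead of
// an indexing error.
func loadBlob(name string, fetchErr error) (string, error) {
	if errors.Is(fetchErr, errObjectNotFound) {
		return "NOT-INDEXED: file size exceeds maximum", nil
	}
	if fetchErr != nil {
		return "", fetchErr
	}
	return "contents of " + name, nil
}

func main() {
	doc, _ := loadBlob("huge.bin", errObjectNotFound)
	fmt.Println(doc) // → NOT-INDEXED: file size exceeds maximum
}
```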
```diff
 	}
 
+	// If there are no exceptions to MaxFileSize (1MB), we can avoid fetching these large files.
+	if len(o.LargeFiles) == 0 {
```
Does SG populate anything here by default?
No it does not. It requires a customer to modify the site config.
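For context, a minimal sketch of how a `LargeFiles` allowlist exempts matching paths from the size limit. This is an illustration, not zoekt's actual matcher: it uses `path/filepath.Match`, which does not support the `**` patterns seen in the real site config.

```go
package main

import (
	"fmt"
	"path/filepath"
)

const maxFileSize = 1 << 20 // 1MB, the default MaxFileSize

// isIndexable is a hypothetical helper: files over the limit are indexed only
// if they match a LargeFiles pattern from the site config. Simplified glob
// matching; the real config also uses `**` patterns, which filepath.Match
// does not understand.
func isIndexable(path string, size int64, largeFiles []string) bool {
	if size <= maxFileSize {
		return true
	}
	for _, pat := range largeFiles {
		if ok, _ := filepath.Match(pat, path); ok {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isIndexable("all-packages.nix", 5<<20, []string{"*.nix"})) // → true
	fmt.Println(isIndexable("huge.bin", 5<<20, nil))                       // → false
}
```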
Another example, this one from dot com where indexing is pretty fast but fetch takes a long time:
Cool! Although I think we do have large files configured on dotcom? Looking at your one trace example from dotcom I see `-large_file` flags:

```
19:59:42.583508 | . 63 | ... [zoekt-git-index -submodules=false -incremental -branches HEAD -language_map zig:scip,magik:scip,scala:scip,go:scip,typescript:scip,hack:scip,rust:scip,python:scip,c_sharp:scip,ruby:scip,kotlin:scip,javascript:scip -file_limit 1048576 -parallelism 4 -index /data/index -require_ctags -large_file pkgs/top-level/all-packages.nix -large_file extensions/**/*.d.ts -large_file src/vs/**/*.d.ts -large_file src/typings/**/*.d.ts -large_file src/vscode-dts/**/*.d.ts -large_file lib/**/*.d.ts -large_file src/lib/**/*.d.ts -shard_merging /data/index/.indexserver.tmp/github.com%2Fvespa-engine%2Fvespa.git]
```
```diff
@@ -234,9 +251,16 @@ func gitIndex(c gitIndexConfig, o *indexArgs, sourcegraph Sourcegraph, l sglog.L
 		"-C", gitDir,
 		"-c", "protocol.version=2",
 		"-c", "http.extraHeader=X-Sourcegraph-Actor-UID: internal",
-		"fetch", "--depth=1", o.CloneURL,
+		"fetch", "--depth=1", "--no-tags",
```
nice!
Good point 🤔 Here's what I'll do: merge this as it's simple and safe, then temporarily remove

Check in with code intel. Those excludes look very much like things that may help cross-repo code intel? Or maybe that is something from an old way we did things.

I blamed the dot com config, and these were added recently by Sourcegraphers wanting to search certain file types (for example https://github.com/sourcegraph/deploy-sourcegraph-cloud/pull/17613).

Aah, cool. We should probably look into a way to add those overrides just for those specific repos. But it's worth temporarily regressing to see the impact.

This helped reduce fetch times a bit. Here are the fetch durations (50th, 90th, 95th, and 99th percentiles) on dot com before and after the change. However, there are still periods with really high variance that indicate something else is affecting fetch times. I also didn't see a big impact on overall indexing duration, so I'm not going to pursue this optimization further right now.
We never index files over 1MB unless the "LargeFiles" allowlist is set. So in most cases, we can avoid fetching them at all.

This PR updates the `git fetch` invocation to filter out files over 1MB when possible, and to exclude tags. It also refactors the very long `gitIndex` method.