Avoid fetching files over size limit #814

jtibshirani · 2024-09-04T21:35:00Z

We never index files over 1MB, unless the "LargeFiles" allowlist is set. So in
most cases, we can avoid fetching them at all.

This PR updates the git fetch to filter out files over 1MB when possible, and
exclude tags. It also refactors the very long gitIndex method.

jtibshirani · 2024-09-04T21:39:06Z

To get a sense of the improvement, here's the difference in just running time git fetch... or the chromium repo on my laptop. Note: these timings are directionally correct, but not "tight" because of network variability.

git fetch --depth=1 [email protected]:chromium/chromium.git HEAD  33.86s user 5.62s system 31% cpu 2:06.73 total

git fetch --depth=1 --filter=blob:limit=1m  HEAD  26.72s user 4.14s system 31% cpu 1:38.35 total

I anticipate this will make a positive difference to indexing latency (but not huge). I like that it avoids pathological behavior for repos that have useless enormous files. And after looking at a bunch of traces on S2, it is clear that fetching significantly contributes to indexing latency, especially for "medium size" repos. Example:

2024/09/04 20:06:21.467710 | 169.461023 | github.com/llvm/llvm-project
-- | -- | --
20:06:21.467711 | .     1 | ... branches: [HEAD@d21e731c42d6b967e29dbe2edc16c1b86885df0d]
20:06:21.469175 | .  1465 | ... [git -c init.defaultBranch=nonExistentBranchBB0FOFCH32 init --bare /data/index/.indexserver.tmp/github.com%2Fllvm%2Fllvm-project.git]
20:06:21.474080 | .  4904 | ... success
20:06:21.474109 | .    29 | ... [git -C /data/index/.indexserver.tmp/github.com%2Fllvm%2Fllvm-project.git -c protocol.version=2 -c http.extraHeader=X-Sourcegraph-Actor-UID: internal fetch --depth=1 http://sourcegraph-frontend-internal:80/.internal/git/github.com/llvm/llvm-project d21e731c42d6b967e29dbe2edc16c1b86885df0d]
20:07:25.651004 | 64.176894 | ... success
20:07:25.651084 | .    80 | ... [git -C /data/index/.indexserver.tmp/github.com%2Fllvm%2Fllvm-project.git update-ref HEAD d21e731c42d6b967e29dbe2edc16c1b86885df0d]
20:07:25.653199 | .  2115 | ... success
20:07:25.653239 | .    40 | ... [git -C /data/index/.indexserver.tmp/github.com%2Fllvm%2Fllvm-project.git config zoekt.archived 0]
20:07:25.655049 | .  1810 | ... success
20:07:25.655089 | .    40 | ... [git -C /data/index/.indexserver.tmp/github.com%2Fllvm%2Fllvm-project.git config zoekt.fork 0]
20:07:25.657153 | .  2065 | ... success
20:07:25.657194 | .    41 | ... [git -C /data/index/.indexserver.tmp/github.com%2Fllvm%2Fllvm-project.git config zoekt.name github.com/llvm/llvm-project]
20:07:25.659307 | .  2113 | ... success
20:07:25.659332 | .    24 | ... [git -C /data/index/.indexserver.tmp/github.com%2Fllvm%2Fllvm-project.git config zoekt.priority 27769]
20:07:25.660962 | .  1630 | ... success
20:07:25.660986 | .    24 | ... [git -C /data/index/.indexserver.tmp/github.com%2Fllvm%2Fllvm-project.git config zoekt.public 1]
20:07:25.662670 | .  1685 | ... success
20:07:25.662694 | .    23 | ... [git -C /data/index/.indexserver.tmp/github.com%2Fllvm%2Fllvm-project.git config zoekt.repoid 136896]
20:07:25.664282 | .  1588 | ... success
20:07:25.664311 | .    29 | ... [zoekt-git-index -submodules=false -incremental -branches HEAD -language_map zig:scip,javascript:scip,python:scip,rust:scip,kotlin:scip,magik:scip,typescript:scip,hack:scip,scala:scip,go:scip,c_sharp:scip,ruby:scip -file_limit 1048576 -parallelism 4 -index /data/index -require_ctags -large_file **/fixtures.json -shard_merging /data/index/.indexserver.tmp/github.com%2Fllvm%2Fllvm-project.git]
20:09:10.871375 | 105.207064 | ... success
20:09:10.928730 | . 57355 | ... state: success

jtibshirani · 2024-09-04T21:39:35Z

cmd/zoekt-sourcegraph-indexserver/index.go

 		}

+		// If there are no exceptions to MaxFileSize (1MB), we can avoid fetching these large files.
+		if len(o.LargeFiles) == 0 {
+			fetchArgs = append(fetchArgs, "--filter=blob:limit=1m")


Here's the interesting part of the change.

very cool! In the past I explored wanting to add a smarter way to fetch git bundles to handle things like the ignore rule. But given for most places this isn't set this is a great optimization.

jtibshirani · 2024-09-04T21:40:58Z

gitindex/index.go

@@ -887,18 +887,19 @@ func createDocument(key fileKey,
 	opts build.Options,
 ) (zoekt.Document, error) {
 	blob, err := repos[key].Repo.BlobObject(key.ID)
+
+	// We filter out large documents when fetching the repo. So if an object is too large, it will not be found.
+	if errors.Is(err, plumbing.ErrObjectNotFound) {


Because it's just the blobs that are excluded, we still traverse all files in the commit tree and report "skipped docs" correctly. I tested this locally by indexing golang/go with and without this change, and checking that the number of docs matching "NOT-INDEXED: file size exceeds maximum" was identical.

mmanela · 2024-09-04T23:33:13Z

cmd/zoekt-sourcegraph-indexserver/index.go

 		}

+		// If there are no exceptions to MaxFileSize (1MB), we can avoid fetching these large files.
+		if len(o.LargeFiles) == 0 {


Does SG populate anything here by default?

No it does not. It requires a customer to modify the site config.

jtibshirani · 2024-09-04T23:33:44Z

Another example, this one from dot com where indexing is pretty fast but fetch takes a long time:


2024/09/04 19:52:55.320595 | 442.057663 | github.com/vespa-engine/vespa
-- | -- | --
19:52:55.320596 | .     1 | ... branches: [HEAD@5586eb5cafa0aa0e718c2c1f3d5ba009caa0231f]
19:52:55.321877 | .  1281 | ... [git -c init.defaultBranch=nonExistentBranchBB0FOFCH32 init --bare /data/index/.indexserver.tmp/github.com%2Fvespa-engine%2Fvespa.git]
19:52:55.326443 | .  4566 | ... success
19:52:55.326481 | .    38 | ... [git -C /data/index/.indexserver.tmp/github.com%2Fvespa-engine%2Fvespa.git -c protocol.version=2 -c http.extraHeader=X-Sourcegraph-Actor-UID: internal fetch --depth=1 http://sourcegraph-frontend-internal:80/.internal/git/github.com/vespa-engine/vespa 5586eb5cafa0aa0e718c2c1f3d5ba009caa0231f]
19:59:42.565318 | 407.238837 | ... success
19:59:42.565386 | .    68 | ... [git -C /data/index/.indexserver.tmp/github.com%2Fvespa-engine%2Fvespa.git update-ref HEAD 5586eb5cafa0aa0e718c2c1f3d5ba009caa0231f]
19:59:42.567955 | .  2569 | ... success
19:59:42.567999 | .    44 | ... [git -C /data/index/.indexserver.tmp/github.com%2Fvespa-engine%2Fvespa.git config zoekt.archived 0]
19:59:42.570859 | .  2860 | ... success
19:59:42.570895 | .    37 | ... [git -C /data/index/.indexserver.tmp/github.com%2Fvespa-engine%2Fvespa.git config zoekt.fork 0]
19:59:42.573350 | .  2455 | ... success
19:59:42.573383 | .    33 | ... [git -C /data/index/.indexserver.tmp/github.com%2Fvespa-engine%2Fvespa.git config zoekt.name github.com/vespa-engine/vespa]
19:59:42.576020 | .  2637 | ... success
19:59:42.576052 | .    32 | ... [git -C /data/index/.indexserver.tmp/github.com%2Fvespa-engine%2Fvespa.git config zoekt.priority 5559]
19:59:42.578298 | .  2245 | ... success
19:59:42.578350 | .    52 | ... [git -C /data/index/.indexserver.tmp/github.com%2Fvespa-engine%2Fvespa.git config zoekt.public 1]
19:59:42.580913 | .  2563 | ... success
19:59:42.580971 | .    58 | ... [git -C /data/index/.indexserver.tmp/github.com%2Fvespa-engine%2Fvespa.git config zoekt.repoid 1377362]
19:59:42.583445 | .  2474 | ... success
19:59:42.583508 | .    63 | ... [zoekt-git-index -submodules=false -incremental -branches HEAD -language_map zig:scip,magik:scip,scala:scip,go:scip,typescript:scip,hack:scip,rust:scip,python:scip,c_sharp:scip,ruby:scip,kotlin:scip,javascript:scip -file_limit 1048576 -parallelism 4 -index /data/index -require_ctags -large_file pkgs/top-level/all-packages.nix -large_file extensions/**/*.d.ts -large_file src/vs/**/*.d.ts -large_file src/typings/**/*.d.ts -large_file src/vscode-dts/**/*.d.ts -large_file lib/**/*.d.ts -large_file src/lib/**/*.d.ts -shard_merging /data/index/.indexserver.tmp/github.com%2Fvespa-engine%2Fvespa.git]
20:00:17.326306 | 34.742798 | ... success
20:00:17.378256 | . 51950 | ... state: success

keegancsmith

cool! Althought I think we do have large files configured on dotcom? Looking at your one trace example from dotcom I see -large_file flags:

19:59:42.583508 | .    63 | ... [zoekt-git-index -submodules=false -incremental -branches HEAD -language_map zig:scip,magik:scip,scala:scip,go:scip,typescript:scip,hack:scip,rust:scip,python:scip,c_sharp:scip,ruby:scip,kotlin:scip,javascript:scip -file_limit 1048576 -parallelism 4 -index /data/index -require_ctags -large_file pkgs/top-level/all-packages.nix -large_file extensions/**/*.d.ts -large_file src/vs/**/*.d.ts -large_file src/typings/**/*.d.ts -large_file src/vscode-dts/**/*.d.ts -large_file lib/**/*.d.ts -large_file src/lib/**/*.d.ts -shard_merging /data/index/.indexserver.tmp/github.com%2Fvespa-engine%2Fvespa.git]

keegancsmith · 2024-09-05T07:36:15Z

cmd/zoekt-sourcegraph-indexserver/index.go

@@ -234,9 +251,16 @@ func gitIndex(c gitIndexConfig, o *indexArgs, sourcegraph Sourcegraph, l sglog.L
 			"-C", gitDir,
 			"-c", "protocol.version=2",
 			"-c", "http.extraHeader=X-Sourcegraph-Actor-UID: internal",
-			"fetch", "--depth=1", o.CloneURL,
+			"fetch", "--depth=1", "--no-tags",


keegancsmith · 2024-09-05T07:37:18Z

cmd/zoekt-sourcegraph-indexserver/index.go

 		}

+		// If there are no exceptions to MaxFileSize (1MB), we can avoid fetching these large files.
+		if len(o.LargeFiles) == 0 {
+			fetchArgs = append(fetchArgs, "--filter=blob:limit=1m")


very cool! In the past I explored wanting to add a smarter way to fetch git bundles to handle things like the ignore rule. But given for most places this isn't set this is a great optimization.

jtibshirani · 2024-09-05T16:52:37Z

Althought I think we do have large files configured on dotcom? Looking at your one trace example from dotcom I see -large_file flags:

Good point 🤔 Here's what I'll do: merge this as it's simple and safe, then temporarily remove large_file on dot com to test the effect. If it's a nice improvement, then loop back and figure out how to generalize this to work with large_file too.

keegancsmith · 2024-09-05T17:02:57Z

Althought I think we do have large files configured on dotcom? Looking at your one trace example from dotcom I see -large_file flags:

Good point 🤔 Here's what I'll do: merge this as it's simple and safe, then temporarily remove large_file on dot com to test the effect. If it's a nice improvement, then loop back and figure out how to generalize this to work with large_file too.

check in with code intel. Those excludes look very much like things which may help cross-repo code intel? Or maybe that is something from an old way we did things.

jtibshirani · 2024-09-05T17:31:37Z

I blamed the dot com config, and these were added recently by Sourcegraphers wanting to search certain file types (for example https://github.com/sourcegraph/deploy-sourcegraph-cloud/pull/17613).

keegancsmith · 2024-09-05T17:54:07Z

aah cool. We should probably look into a way to add those overrides just for those specific repos. But worth temporarily regressing to see the impact.

jtibshirani · 2024-09-09T20:49:32Z

This helped reduce fetch times a bit. Here are the fetch durations (50th, 90th, 95th, and 99th) on dot com before and after the change.

However, there are still periods with really high variance that indicate something else is affecting fetch times. I also didn't see a big impact on overall indexing duration. So I'm not going to pursue this optimization further right now.

Avoid fetching files over size limit

f3e1e43

cla-bot bot added the cla-signed label Sep 4, 2024

jtibshirani commented Sep 4, 2024

View reviewed changes

jtibshirani requested a review from a team September 4, 2024 21:44

mmanela reviewed Sep 4, 2024

View reviewed changes

keegancsmith approved these changes Sep 5, 2024

View reviewed changes

jtibshirani merged commit c284149 into main Sep 5, 2024
9 checks passed

jtibshirani deleted the jtibs/fetch branch September 5, 2024 16:53

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Avoid fetching files over size limit #814

Avoid fetching files over size limit #814

jtibshirani commented Sep 4, 2024 •

edited

Loading

jtibshirani commented Sep 4, 2024 •

edited

Loading

jtibshirani Sep 4, 2024

keegancsmith Sep 5, 2024

jtibshirani Sep 4, 2024

mmanela Sep 4, 2024

jtibshirani Sep 5, 2024

jtibshirani commented Sep 4, 2024

keegancsmith left a comment

keegancsmith Sep 5, 2024

keegancsmith Sep 5, 2024

jtibshirani commented Sep 5, 2024

keegancsmith commented Sep 5, 2024

jtibshirani commented Sep 5, 2024

keegancsmith commented Sep 5, 2024

jtibshirani commented Sep 9, 2024

Avoid fetching files over size limit #814

Avoid fetching files over size limit #814

Conversation

jtibshirani commented Sep 4, 2024 • edited Loading

jtibshirani commented Sep 4, 2024 • edited Loading

jtibshirani Sep 4, 2024

Choose a reason for hiding this comment

keegancsmith Sep 5, 2024

Choose a reason for hiding this comment

jtibshirani Sep 4, 2024

Choose a reason for hiding this comment

mmanela Sep 4, 2024

Choose a reason for hiding this comment

jtibshirani Sep 5, 2024

Choose a reason for hiding this comment

jtibshirani commented Sep 4, 2024

keegancsmith left a comment

Choose a reason for hiding this comment

keegancsmith Sep 5, 2024

Choose a reason for hiding this comment

keegancsmith Sep 5, 2024

Choose a reason for hiding this comment

jtibshirani commented Sep 5, 2024

keegancsmith commented Sep 5, 2024

jtibshirani commented Sep 5, 2024

keegancsmith commented Sep 5, 2024

jtibshirani commented Sep 9, 2024

jtibshirani commented Sep 4, 2024 •

edited

Loading

jtibshirani commented Sep 4, 2024 •

edited

Loading