globwalk looks at all files even when it doesn't need to #29

jyn514 · 2020-06-28T16:33:21Z

https://docs.rs/globwalk/0.8.0/src/globwalk/lib.rs.html#355

If the top-level directory is not a match, no subdirectories or files in the directory will be a match either. However, globwalk still walks them. If there are many files in the ignored directory, this can cause enormous slowdowns. In https://github.com/rust-lang/docs.rs/pull/861/files#diff-ee6431d852ce8913514eece9e3982d32R96-R99, we have several hundred thousand files in subdirectories, but only ~10 in the matched directory, so this causes a slowdown of several orders of magnitude.

jyn514 · 2020-06-28T16:34:24Z

I'm willing to work on this.

robinfriedli · 2021-05-11T21:10:28Z

Same here. In my case I'm using Tera, which uses globwalk to handle globs. My application is practically unusable when running in a Docker container where the working directory is the container's root directory, because globwalk spends hours walking directories like /sys. I've never even let it finish, so who knows how much longer it would have spent scanning my entire file system when it really should just find two HTML files in src/resources/templates/*.html. I haven't noticed it when running locally, but even then it walks through the entire target/ directory.

Luckily, the fix appears to be rather simple. The iterator does not handle the case where ignore::overrides::Override::matched returns Match::None and only skips the directory if it returns Match::Ignore. So adding the following after line 410 should suffice:

Match::None if is_dir => {
    skip_dir = true;
    continue 'skipper;
}

Then globwalk seems to skip irrelevant directories as expected.

Gilnaa · 2021-05-11T21:22:40Z

@robinfriedli thank you for working on it!
I'll set a reminder to fix this in the upcoming days.

Gilnaa · 2021-05-11T21:35:43Z

After a brief check, it sadly doesn't work exactly like that, and filters out correct files.
After applying the patch, the following tests fail after not finding any files:

failures:
    tests::test_blacklist
    tests::test_case_insensitive_matching
    tests::test_from_patterns

It looks like Match::None means that the path (the directory) did not match any of the patterns, but it can still be a prefix of a correct match.

Can you supply an example where this worked for you?

robinfriedli · 2021-05-11T22:10:04Z

Yeah, I jumped the gun a bit here. Unfortunately it is not quite that simple, since a directory that is part of the pattern also results in Match::None. Maybe it would be enough to check whether the directory is a prefix of the glob?
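
A rough sketch of what such a prefix check could look like, using only std. The function name dir_may_contain_match is hypothetical and not part of globwalk or the ignore crate. It assumes the pattern is anchored at the walk root and uses '/' as separator; gitignore-style patterns without a slash can match at any depth, so the real matcher would need more care. It is deliberately conservative: any `**` or wildcard component disables pruning for that component.

```rust
/// Hypothetical helper (not a globwalk API): returns true if `dir` could still
/// contain a path matching `pattern`. Both are relative to the walk root and
/// use '/' as separator. Assumes the pattern is anchored at the root.
fn dir_may_contain_match(dir: &str, pattern: &str) -> bool {
    let mut pat: Vec<&str> = pattern.split('/').filter(|c| !c.is_empty()).collect();
    // The last component matches entries inside a directory, not the directory itself.
    pat.pop();

    let mut pat_iter = pat.into_iter();
    for d in dir.split('/').filter(|c| !c.is_empty()) {
        match pat_iter.next() {
            // `**` can absorb any number of directories, so never prune below it.
            Some("**") => return true,
            // A wildcard component is conservatively treated as matching any name.
            Some(p) if p.contains(|c: char| matches!(c, '*' | '?' | '[' | '{')) => continue,
            // A literal component must match exactly.
            Some(p) if p == d => continue,
            // Literal mismatch, or the directory is deeper than the pattern allows.
            _ => return false,
        }
    }
    true
}
```

With the pattern src/resources/templates/*.html, this keeps src and src/resources but prunes target and sys, which is the behaviour the Match::None patch was aiming for without dropping directories that are a prefix of the pattern.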

Gilnaa · 2021-05-11T22:41:38Z

We can construct a second glob for dirs that contains the prefixes of the original positive patterns.
Not sure this covers all cases, especially WRT **, but it's a good start.

robinfriedli · 2021-05-11T23:27:28Z

I thought of that but do we actually want to yield those directories? I guess we could keep those globs separate and only use them to decide on whether to skip. I'm not 100% sure either but creating a glob for each child path (using the path components) might work. For example, for the glob **/templates/*.html we would create the globs **/ and **/templates/, which would match any directory.
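
To make the idea concrete, here is a minimal sketch of deriving those directory globs from a pattern's path components. dir_prefix_globs is a hypothetical name, not something globwalk exposes, and the sketch only does the string splitting; feeding the results into a separate matcher used purely for the skip decision is left out.

```rust
/// Hypothetical helper (not a globwalk API): builds one directory glob per
/// leading path component of `pattern`, excluding the final component.
fn dir_prefix_globs(pattern: &str) -> Vec<String> {
    let components: Vec<&str> = pattern.split('/').collect();
    let mut prefixes = Vec::new();
    let mut current = String::new();

    // `split` always yields at least one item, so this slice never underflows.
    for component in &components[..components.len() - 1] {
        current.push_str(component);
        current.push('/');
        prefixes.push(current.clone());
    }
    prefixes
}
```

For **/templates/*.html this yields **/ and **/templates/, as described above. A pattern with no directory part, such as *.html, yields no prefixes, so it would need separate handling.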

robinfriedli · 2021-05-12T00:31:59Z

That does seem to work to some extent, at least all tests pass (except for the readme test) and it does seem to skip irrelevant directories. I can open a PR if you like.

Gilnaa · 2021-05-12T00:33:26Z

Gladly. I'll try to attack it with tests over the weekend to see if we missed something.

Thanks again!

robinfriedli · 2021-05-12T01:42:58Z

No problem 😄 Pull request opened.

JohnAZoidberg · 2022-03-24T11:03:59Z

> Same here. In my case I'm using Tera, which uses globwalk to handle globs and I noticed that my application is practically unusable when running in a Docker container where the working directory is the root directory of the container because globwalk spends hours walking directories like /sys

Wow, I just ran into this exact same issue. It probably would never finish due to recursive symlinks.

robinfriedli mentioned this issue May 12, 2021

skip walking irrelevant directories when matching globs #31

Open

mre mentioned this issue Jan 27, 2022

Add html5gum as alternative link extractor lycheeverse/lychee#480

Merged

PAStheLoD mentioned this issue Jul 30, 2022

loading template files can easily get into an infinite loop Keats/tera#740

Open

mtkennerly mentioned this issue Oct 3, 2022

Proton initialized title files not found when casing does not match. mtkennerly/ludusavi#55

Closed

billti mentioned this issue Oct 17, 2023

Introduce Q# project structure microsoft/qsharp#794

Merged
