linter: Optimize string lookups with better data structure #6408

Closed
camchenry opened this issue Oct 9, 2024 · 8 comments

@camchenry
Contributor

camchenry commented Oct 9, 2024

There are many cases in the linter where we store a Vec&lt;String&gt;, Vec&lt;&amp;str&gt;, or Vec&lt;CompactStr&gt; and later check vec.contains(something). At runtime, this takes O(n) time to determine whether the given value is in the array. However, we can do better by replacing the Vec with a different data structure.
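For illustration, a minimal example of the pattern in question (the `allowed` list here is hypothetical):

```rust
fn main() {
    // The pattern described above: a Vec of strings plus a linear scan.
    let allowed: Vec<String> = vec!["foo".to_string(), "bar".to_string()];
    // Vec::contains compares against each element in turn: O(n).
    assert!(allowed.contains(&"bar".to_string()));
}
```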

FxHashSet

If we replace the Vec with an FxHashSet, lookups take O(1) time on average, at the cost of inserts taking longer due to hashing.
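A minimal sketch of the replacement, assuming the rustc-hash crate that provides FxHashSet:

```rust
use rustc_hash::FxHashSet;

fn main() {
    // Building the set hashes each string once up front...
    let set: FxHashSet<&str> = ["foo", "bar", "baz"].into_iter().collect();
    // ...but membership checks are O(1) on average thereafter.
    assert!(set.contains("bar"));
}
```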

Sorted Vec

If we sort the Vec after inserting values, then we can use binary_search to check whether an element is in the array, which takes O(log n) time. The sort itself is additional O(n log n) overhead, but this may be negligible if no elements are inserted afterwards. If elements are inserted later, it is probably faster to use a set.
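A minimal sketch of the sorted-Vec approach:

```rust
fn main() {
    let mut names: Vec<&str> = vec!["delta", "alpha", "charlie"];
    // Pay the O(n log n) sort once, up front.
    names.sort_unstable();
    // binary_search returns Ok(index) if found, Err(insertion_point) if not.
    assert!(names.binary_search(&"charlie").is_ok());
    assert!(names.binary_search(&"bravo").is_err());
}
```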


The ideal data structure here probably depends on the use case. For values that keep being inserted, a HashSet might be better. For a smaller set of values that don't change, a sorted Vec might be better.

@camchenry added the A-linter and C-performance labels Oct 9, 2024
@camchenry
Contributor Author

One related question I have: should we have a custom data structure in oxc for representing sets of strings? Pre-sorting the Vec has some nice benefits if we don't reinsert data, but it has the downside of not enforcing that no elements are inserted later, which, as the binary_search docs say, renders the result useless:

Binary searches this slice for a given element. If the slice is not sorted, the returned result is unspecified and meaningless.

Perhaps we could add a new type named something like StringSet/ReadonlySet/ConstSet/SortedVec which has the following behavior (a rough sketch follows the list):

  • Acts the same as Vec
  • Guarantees that data is always in sorted order when read
  • Methods that insert data such as .insert will throw an error
  • Has methods for explicitly inserting data such as .insert_and_sort()
  • Calling .contains() maps to .binary_search()
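A minimal sketch of what such a type could look like (all names hypothetical):

```rust
/// Hypothetical sketch (names illustrative): a Vec that is always sorted,
/// so `contains` can safely map to `binary_search`.
#[derive(Debug, Default)]
pub struct SortedVec<T: Ord>(Vec<T>);

impl<T: Ord> SortedVec<T> {
    /// Build from arbitrary values, sorting once up front.
    pub fn from_vec(mut values: Vec<T>) -> Self {
        values.sort_unstable();
        Self(values)
    }

    /// The only mutation path: insert, then restore the sorted invariant.
    pub fn insert_and_sort(&mut self, value: T) {
        self.0.push(value);
        self.0.sort_unstable();
    }

    /// `contains` maps to `binary_search`, which is valid because the
    /// constructor and `insert_and_sort` keep the data sorted.
    pub fn contains(&self, value: &T) -> bool {
        self.0.binary_search(value).is_ok()
    }
}
```

Rather than making .insert "throw an error" at runtime, a sketch like this can simply not expose a plain .insert at all, so the sorted invariant is enforced at compile time.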

@camchenry
Contributor Author

camchenry commented Oct 10, 2024

To help us make better decisions here, I did some microbenchmarking on my laptop. I compared .contains on a Vec of strings, binary_search on a sorted Vec, and .contains on an FxHashSet.
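For reference, a sketch of what such a microbenchmark can look like (not the actual code used; a harness such as criterion would give statistically sounder numbers):

```rust
use std::hint::black_box;
use std::time::Instant;

// Time `iters` lookups against a Vec of strings and report ns per iteration.
fn bench_vec_contains(haystack: &[&str], needle: &str, iters: u32) -> f64 {
    let start = Instant::now();
    for _ in 0..iters {
        // black_box keeps the optimizer from deleting the lookup.
        black_box(black_box(haystack).contains(&black_box(needle)));
    }
    start.elapsed().as_nanos() as f64 / f64::from(iters)
}
```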

The results are approximately as follows:

| Benchmark | Elements | Time |
| --- | --- | --- |
| Vec.contains | 1 | 6.3 ns/iter ⭐ |
| Vec.binary_search | 1 | 12.9 ns/iter |
| HashSet.contains | 1 | 16.0 ns/iter |
| Vec.contains (CompactStr) | 1 | 8.7 ns/iter ⭐ |
| Vec.binary_search (CompactStr) | 1 | 14.3 ns/iter |
| HashSet.contains (CompactStr) | 1 | 19.5 ns/iter |
| Vec.contains | 4 | 13.8 ns/iter ⭐ |
| Vec.binary_search | 4 | 32.35 ns/iter |
| HashSet.contains | 4 | 17.5 ns/iter |
| Vec.contains (CompactStr) | 4 | 21.5 ns/iter ⭐ |
| Vec.binary_search (CompactStr) | 4 | 37.4 ns/iter |
| HashSet.contains (CompactStr) | 4 | 22.5 ns/iter |
| Vec.contains | 60 | 59.4 ns/iter |
| Vec.binary_search | 60 | 91.7 ns/iter |
| HashSet.contains | 60 | 17.3 ns/iter ⭐ |
| Vec.contains (CompactStr) | 60 | 152.8 ns/iter |
| Vec.binary_search (CompactStr) | 60 | 96.1 ns/iter |
| HashSet.contains (CompactStr) | 60 | 19.5 ns/iter ⭐ |
| Vec.contains | 256 | 247.9 ns/iter |
| Vec.binary_search | 256 | 118.3 ns/iter |
| HashSet.contains | 256 | 16.9 ns/iter ⭐ |
| Vec.contains (CompactStr) | 256 | 630.9 ns/iter |
| Vec.binary_search (CompactStr) | 256 | 124.2 ns/iter |
| HashSet.contains (CompactStr) | 256 | 19.4 ns/iter ⭐ |

The results here vary somewhat, but after running this a number of times (several billion iterations on each benchmark), my general takeaways are as follows:

  • For a single element, .contains is definitely the fastest, no question about it.
  • For 4 elements, .contains is still usually the fastest, but sometimes .binary_search is faster.
  • For 16+ elements, FxHashSet is unquestionably the fastest option.
  • CompactStr does not seem to be zero cost. It may improve memory usage, but there does seem to be a negative runtime impact that ranges from negligible to considerable.

This suggests to me that maybe we need some sort of hybrid data structure that automatically reallocates/reorganizes as the size of the array grows? Something like this (a rough sketch in code follows the list):

  • For 0-4 elements, it is stored as unsorted and doing .contains calls .contains
  • For 4-16 elements, it is stored as sorted, and doing .contains calls .binary_search
  • For 16+ elements, it is stored as a hash set, and doing .contains calls HashSet.contains
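A rough sketch of that hybrid idea, with the thresholds from the list above (all names and cutoffs hypothetical):

```rust
use rustc_hash::FxHashSet;

// Hypothetical sketch; thresholds and names are illustrative only.
enum StringLookup {
    Small(Vec<String>),       // 0-4 elements: linear scan
    Sorted(Vec<String>),      // 4-16 elements: binary search
    Large(FxHashSet<String>), // 16+ elements: hash lookups
}

impl StringLookup {
    fn contains(&self, value: &str) -> bool {
        // Note: this match is itself a branch on every lookup, one of the
        // costs pointed out later in this thread.
        match self {
            StringLookup::Small(v) => v.iter().any(|s| s == value),
            StringLookup::Sorted(v) => {
                v.binary_search_by(|s| s.as_str().cmp(value)).is_ok()
            }
            StringLookup::Large(set) => set.contains(value),
        }
    }
}
```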

@DonIsaac
Contributor

These are surprising results, and extremely informative. Most configs have only a small number of elements.

Excellent stuff @camchenry

@Dunqing
Member

Dunqing commented Oct 10, 2024

Great!

cc @overlookmotel, you might be interested in this topic as well.

@Boshen
Member

Boshen commented Oct 10, 2024

I purposely used Vec::contains because it is the fastest for small numbers of elements, which most configs have.

@overlookmotel
Contributor

@camchenry Thanks very much for this extremely thorough investigation. Very interesting.

A few thoughts and responses:

There are many cases in the linter...

Can you give some examples? Everything I write below is in the abstract, but I'm not actually sure what we're talking about in practice.

For a 4 elements, .contains is still usually the fastest, but sometimes .binary_search is faster.

I didn't draw this conclusion from your bench results above. It looks more like .binary_search is never faster: Vec::contains or HashSet::contains is always the winner for any size of collection. Did you have some other findings beyond the perf table above which found .binary_search to be better in some circumstances?

CompactStr does not seem to be zero cost. It may improve memory usage, but there does seem to be a negative runtime impact that ranges from negligible to considerable.

Yes. The advantage of CompactStr is that it stores small strings inline so:

  1. less memory usage
  2. fewer allocations
  3. no further memory read required to read the string, as the string data is right there in the CompactStr.

Fewer allocations is probably the most significant of these advantages.

But the downside is that every time you read a CompactStr, CompactStr::as_str has to check "is this string stored inline or not?" That involves more instructions. CompactStr::as_str at least manages to make that check branchless by clever use of "conditional move" instructions. But it's still not so cheap.

So whether CompactStr is overall a benefit or not depends very much on the specific use case.

Personally, I think it's ideal to avoid CompactStr as much as possible if the strings are already in AST. In that case, Atom (which is just a wrapper around &str at present) does not use any more memory, and does not cause allocations. So it has most of the advantages of CompactStr, without the downsides.

Atom, however, has its own downside in terms of ergonomics. It has a lifetime Atom&lt;'a&gt;, so it's less simple to use than the lifetime-less CompactStr.
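A schematic of that trade-off (these are stand-ins, not oxc's actual definitions):

```rust
// Atom<'a> is described above as "just a wrapper around &str": no extra
// memory, no allocation, but the lifetime follows it everywhere.
#[derive(Clone, Copy)]
struct Atom<'a>(&'a str);

// CompactStr owns its bytes (short strings inline in the real type), so it
// has no lifetime parameter, at the cost of the inline/heap check on reads.
struct CompactStr(String); // stand-in representation

fn main() {
    let source = String::from("identifier");
    let atom = Atom(&source);               // borrows from the source text
    let owned = CompactStr(source.clone()); // owns its own copy
    assert_eq!(atom.0, owned.0.as_str());
}
```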

some sort of hybrid data structure that automatically reallocates/reorganizes as the size of the array grows

This is an attractive idea, and in some cases may be a good choice. However, be aware that this structure would have its own downsides:

  1. Every .contains call has to check "am I stored as a Vec or a HashSet?". This is similar to the overhead of CompactStr discussed above. That'll be quite an unpredictable branch (and unpredictable branches are quite costly).
  2. If it's a Vec initially, but grows and needs to convert itself to a HashSet, that conversion is not so cheap either. It would be less costly if it was just a HashSet from the start.

So it may be that rather than having a "one size fits all" solution, we're better off using a heuristic in each use case (how big is this thing usually?) and choosing Vec or HashSet for that use case accordingly.

One other thing to bear in mind is that when a HashSet grows beyond its capacity, every item in the HashSet has to have its hash recalculated. FxHash is a pretty cheap hash function (only 3 instructions to hash a u32 or u64) but hashing strings is much more costly.
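One mitigation, when the final size is known up front, is to pre-size the set so it never grows (hypothetical example using the rustc-hash crate):

```rust
use rustc_hash::FxHashSet;

fn main() {
    let names = ["foo", "bar", "baz"];
    // Reserving the final capacity up front means the set never grows,
    // so no string is ever rehashed after insertion.
    let mut set: FxHashSet<&str> =
        FxHashSet::with_capacity_and_hasher(names.len(), Default::default());
    set.extend(names);
    assert!(set.contains("baz"));
}
```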

A further consideration is that I think at some point, we'll likely introduce a new Ident type which stores the hash inline, which will make FxHashSet<Ident> much cheaper. But I'm not sure when we'll get to implementing that.
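A minimal sketch of what such an Ident type could look like (hypothetical, not an agreed design):

```rust
use std::hash::{Hash, Hasher};

// Hash the string once at construction; Hash then replays the cached value.
struct Ident<'a> {
    text: &'a str,
    hash: u64,
}

impl<'a> Ident<'a> {
    fn new(text: &'a str) -> Self {
        let mut hasher = rustc_hash::FxHasher::default();
        text.hash(&mut hasher);
        Self { text, hash: hasher.finish() }
    }
}

impl Hash for Ident<'_> {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // No string traversal: feed the precomputed hash straight through.
        state.write_u64(self.hash);
    }
}

impl PartialEq for Ident<'_> {
    fn eq(&self, other: &Self) -> bool {
        // Cheap reject on hash mismatch; full compare only when hashes agree.
        self.hash == other.hash && self.text == other.text
    }
}
impl Eq for Ident<'_> {}
```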

TLDR

Sorry I've written an essay. Your investigations raise some pretty fundamental and complicated questions!

What can we do in practice?

  1. The microbenchmarks are helpful, but I think we'd also need to try in practice replacing some Vecs with HashSets and see what "real" benchmarks say. With this stuff, often you can't really know how perf will be affected without trying and measuring.
  2. Replace CompactStr with Atom wherever possible. I suspect that'll be a win either way.
  3. We could consider trying to advance the Ident idea if we want to store strings in HashSets.

@camchenry
Contributor Author

There are many cases in the linter...

Can you give some examples? Everything I write below is in the abstract, but I'm not actually sure what we're talking about in practice.

I should have included this in the original issue; I mainly filed it in response to some PR feedback from @DonIsaac that I wanted to investigate further.

Here are some examples where we are using Vec.contains in the linter:

For a 4 elements, .contains is still usually the fastest, but sometimes .binary_search is faster.

I didn't draw this conclusion from your bench results above. It looks more like that .binary_search is never faster. Vec::contains or HashSet::contains is always the winner for any sized collection. Did you have some other findings beyond the perf table above which found .binary_search to be better in some circumstances?

This was just taken from the last benchmarking run that I did: there were some other cases where binary_search was slightly faster, but it wasn't consistent. I suspect it might be faster in some cases in the 8-64 element range, but once you get beyond 16+ elements it's probably better to just use a hash set anyway.

So it may be that rather than having a "one size fit all" solution, we're better off using a heuristic in each use case (how big is this thing usually?) and choosing Vec or HashSet for that use case accordingly.

I agree. I did some follow-up benchmarking where I created this hybrid data structure and did some basic testing on it. In practice, it was not faster than just carefully choosing a Vec or FxHashSet to begin with.

  1. The microbenchmarks are helpful, but I think we'd also need to try in practice replacing some Vecs with HashSets and see what "real" benchmarks say. With this stuff, often you can't really know how perf will be affected without trying and measuring.

Again, I agree. My profiling hasn't shown Vec.contains to ever be a huge bottleneck, compared to things like memmove or free (for allocating and deallocating), which take up much more time.

  2. Replace CompactStr with Atom wherever possible. I suspect that'll be a win either way.

Good to know. This won't be an option in all cases, since many Vec&lt;CompactStr&gt; values are read from the configuration file, but I bet there are some small wins to be found here.

  3. We could consider trying to advance the Ident idea if we want to store strings in HashSets.

This seems like something that would be nice to have. But again, the profiles that I have seen don't show checking if a string is in a set to be a significant bottleneck for linting.

I think we could probably save more time if we just avoided doing string lookups altogether by returning earlier, using different algorithms, etc.
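As a hypothetical illustration of that last point, a cheap structural check can skip the lookup entirely in the common case:

```rust
use rustc_hash::FxHashSet;

// Illustrative only: a cheap structural check filters out most inputs
// before any set lookup happens.
fn is_restricted(name: &str, restricted: &FxHashSet<String>) -> bool {
    // Suppose the rule only ever restricts names starting with '_':
    // most identifiers then never reach the hash lookup at all.
    if !name.starts_with('_') {
        return false;
    }
    restricted.contains(name)
}
```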

@Boshen
Member

Boshen commented Oct 14, 2024

Closing as not planned, as I see no actionable items coming out of the investigation.

It seems like the rule of thumb is to use contains, given that most configs have a small number of values.

@Boshen closed this as not planned Oct 14, 2024