linter: Optimize string lookups with better data structure #6408

Closed
camchenry opened this issue Oct 9, 2024 · 8 comments

@camchenry
Contributor

camchenry commented Oct 9, 2024

There are many cases in the linter where we store a Vec&lt;String&gt;, Vec&lt;&amp;str&gt;, or Vec&lt;CompactStr&gt; and later check vec.contains(something). At runtime, this takes O(n) time to determine whether the given value is in the array. However, we can do better by replacing the Vec with a different data structure.
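For illustration, a minimal example of the pattern in question (the `allowed` list here is hypothetical):

```rust
fn main() {
    // The pattern described above: a Vec of strings plus a linear scan.
    let allowed: Vec<String> = vec!["foo".to_string(), "bar".to_string()];
    // Vec::contains compares against each element in turn: O(n).
    assert!(allowed.contains(&"bar".to_string()));
}
```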

FxHashSet

If we replace the Vec with an FxHashSet, lookups take O(1) time on average, at the cost of inserts taking longer due to hashing.
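A minimal sketch of the replacement, assuming the rustc-hash crate that provides FxHashSet:

```rust
use rustc_hash::FxHashSet;

fn main() {
    // Building the set hashes each string once up front...
    let set: FxHashSet<&str> = ["foo", "bar", "baz"].into_iter().collect();
    // ...but membership checks are O(1) on average thereafter.
    assert!(set.contains("bar"));
}
```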

Sorted Vec

If we sort the Vec after inserting values, then we can use binary_search to check whether an element is in the array, which takes O(log n) time. The sort itself is additional O(n log n) overhead, but this may be negligible if no elements are inserted afterwards. If elements are inserted later, it is probably faster to use a set.
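A minimal sketch of the sorted-Vec approach:

```rust
fn main() {
    let mut names: Vec<&str> = vec!["delta", "alpha", "charlie"];
    // Pay the O(n log n) sort once, up front.
    names.sort_unstable();
    // binary_search returns Ok(index) if found, Err(insertion_point) if not.
    assert!(names.binary_search(&"charlie").is_ok());
    assert!(names.binary_search(&"bravo").is_err());
}
```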


The ideal data structure here probably depends on the use case. For values that keep being inserted, a HashSet might be better. For a smaller set of values that don't change, a sorted Vec might be better.

@camchenry added the A-linter and C-performance labels Oct 9, 2024
@camchenry
Contributor Author

One related question I have: should we have a custom data structure in oxc for representing sets of strings? Pre-sorting the Vec has some nice benefits if we don't reinsert data, but it has the downside of not enforcing that no elements are inserted later, which, as the binary_search docs say, renders the result useless:

Binary searches this slice for a given element. If the slice is not sorted, the returned result is unspecified and meaningless.

Perhaps we could add a new type named something like StringSet/ReadonlySet/ConstSet/SortedVec which has the following behavior (a rough sketch follows the list):

  • Acts the same as Vec
  • Guarantees that data is always in sorted order when read
  • Methods that insert data such as .insert will throw an error
  • Has methods for explicitly inserting data such as .insert_and_sort()
  • Calling .contains() maps to .binary_search()
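A minimal sketch of what such a type could look like (all names hypothetical):

```rust
/// Hypothetical sketch (names illustrative): a Vec that is always sorted,
/// so `contains` can safely map to `binary_search`.
#[derive(Debug, Default)]
pub struct SortedVec<T: Ord>(Vec<T>);

impl<T: Ord> SortedVec<T> {
    /// Build from arbitrary values, sorting once up front.
    pub fn from_vec(mut values: Vec<T>) -> Self {
        values.sort_unstable();
        Self(values)
    }

    /// The only mutation path: insert, then restore the sorted invariant.
    pub fn insert_and_sort(&mut self, value: T) {
        self.0.push(value);
        self.0.sort_unstable();
    }

    /// `contains` maps to `binary_search`, which is valid because the
    /// constructor and `insert_and_sort` keep the data sorted.
    pub fn contains(&self, value: &T) -> bool {
        self.0.binary_search(value).is_ok()
    }
}
```

Rather than making .insert "throw an error" at runtime, a sketch like this can simply not expose a plain .insert at all, so the sorted invariant is enforced at compile time.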

@camchenry
Contributor Author

camchenry commented Oct 10, 2024

To help us make better decisions here, I did some microbenchmarking on my laptop. I compared .contains on a Vec of strings, binary_search on a sorted Vec, and .contains on an FxHashSet.
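For reference, a sketch of what such a microbenchmark can look like (not the actual code used; a harness such as criterion would give statistically sounder numbers):

```rust
use std::hint::black_box;
use std::time::Instant;

// Time `iters` lookups against a Vec of strings and report ns per iteration.
fn bench_vec_contains(haystack: &[&str], needle: &str, iters: u32) -> f64 {
    let start = Instant::now();
    for _ in 0..iters {
        // black_box keeps the optimizer from deleting the lookup.
        black_box(black_box(haystack).contains(&black_box(needle)));
    }
    start.elapsed().as_nanos() as f64 / f64::from(iters)
}
```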

The results are approximately as follows:

| Benchmark | Elements | Time |
| --- | --- | --- |
| Vec.contains | 1 | 6.3 ns/iter ⭐ |
| Vec.binary_search | 1 | 12.9 ns/iter |
| HashSet.contains | 1 | 16.0 ns/iter |
| Vec.contains (CompactStr) | 1 | 8.7 ns/iter ⭐ |
| Vec.binary_search (CompactStr) | 1 | 14.3 ns/iter |
| HashSet.contains (CompactStr) | 1 | 19.5 ns/iter |
| Vec.contains | 4 | 13.8 ns/iter ⭐ |
| Vec.binary_search | 4 | 32.35 ns/iter |
| HashSet.contains | 4 | 17.5 ns/iter |
| Vec.contains (CompactStr) | 4 | 21.5 ns/iter ⭐ |
| Vec.binary_search (CompactStr) | 4 | 37.4 ns/iter |
| HashSet.contains (CompactStr) | 4 | 22.5 ns/iter |
| Vec.contains | 60 | 59.4 ns/iter |
| Vec.binary_search | 60 | 91.7 ns/iter |
| HashSet.contains | 60 | 17.3 ns/iter ⭐ |
| Vec.contains (CompactStr) | 60 | 152.8 ns/iter |
| Vec.binary_search (CompactStr) | 60 | 96.1 ns/iter |
| HashSet.contains (CompactStr) | 60 | 19.5 ns/iter ⭐ |
| Vec.contains | 256 | 247.9 ns/iter |
| Vec.binary_search | 256 | 118.3 ns/iter |
| HashSet.contains | 256 | 16.9 ns/iter ⭐ |
| Vec.contains (CompactStr) | 256 | 630.9 ns/iter |
| Vec.binary_search (CompactStr) | 256 | 124.2 ns/iter |
| HashSet.contains (CompactStr) | 256 | 19.4 ns/iter ⭐ |

The results here vary somewhat, but after running this a number of times (several billion iterations on each benchmark), my general takeaways are as follows:

  • For a single element, .contains is definitely the fastest, no question about it.
  • For 4 elements, .contains is still usually the fastest, but sometimes .binary_search is faster.
  • For 16+ elements, FxHashSet is unquestionably the fastest option.
  • CompactStr does not seem to be zero cost. It may improve memory usage, but there does seem to be a negative runtime impact that ranges from negligible to considerable.

This suggests to me that maybe we need some sort of hybrid data structure that automatically reallocates/reorganizes as the size of the array grows? Something like this (a rough sketch in code follows the list):

  • For 0-4 elements, it is stored as unsorted and doing .contains calls .contains
  • For 4-16 elements, it is stored as sorted, and doing .contains calls .binary_search
  • For 16+ elements, it is stored as a hash set, and doing .contains calls HashSet.contains
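A rough sketch of that hybrid idea, with the thresholds from the list above (all names and cutoffs hypothetical):

```rust
use rustc_hash::FxHashSet;

// Hypothetical sketch; thresholds and names are illustrative only.
enum StringLookup {
    Small(Vec<String>),       // 0-4 elements: linear scan
    Sorted(Vec<String>),      // 4-16 elements: binary search
    Large(FxHashSet<String>), // 16+ elements: hash lookups
}

impl StringLookup {
    fn contains(&self, value: &str) -> bool {
        // Note: this match is itself a branch on every lookup, one of the
        // costs pointed out later in this thread.
        match self {
            StringLookup::Small(v) => v.iter().any(|s| s == value),
            StringLookup::Sorted(v) => {
                v.binary_search_by(|s| s.as_str().cmp(value)).is_ok()
            }
            StringLookup::Large(set) => set.contains(value),
        }
    }
}
```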

@DonIsaac
Contributor

These are surprising results, and extremely informative. Most configs have only a small number of elements.

Excellent stuff @camchenry

@Dunqing
Member

Dunqing commented Oct 10, 2024

Great!

cc @overlookmotel, you might be interested in this topic as well.

@Boshen
Member

Boshen commented Oct 10, 2024

I purposely used Vec::contains because it is the fastest for small numbers of elements, which most configs have.

@overlookmotel
Contributor

@camchenry Thanks very much for this extremely thorough investigation. Very interesting.

A few thoughts and responses:

There are many cases in the linter...

Can you give some examples? Everything I write below is in the abstract, but I'm not actually sure what we're talking about in practice.

For a 4 elements, .contains is still usually the fastest, but sometimes .binary_search is faster.

I didn't draw this conclusion from your bench results above. It looks more like .binary_search is never faster: Vec::contains or HashSet::contains is always the winner for any size of collection. Did you have some other findings beyond the perf table above which found .binary_search to be better in some circumstances?

CompactStr does not seem to be zero cost. It may improve memory usage, but there does seem to be a negative runtime impact that ranges from negligible to considerable.

Yes. The advantage of CompactStr is that it stores small strings inline so:

  1. less memory usage
  2. fewer allocations
  3. no further memory read required to read the string, as the string data is right there in the CompactStr.

Fewer allocations is probably the most significant of these advantages.

But the downside is that every time you read a CompactStr, CompactStr::as_str has to check "is this string stored inline or not?" That involves more instructions. CompactStr::as_str at least manages to make that check branchless by clever use of "conditional move" instructions. But it's still not so cheap.

So whether CompactStr is overall a benefit or not depends very much on the specific use case.

Personally, I think it's ideal to avoid CompactStr as much as possible if the strings are already in AST. In that case, Atom (which is just a wrapper around &str at present) does not use any more memory, and does not cause allocations. So it has most of the advantages of CompactStr, without the downsides.

Atom, however, has its own downside in terms of ergonomics. It has a lifetime Atom&lt;'a&gt;, so it's less simple to use than the lifetime-less CompactStr.
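A schematic of that trade-off (these are stand-ins, not oxc's actual definitions):

```rust
// Atom<'a> is described above as "just a wrapper around &str": no extra
// memory, no allocation, but the lifetime follows it everywhere.
#[derive(Clone, Copy)]
struct Atom<'a>(&'a str);

// CompactStr owns its bytes (short strings inline in the real type), so it
// has no lifetime parameter, at the cost of the inline/heap check on reads.
struct CompactStr(String); // stand-in representation

fn main() {
    let source = String::from("identifier");
    let atom = Atom(&source);               // borrows from the source text
    let owned = CompactStr(source.clone()); // owns its own copy
    assert_eq!(atom.0, owned.0.as_str());
}
```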

some sort of hybrid data structure that automatically reallocates/reorganizes as the size of the array grows

This is an attractive idea, and in some cases may be a good choice. However, be aware that this structure would have its own downsides:

  1. Every .contains call has to check "am I stored as a Vec or a HashSet?". This is similar to the overhead of CompactStr discussed above. That'll be quite an unpredictable branch (and unpredictable branches are quite costly).
  2. If it's a Vec initially, but grows and needs to convert itself to a HashSet, that conversion is not so cheap either. It would be less costly if it was just a HashSet from the start.

So it may be that rather than having a "one size fits all" solution, we're better off using a heuristic in each use case (how big is this thing usually?) and choosing Vec or HashSet for that use case accordingly.

One other thing to bear in mind is that when a HashSet grows beyond its capacity, every item in the HashSet has to have its hash recalculated. FxHash is a pretty cheap hash function (only 3 instructions to hash a u32 or u64) but hashing strings is much more costly.
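One mitigation, when the final size is known up front, is to pre-size the set so it never grows (hypothetical example using the rustc-hash crate):

```rust
use rustc_hash::FxHashSet;

fn main() {
    let names = ["foo", "bar", "baz"];
    // Reserving the final capacity up front means the set never grows,
    // so no string is ever rehashed after insertion.
    let mut set: FxHashSet<&str> =
        FxHashSet::with_capacity_and_hasher(names.len(), Default::default());
    set.extend(names);
    assert!(set.contains("baz"));
}
```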

A further consideration is that I think at some point, we'll likely introduce a new Ident type which stores the hash inline, which will make FxHashSet<Ident> much cheaper. But I'm not sure when we'll get to implementing that.
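A minimal sketch of what such an Ident type could look like (hypothetical, not an agreed design):

```rust
use std::hash::{Hash, Hasher};

// Hash the string once at construction; Hash then replays the cached value.
struct Ident<'a> {
    text: &'a str,
    hash: u64,
}

impl<'a> Ident<'a> {
    fn new(text: &'a str) -> Self {
        let mut hasher = rustc_hash::FxHasher::default();
        text.hash(&mut hasher);
        Self { text, hash: hasher.finish() }
    }
}

impl Hash for Ident<'_> {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // No string traversal: feed the precomputed hash straight through.
        state.write_u64(self.hash);
    }
}

impl PartialEq for Ident<'_> {
    fn eq(&self, other: &Self) -> bool {
        // Cheap reject on hash mismatch; full compare only when hashes agree.
        self.hash == other.hash && self.text == other.text
    }
}
impl Eq for Ident<'_> {}
```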

TLDR

Sorry I've written an essay. Your investigations raise some pretty fundamental and complicated questions!

What can we do in practice?

  1. The microbenchmarks are helpful, but I think we'd also need to try in practice replacing some Vecs with HashSets and see what "real" benchmarks say. With this stuff, often you can't really know how perf will be affected without trying and measuring.
  2. Replace CompactStr with Atom wherever possible. I suspect that'll be a win either way.
  3. We could consider trying to advance the Ident idea if we want to store strings in HashSets.

@camchenry
Contributor Author

There are many cases in the linter...

Can you give some examples? Everything I write below is in the abstract, but I'm not actually sure what we're talking about in practice.

I should have included this in the original issue; I mainly filed it in response to some PR feedback from @DonIsaac that I wanted to investigate further.

Here are some examples where we are using Vec.contains in the linter:

For a 4 elements, .contains is still usually the fastest, but sometimes .binary_search is faster.

I didn't draw this conclusion from your bench results above. It looks more like that .binary_search is never faster. Vec::contains or HashSet::contains is always the winner for any sized collection. Did you have some other findings beyond the perf table above which found .binary_search to be better in some circumstances?

This was just taken from the last benchmarking run that I did: there were some other cases where binary_search was slightly faster, but it wasn't consistent. I suspect it might be faster in some cases in the 8-64 element range, but once you get beyond 16+ elements it's probably better to just use a hash set anyway.

So it may be that rather than having a "one size fit all" solution, we're better off using a heuristic in each use case (how big is this thing usually?) and choosing Vec or HashSet for that use case accordingly.

I agree. I did some follow-up benchmarking where I created this hybrid data structure and did some basic testing on it. In practice, it was not faster than just carefully choosing a Vec or FxHashSet to begin with.

  1. The microbenchmarks are helpful, but I think we'd also need to try in practice replacing some Vecs with HashSets and see what "real" benchmarks say. With this stuff, often you can't really know how perf will be affected without trying and measuring.

Again, I agree. My profiling hasn't shown Vec.contains to ever be a huge bottleneck, compared to things like memmove or free (for allocating and deallocating), which take up much more time.

  2. Replace CompactStr with Atom wherever possible. I suspect that'll be a win either way.

Good to know. This won't be an option in all cases, since many Vec&lt;CompactStr&gt; values are read from the configuration file, but I bet there are some small wins to be found here.

  3. We could consider trying to advance the Ident idea if we want to store strings in HashSets.

This seems like something that would be nice to have. But again, the profiles that I have seen don't show checking if a string is in a set to be a significant bottleneck for linting.

I think we could probably save more time if we just avoided doing string lookups altogether by returning earlier, using different algorithms, etc.
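As a hypothetical illustration of that last point, a cheap structural check can skip the lookup entirely in the common case:

```rust
use rustc_hash::FxHashSet;

// Illustrative only: a cheap structural check filters out most inputs
// before any set lookup happens.
fn is_restricted(name: &str, restricted: &FxHashSet<String>) -> bool {
    // Suppose the rule only ever restricts names starting with '_':
    // most identifiers then never reach the hash lookup at all.
    if !name.starts_with('_') {
        return false;
    }
    restricted.contains(name)
}
```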

@Boshen
Member

Boshen commented Oct 14, 2024

Closing as not planned, as I see no actionable items coming out of the investigation.

It seems like the rule of thumb is to use contains, given that most configs have a small number of values.

@Boshen closed this as not planned Oct 14, 2024