linter: Optimize string lookups with better data structure #6408
Comments
One question I have related to this is: should we have a custom data structure in
Perhaps we could add a new type named something like
|
To help us make better decisions here, I did some microbenchmarking on my laptop. I compared doing The results are approximately as follows:
The results here vary somewhat, but after running this a number of times (several billion iterations on each benchmark), my general takeaways are these:
This suggests to me that maybe we need some sort of hybrid data structure that automatically reallocates/reorganizes as the size of the array grows?
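A comparison of this kind can be sketched roughly as follows. This is not the benchmark that produced the results above. It is a minimal illustration that assumes the standard-library `HashSet` as a stand-in for `FxHashSet` (to avoid an external crate), and the names `time_lookups` and `compare` are hypothetical. It times the worst case for a linear scan, where the needle is the last element.

```rust
use std::collections::HashSet;
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Runs `lookup` `iterations` times; returns the number of hits and the elapsed time.
fn time_lookups(iterations: u32, mut lookup: impl FnMut() -> bool) -> (u32, Duration) {
    let start = Instant::now();
    let mut hits = 0;
    for _ in 0..iterations {
        // black_box discourages the optimizer from removing the lookup.
        if black_box(lookup()) {
            hits += 1;
        }
    }
    (hits, start.elapsed())
}

/// Times three lookup strategies over a collection of `size` strings.
fn compare(size: usize, iterations: u32) -> [(&'static str, u32, Duration); 3] {
    let items: Vec<String> = (0..size).map(|i| format!("item{i}")).collect();
    // The last element is the worst case for a linear scan.
    let needle = format!("item{}", size.saturating_sub(1));

    let set: HashSet<&str> = items.iter().map(String::as_str).collect();
    let mut sorted = items.clone();
    sorted.sort_unstable();

    let (linear_hits, linear_time) =
        time_lookups(iterations, || black_box(&items).contains(black_box(&needle)));
    let (set_hits, set_time) =
        time_lookups(iterations, || black_box(&set).contains(black_box(needle.as_str())));
    let (sorted_hits, sorted_time) =
        time_lookups(iterations, || black_box(&sorted).binary_search(black_box(&needle)).is_ok());

    [
        ("Vec::contains", linear_hits, linear_time),
        ("HashSet::contains", set_hits, set_time),
        ("sorted Vec + binary_search", sorted_hits, sorted_time),
    ]
}
```

Calling `compare` with a range of sizes (for example 4, 16, 64, 256) and printing the durations gives a rough picture of where the strategies cross over on a given machine. A proper benchmark harness would give more reliable numbers than this hand-rolled timer.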
|
These are surprising results, and extremely informative. Most configs have only a small number of elements. Excellent stuff @camchenry |
Great! Maybe you are interested in this topic as well @overlookmotel cc. |
I purposely used |
@camchenry Thanks very much for this extremely thorough investigation. Very interesting. A few thoughts and responses:
Can you give some examples? Everything I write below is in the abstract, but I'm not actually sure what we're talking about in practice.
I didn't draw this conclusion from your bench results above. It looks more like that
Yes. The advantage of
Fewer allocations is probably the most significant of these advantages. But the downside is that every time you read a So whether Personally, I think it's ideal to avoid
This is an attractive idea, and in some cases may be a good choice. However, be aware that this structure would have its own downsides:
So it may be that rather than having a "one size fits all" solution, we're better off using a heuristic in each use case (how big is this thing usually?) and choosing One other thing to bear in mind is that when a A further consideration is that I think at some point, we'll likely introduce a new
TLDR
Sorry I've written an essay. Your investigations raise some pretty fundamental and complicated questions! What can we do in practice?
|
I should have included this in the original issue, I mainly filed this in response to some PR feedback from @DonIsaac that I wanted to do more investigation on. Here are some examples where we are using
This was just taken from the last benchmarking run that I did: there were some other cases where
I agree. I did some follow-up benchmarking where I created this hybrid data structure and did some basic testing on it. In practice, it was not faster than just carefully choosing a
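For readers wondering what such a hybrid might look like, here is a minimal sketch. It is not the structure that was actually benchmarked. The type `HybridSet` and the cutoff `THRESHOLD` are hypothetical, and the standard-library `HashSet` stands in for `FxHashSet`. It starts as a `Vec` with linear scans and switches to a hash set once it grows past the cutoff.

```rust
use std::collections::HashSet;

/// Hypothetical cutoff; a real value would have to come from benchmarking.
const THRESHOLD: usize = 16;

/// Stores strings in a Vec while small and in a HashSet once large.
enum HybridSet {
    Small(Vec<String>),
    Large(HashSet<String>),
}

impl HybridSet {
    fn new() -> Self {
        HybridSet::Small(Vec::new())
    }

    /// Inserts `value`; returns true if it was not already present.
    fn insert(&mut self, value: String) -> bool {
        let promoted = match self {
            HybridSet::Large(set) => return set.insert(value),
            HybridSet::Small(items) => {
                if items.contains(&value) {
                    return false;
                }
                items.push(value);
                if items.len() <= THRESHOLD {
                    return true;
                }
                // Past the cutoff: move every element into a hash set.
                std::mem::take(items).into_iter().collect::<HashSet<String>>()
            }
        };
        *self = HybridSet::Large(promoted);
        true
    }

    /// Linear scan while small, hash lookup once large.
    fn contains(&self, value: &str) -> bool {
        match self {
            HybridSet::Small(items) => items.iter().any(|item| item.as_str() == value),
            HybridSet::Large(set) => set.contains(value),
        }
    }

    /// True once the set has switched to the hashed representation.
    fn is_hashed(&self) -> bool {
        matches!(self, HybridSet::Large(_))
    }
}
```

Every `contains` call pays for a branch on the current representation, and the promotion step rehashes all elements at once, which is one reason a structure like this may not beat a well-chosen fixed representation.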
Again, I agree. My profiling hasn't shown
Good to know, this won't be an option in all cases since many
This seems like something that would be nice to have. But again, the profiles that I have seen don't show checking if a string is in a set to be a significant bottleneck for linting. I think we could probably save more time if we just avoided doing string lookups altogether by returning earlier, using different algorithms, etc. |
Close as not planned as I see no actionable items coming out of the investigation. It seems like the rule of thumb is use |
There are many cases in the linter where we store `Vec<String>`, `Vec<&str>`, or `Vec<CompactStr>` and then later check if `vec.contains(something)`. At runtime, this takes `O(n)` time to find if the given value is in the array. However, we can do better by replacing `Vec` with a different data structure.

FxHashSet

If we replace `Vec` with `FxHashSet`, we get the benefit of lookups taking `O(1)` time, at the cost of inserts taking longer due to hashing.

Sorted Vec

If we sort the `Vec` after inserting values, then we can use `binary_search` to look up if the element is in the array, which takes `O(log(n))` time. The sorting process itself is additional overhead of `O(n log(n))`, but might be negligible if elements are not inserted into the array afterwards. Otherwise, if elements are inserted later, it is probably faster to use a set.

The ideal data structure here probably depends on the use case. For re-inserting values, a `HashSet` might be better. For a smaller set of values that don't change, a sorted `Vec` might be better.
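The three options can be put side by side in a short sketch. It assumes the standard-library `HashSet` in place of `FxHashSet` so that it needs no external crate (`FxHashSet` from the `rustc-hash` crate is a `HashSet` with a different hasher), and the function names are hypothetical, not taken from the linter.

```rust
use std::collections::HashSet;

/// Linear scan: O(n) per lookup, no setup cost.
fn contains_linear(items: &[&str], needle: &str) -> bool {
    items.contains(&needle)
}

/// Hash every item once so that later lookups are O(1) on average.
fn build_set<'a>(items: &[&'a str]) -> HashSet<&'a str> {
    items.iter().copied().collect()
}

/// Sort once (O(n log n)) so that later lookups can use binary search.
fn build_sorted<'a>(items: &[&'a str]) -> Vec<&'a str> {
    let mut sorted = items.to_vec();
    sorted.sort_unstable();
    sorted
}

/// Binary search: O(log n) per lookup; `sorted` must already be sorted.
fn contains_sorted(sorted: &[&str], needle: &str) -> bool {
    sorted.binary_search(&needle).is_ok()
}
```

With the set, a lookup is `build_set(&items).contains("name")`. With the sorted `Vec`, any later insertion has to keep the order intact (or be followed by a re-sort), otherwise `binary_search` can miss elements that are present.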