Optimize top-k counting for approximate queries #160
Hmm, why is the self time so high? Is there any way to cluster multiple hashes into a single term during indexing? I.e. do certain subsets of hashes frequently co-occur at search time?
Hi @mikemccand, thanks for the reply. As a side note, I've found many of your articles very helpful!
To clarify a bit, the first screenshot is of a benchmark that uses Lucene by itself, i.e. no Elasticsearch involved. The second is from inside the search thread in the Elasticsearch JVM. The self time in the first case is proportionally higher, but in both cases it is > 50% of the method runtime. The countHits method is here: https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-lucene/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L54-L73 The rough outline of countHits: for each of the query vector's hash terms, seek the term in the segment and iterate its postings, incrementing a per-doc counter for every matching doc.
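In simplified form, that term-first counting loop looks roughly like the following (an illustration of the shape, not the actual plugin code; the field name, the `List<BytesRef>` of hashes, and the short-array counter are assumptions):

```java
// Sketch of a term-first counting loop over a single segment (illustration only).
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BytesRef;

import java.io.IOException;
import java.util.List;

final class TermFirstCounting {
  static short[] countHits(LeafReader reader, String field, List<BytesRef> hashes) throws IOException {
    short[] counts = new short[reader.maxDoc()];      // one counter slot per doc in the segment
    Terms terms = reader.terms(field);
    if (terms == null) return counts;                 // field not present in this segment
    TermsEnum termsEnum = terms.iterator();
    PostingsEnum postings = null;
    for (BytesRef hash : hashes) {
      if (!termsEnum.seekExact(hash)) continue;       // this hash term has no postings here
      postings = termsEnum.postings(postings, PostingsEnum.NONE);
      for (int doc = postings.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = postings.nextDoc()) {
        counts[doc]++;                                // term-first: each matching doc is visited once per term
      }
    }
    return counts;
  }
}
```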
Perhaps it's taking a while to allocate the array-backed counter? I don't have a good sense for whether that should actually take such a significant amount of time. Even if it does, the only alternatives I can think of are even slower. I can try it out later with a dummy counter implementation that just throws away the counts to get a lower bound. Would appreciate any more input you have about this and #156!
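A sketch of what that lower-bound experiment could look like: a small counter interface with the array-backed implementation and a no-op variant that discards increments (the names are hypothetical, not the plugin's actual classes):

```java
// Sketch: swap the array-backed counter for a no-op counter to see how much of the
// method's time is iteration vs. actual counting.
interface HitCounter {
  void increment(int docId);
  short get(int docId);
}

final class ArrayHitCounter implements HitCounter {
  private final short[] counts;
  ArrayHitCounter(int maxDoc) { counts = new short[maxDoc]; }
  public void increment(int docId) { counts[docId]++; }
  public short get(int docId) { return counts[docId]; }
}

final class NoopHitCounter implements HitCounter {
  // Throws away all counts; only useful for lower-bounding the iteration cost.
  public void increment(int docId) { }
  public short get(int docId) { return 1; }
}
```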
That would be a smart optimization, but I'm not sure there's a way to do that without having the entire dataset up-front, which doesn't fit the model of the plugin (index/update/delete vectors just like you would index/update/delete traditional docs).
Some more notes-to-self for when I get back to this: Here are the VisualVM hotspots from running the SIFT benchmark (1M stored vectors, 10k total queries) on a local ES JVM. Here are the VisualVM hotspots from the same configuration, but using a dummy counter that just returns the same count and maintains no internal state (i.e. no memory allocated/updated for counting). I made sure to start the sampling right when it started running queries and stop right after it finished queries. Otherwise the percentages get skewed as the search thread continues running.
Thanks, I am glad to hear that :)
Measuring just the iteration and no count incrementing is a good idea. Oh I see, you just concurrently did that, as I am here commenting, heh.
Yeah, it is not simple to implement, and would require that you have some way to tell which hashes frequently co-occur during indexing. It would also increase your index size, but then make queries faster. You might also index larger and larger co-occurrence sets, so that at search time you could "factor out" the largest set(s) that can apply to that query. How large is each vector hash?
Thanks again for digging into this a bit.
Good to know, I'll fix that.
I might already be doing this in the query constructor, here
I modeled the implementation after the
With this benchmark config, there are 100 hashes per vector, 20 bytes each. I'd say that's a pretty typical setup.
Maybe there's a clever stopping criterion for visiting all of the terms? I started reading about MaxScore and WAND scoring. Maybe that's a dead end here?
Put together a working early-stopping solution today. Roughly the idea is:
Seems to work on the benchmark discussed previously. The times are roughly equivalent, but there's some obvious room for optimization. The method spends more time managing the heap than it does accessing Lucene postings. I'm using a standard heap implementation for now.
That sounds promising! Be sure to test on real vectors. Testing on random data leads to random conclusions!
I thought about this but my understanding was that Lucene is optimized to seek over the terms in sorted order? Maybe the cost of unsorted seeks over terms is trivial compared to other things here?
I'm afraid the early-stopping method as I described it isn't going to work. Specifically, it's pretty easy to find a case where a single vector matches for multiple consecutive hash terms and fills up the heap with its incrementing match counts. You end up determining that the cutoff for match counts is an old match count for that doc and you early-stop, but then there's only one doc that exceeds that cutoff. Replacing a doc's previous count in the heap might solve this?
You are right, Lucene would prefer that you seek the terms in Unicode order. But it allows you to NOT do so, and the performance penalty might be lower than the gains you could see by early terminating sooner? Just pure speculation ... I think you would need to test and see to know for sure.
Hmm, I see, your top-K heap allows the same doc to be added in multiple slots?
OK, sigh :) I still think you should explore indexing "sets of commonly co-occurring hashes" if possible. If your query-time vectors are such that the same sets of hashes commonly co-occur (i.e. you can predict these common subsets offline, before indexing, with highish accuracy to future queries), the impact can be sizable. It'd make your index a bit larger, but allow simpler/faster queries at search time.
I agree. I'm trying to think through some ways that could work without requiring a huge sample of representative vectors up-front.
I'm trying to make precise what advantage indexing co-occurring hashes would have. If we assume that Lucene's retrieval speed is maxed out, the next obvious speedup is to somehow decrease the number of irrelevant docs that I'm iterating over, i.e. increase the "quality" of candidates retrieved for each term. Aggregating co-occurring hashes would definitely lead to fewer terms. Would it lead to higher-quality matches for each term? As the number of hashes in an aggregate increases, the probability of retrieving an irrelevant candidate decreases. But that's also the exact same reasoning and procedure behind the LSH model itself.
Oh, I was proposing a purely functionality-neutral optimization, since indexing co-occurring hashes would result in fewer disjunctive terms at search time, and should make your searches run faster? But you're right, the scoring would be impacted, unless you could e.g. make a co-occurring hash that matches 7 hashes count as +7 at search time. I had not thought through the implications of changing the score ...
Yeah, really we should be able to know precisely how many times each vector matched. It's already an approximate algorithm, so introducing more approximation for performance's sake might not be worth it. I'm thinking we've possibly hit the limit here. This effort is mostly motivated by trying to get the performance on-par with some C/C++ in-memory approximate nearest neighbors solutions. Right now for this benchmark dataset, the queries/second is 1 to 2 OOM lower than those alternatives. Of course there are other tradeoffs that make this a favorable approach in some ways. But after trying out these options and getting some sanity checks on the current implementation (thanks again to @mikemccand!), I think I'm fine with the conclusion that we've hit the limit for now.
Since one of the major bottlenecks is reading the matched doc ids from the segment, the last thing I'm tempted to look into is whether there is some construct for caching access to the postings for a particular term in a segment. I don't see any existing construct like that in the Lucene library. I could hack it together myself pretty easily using Guava. But ideally it would also bust the cache when there are appends to a segment, and I'm not sure how I'd wire it up so that the cache is busted on appends.
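A rough sketch of what such a Guava-backed postings cache could look like, keyed on the segment's core cache key plus the term (the field name, cache size, and the int[] materialization are assumptions; this does not address the append/invalidation concern above):

```java
// Sketch: cache materialized doc ids per (segment core key, term) with a Guava cache.
// The core cache key is stable for a given segment core, so entries simply stop being
// looked up when that segment goes away.
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BytesRef;

import java.io.IOException;
import java.util.AbstractMap;
import java.util.Map;
import java.util.concurrent.ExecutionException;

final class PostingsCache {
  private final Cache<Map.Entry<Object, BytesRef>, int[]> cache =
      CacheBuilder.newBuilder().maximumSize(100_000).build();

  int[] docIds(LeafReader reader, String field, BytesRef term) throws IOException {
    Map.Entry<Object, BytesRef> key = new AbstractMap.SimpleImmutableEntry<>(
        reader.getCoreCacheHelper().getKey(), BytesRef.deepCopyOf(term));
    try {
      return cache.get(key, () -> readDocIds(reader, field, term));
    } catch (ExecutionException e) {
      throw new IOException(e.getCause());
    }
  }

  private static int[] readDocIds(LeafReader reader, String field, BytesRef term) throws IOException {
    TermsEnum termsEnum = reader.terms(field).iterator();   // assumes the field exists in this segment
    if (!termsEnum.seekExact(term)) return new int[0];
    PostingsEnum postings = termsEnum.postings(null, PostingsEnum.NONE);
    int[] docIds = new int[termsEnum.docFreq()];             // docFreq == number of docs in the postings
    int i = 0;
    for (int doc = postings.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = postings.nextDoc()) {
      docIds[i++] = doc;
    }
    return docIds;
  }
}
```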
I am wondering whether you considered using top-k aggregation algorithms like TA or NRA (see http://alumni.cs.ucr.edu/~skulhari/Top-k-Query.pdf) that avoid traversing the complete posting list (and actually counting the hits)? And how do they compare with the early-termination strategy that you experimented with? Also, did you consider using the WANDScorer in Lucene? I guess that these algorithms also don't perform well in this case because the query has many terms?
Hi @aecio. Thanks for the link. I wasn't aware of those algorithms so I'll read the slides, hopefully tonight. I looked through the WANDScorer code. I didn't really see a way to apply it here. IIRC it looked like it was designed to optimize the top-hits scoring for N individual TermQuery clauses, rather than one query with N terms. A couple months ago I was using a BooleanQuery with a TermQuery in a should clause for each of the terms. That was 2-3x slower compared to the current implementation, and re-scoring the vectors with exact similarities was a much more awkward arrangement.
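For context, that earlier approach was roughly one disjunctive query per search vector, along these lines (the field name and hash terms are placeholders, not the plugin's actual code):

```java
// Sketch: one SHOULD TermQuery clause per hash of the query vector.
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.util.BytesRef;

import java.util.List;

final class DisjunctionOfHashes {
  static Query build(String field, List<BytesRef> hashes) {
    BooleanQuery.Builder builder = new BooleanQuery.Builder();
    for (BytesRef hash : hashes) {
      builder.add(new TermQuery(new Term(field, hash)), BooleanClause.Occur.SHOULD);
    }
    return builder.build();
  }
}
```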
@aecio The algorithms in those slides seem worth trying. It seems that for both Fagin's Algorithm and the Threshold Algorithm, I'd need to keep an array of PostingsEnums to "step" through the docs for each of the terms. I'll probably try the TA first. It's not clear to me how I'd implement the No-Random-Access algorithm. Specifically, I'm confused about how they define the lower and upper bounds.
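A minimal sketch of the "array of PostingsEnums" idea, advancing all term iterators document-at-a-time (an illustration of the stepping mechanism only, not a full TA implementation; the names and field are assumptions):

```java
// Sketch: keep one PostingsEnum per query hash and step through docs
// document-at-a-time, counting how many terms match the current doc.
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BytesRef;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

final class DocFirstStepping {
  static void traverse(LeafReader reader, String field, List<BytesRef> hashes) throws IOException {
    // One iterator per hash term that exists in this segment (assumes the field exists).
    List<PostingsEnum> iterators = new ArrayList<>();
    TermsEnum termsEnum = reader.terms(field).iterator();
    for (BytesRef hash : hashes) {
      if (termsEnum.seekExact(hash)) {
        iterators.add(termsEnum.postings(null, PostingsEnum.NONE));
      }
    }
    for (PostingsEnum p : iterators) p.nextDoc();  // position each iterator on its first doc

    while (true) {
      // The smallest current doc id across all iterators is the next doc to evaluate.
      int doc = DocIdSetIterator.NO_MORE_DOCS;
      for (PostingsEnum p : iterators) doc = Math.min(doc, p.docID());
      if (doc == DocIdSetIterator.NO_MORE_DOCS) break;

      int matched = 0;
      for (PostingsEnum p : iterators) {
        if (p.docID() == doc) {
          matched++;
          p.nextDoc();  // advance only the iterators positioned on this doc
        }
      }
      // `matched` is this doc's hit count; a TA/NRA-style loop would use it to update
      // the top-k heap and decide whether it can stop early.
    }
  }
}
```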
Took a first pass at the Threshold Algorithm this morning: https://github.com/alexklibisz/elastiknn/blob/top-k-algos/elastiknn-lucene/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L51-L134
I'm not seeing any practical improvement from using the Threshold Algorithm. It is nice to know that I can keep an iterator for each term and step through the sorted docs, effectively doing a document-first traversal rather than a term-first. That might be useful in revisiting my original idea for early-stopping.
I've been going through the various optimizations described in this fantastic paper: http://www.dcs.gla.ac.uk/~craigm/publications/fnt-efficient-query-processing.pdf So far I've tried the TAAT Quit and Continue described on page 39, TAAT MaxScore on page 41, and DAAT MaxScore on page 47. TAAT Quit and Continue yields a significant speedup on the Fashion MNIST dataset. Edit: this was sadly a bug in how I was evaluating recall. The change made it so that there were duplicate docs returned. Those duplicates were very good, but a query for k=100 docs was usually only returning about 30 distinct docs. 🤦
Thanks for reporting the results of your experiments @alexklibisz! The paper you mentioned is indeed excellent. If you're willing to dig deeper into this, the following paper is more recent and describes new techniques not included in that paper: https://dl.acm.org/doi/10.1145/3397271.3401076 I'm not sure they would be directly applicable to your use case, but they have some plots that show their algorithm becomes particularly more efficient as the number of terms in the query increases. They compare with several variants of these optimizations, including BlockMax MaxScore, BlockMax WAND, and others. WRT the previous comments, can you clarify what you mean by "The kth greatest score is consistently ~2 OOM lower than the threshold"? I'm not sure what OOM means in this context.
OOM = orders of magnitude. Not out-of-memory :)
@aecio I appreciate your tips. I'm checking out the paper you linked. I'll also explain the "pathological parameters on the SIFT dataset" that I mentioned before.
Thanks for the details. It all makes sense now. :) If the problem is iterating over the long postings lists, then a promising solution might be to use the BlockMax optimizations that allow you to jump whole blocks of the posting lists that have no competitive scores (as opposed to evaluating every single doc).
As I read more about the block-max strategies, I don't actually see it being a particularly useful optimization here. The reason I think this: Every vector ("document") has the exact same number of hashes ("terms"). So the term upper bound for any given term should be the same for every doc in the postings list for that term. Maybe I'm missing something, but the block-max strategies seem to be predicated on the idea that there are blocks of docs where the document upper-bound within that block is so low that you shouldn't even bother looking, so you skip the whole block.
I thought that because each hash (term) has a numeric value (frequency) associated with it, the score upper bounds would not be the same because they would depend on the frequency value. But I may be missing something as well; I only have a shallow understanding of both your query and these query optimizations.
Ah, terms do technically have an associated frequency. It's only > 1 for one of the LSH models (Permutation LSH, with one particular setting).
@aecio LMK if you're interested in trying out the code to implement some optimizations. I haven't yet documented the development setup, but it's fairly simple so I'd be happy to write up a development guide.
Some summary from looking at various top-k pruning optimizations over the long weekend:
I recently watched Adrien Grand's talk at ApacheCon 2021, where he spent some time covering sorted indices. I'm thinking this might be an interesting option for speeding up this query. If I understand it correctly, this would look like:
This is also relevant re #278. I'll likely give it a try as part of implementing the big-ann-benchmarks submission.
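For reference, configuring an index sort in Lucene happens on the IndexWriterConfig. A minimal sketch (the "sort_key" field and its type are placeholders, not a claim about what the plugin would actually sort on):

```java
// Sketch: configure a Lucene index sort so docs within each segment are stored in order
// of a numeric doc value. "sort_key" is hypothetical; every indexed doc would need a
// NumericDocValuesField("sort_key", ...) for this to work.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

import java.io.IOException;

final class SortedIndexExample {
  static IndexWriter openSortedWriter() throws IOException {
    Directory dir = new ByteBuffersDirectory();
    IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
    // Segments (and merged segments) keep their documents in this order on disk.
    config.setIndexSort(new Sort(new SortField("sort_key", SortField.Type.INT)));
    return new IndexWriter(dir, config);
  }
}
```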
I benchmarked a few primitive collections libraries, to see if any of them could be used to optimize the hit counting. The libraries I tried:
The benchmarks are on this branch: https://github.com/alexklibisz/elastiknn/tree/160-primitive-collections-benchmarks, specifically in this file: https://github.com/alexklibisz/elastiknn/blob/160-primitive-collections-benchmarks/elastiknn-micro-benchmarks/src/main/scala/com/klibisz/elastiknn/vectors/HitCounterBenchmarks.scala Here are the results:
So the most promising option looks like the Eclipse IntShortHashMap.
I took another pass at this today. Specifically, I tried the new-ish IntIntHashMap from Lucene 9.x to track the document counts. I implemented a simple JMH benchmark here: https://github.com/alexklibisz/elastiknn/pull/598/files#diff-c9afec758d911ce41f3ef81c96227c75c85e36465b0248bec43dccbcd20ae655 The benchmark results are promising, particularly when I simulate large segment sizes. Here are the results on an AWS EC2 r6i.8xlarge instance with a simulated segment size of 60k, 2k hits, initial counter size 1k:
The Lucene IntIntHashMap is ~10% slower than using a plain array, but should use less memory. The Eclipse IntShortHashMap is 2x faster than both of them. However, the Eclipse collections JAR is ~10MB, which I'd prefer to avoid. So I implemented a HitCounter based on the Lucene IntIntHashMap in this PR: #598 But the benchmark results got worse across the board (again on the same instance). So I'm gonna sit on this for now. Maybe the Lucene IntIntHashMap would have a greater effect on a larger dataset. Maybe it's worthwhile to pull in the Eclipse collections library, or maybe I can extract just that particular IntShortHashMap as a standalone file. For now I want to try a couple other ideas.
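For reference, the shape of an IntIntHashMap-backed counter is roughly the following (a sketch, not the code in #598; it assumes Lucene's port keeps the HPPC-style constructor, put, getOrDefault, and size methods):

```java
// Sketch: a hit counter backed by Lucene's IntIntHashMap instead of a maxDoc-sized array.
// Trades some per-increment cost for memory proportional to the number of candidate docs.
import org.apache.lucene.util.hppc.IntIntHashMap;

final class HashMapHitCounter {
  private final IntIntHashMap counts;

  HashMapHitCounter(int expectedCandidates) {
    counts = new IntIntHashMap(expectedCandidates);
  }

  void increment(int docId) {
    // HPPC-style maps return the default value (0) for absent keys,
    // so this works for both the first and subsequent increments.
    counts.put(docId, counts.getOrDefault(docId, 0) + 1);
  }

  int count(int docId) {
    return counts.getOrDefault(docId, 0);
  }

  int numHits() {
    return counts.size();
  }
}
```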
Closing in favor of #662
Currently the biggest bottleneck at query time is the countHits method in MatchHashesAndScoreQuery. This counts the number of times each doc in the segment matches one of the query vector's hashes. https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-lucene/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L54-L73
AFAIK, Lucene is generally optimized for a small number of terms (e.g. the words in a search query). Elastiknn, on the other hand, can require retrieving doc IDs for tens or hundreds of terms (the hashes of a query vector).
It seems the main thing worth exploring is using a different PostingsFormat, or potentially implementing a custom one. Maybe there's a way to optimize the storage/access pattern for checking a larger number of terms? Maybe there's a way to simply wrap an existing postings format and offer some helpers that cache the expensive method calls?
Some specific numbers:
When running the MatchHashesAndScoreQueryPerformanceSuite, with 100k indexed vectors and 5k search vectors, the VisualVM sampler reports spending ~92% of its time in the countHits method.
When running the ContinuousBenchmark with the SIFT dataset (1M indexed vectors, 1k search vectors), the VisualVM sampler reports spending ~36% of the search thread time in countHits.