Add refinement of quantized vector scores with fp distance calculations #13564
Comments
I think the API would be tricky, but I am all for this idea. The ability to "approximately score" and then do a second pass to "exact score" is useful for all sorts of levels of quantization. Whatever the design, it would be most efficient to first gather the nearest vectors from ALL segments with an approximate score, and then do a second pass over all segments to refine the top k. Rescoring per segment would be needlessly inefficient.
Yes, agree. I'll think on this a little bit. I'll start with a PoC and go from there.
Yes, I agree. However, with this, I think we would need to refine not the top k but the top r*k and then reduce to k. Otherwise, I don't think that recall would actually be improved - just the ordering might be better.
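(For concreteness, with made-up numbers: if k = 10 and the oversample factor r = 5, the first pass gathers the top 50 hits by approximate, quantized score across all segments, and the second pass re-scores those 50 with full-precision distances and keeps the best 10.)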
100%, I wasn't clear. Yes, over all segments, we gather some approximate scoring that is the top r*k, then refine and reduce to the top k.
Makes sense, thanks @benwtrent. I'm working on a PoC and some experiments. I didn't realize that the full-precision vectors for a quantized index are exposed via getFloatVectorValues. That should make things simpler. I was also looking at this comment for DiskANN from @kevindrosendahl (#12615 (comment)) and noticed that mlock'ing parts of the index in memory became really important for performance in lower-memory environments. I don't think this is currently supported in Lucene, but I will take a look.
Hi all, I was looking at this idea for some experimentation (not meant to be intrusive to the ongoing effort). The full-sized vectors being exposed through getFloatVectorValues should help here. However, I don't know how to get the full-sized vectors without a LeafReaderContext. Do you have an idea how we can achieve that? Or maybe we can just reduce the oversample to account for the number of segments.
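(One illustrative reading of that last suggestion, with made-up numbers: if r*k = 80 and the index has 8 segments, each per-segment search would be asked for only 80 / 8 = 10 oversampled candidates, so the re-ranking work would not multiply with the segment count.)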
Exact search will hopefully just be using the quantized values as well.
I don't see why that would be a problem. As I understand it, you would need to iterate the segments twice regardless.
However, the query shouldn't bother attempting to rerank results that are considered "accurate". So, the collector will likely need to be told "Hey, these are estimations", and then a second pass over the top docs there would be possible.
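A minimal sketch of what that second, full-precision pass could look like, assuming a Lucene 9.x-style FloatVectorValues iterator, a field named "vec", and dot-product similarity; all of these choices are illustrative and not from the thread's actual PoC:

```java
import java.io.IOException;
import java.util.List;
import org.apache.lucene.index.FloatVectorValues;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.ReaderUtil;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

static void rescoreWithFullPrecision(IndexReader reader, TopDocs approx, float[] target)
    throws IOException {
  List<LeafReaderContext> leaves = reader.leaves();
  for (ScoreDoc hit : approx.scoreDocs) {
    // Map the top-level doc id back to its segment.
    LeafReaderContext ctx = leaves.get(ReaderUtil.subIndex(hit.doc, leaves));
    int segmentDoc = hit.doc - ctx.docBase;
    // Full-precision vectors stay readable even when the index is quantized
    // (the getFloatVectorValues observation made earlier in the thread).
    FloatVectorValues values = ctx.reader().getFloatVectorValues("vec");
    if (values != null && values.advance(segmentDoc) == segmentDoc) {
      hit.score = VectorSimilarityFunction.DOT_PRODUCT.compare(target, values.vectorValue());
    }
  }
  // Callers would then re-sort approx.scoreDocs by the refined score and keep the top k.
}
```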
Hi @benwtrent, thanks for the reply. It seems I wasn't clear enough in my previous question.
Update: It seems the raw (full-precision) vectors are stored in QuantizedVectorValues along with the quantized values.
Maybe I misunderstood something, but I thought the idea is to re-rank with full-precision vectors over the oversampled matches?
I also see where my misunderstanding came from now.
Hey @dungba88 - not intrusive at all! I ended up not having more time to work on this.
I like this approach. I think it is a good way to maintain the abstraction between kNN codec formats and queries, where the query does not need to know any specifics of the quantization being done.
I had a simple PoC to demonstrate this idea (a test has passed as well). I created a separate Query for the sake of simplicity, but it also proves that Lucene already enables this idea with (almost) no changes needed in core.
Relatedly, I got to this paper today. It mainly talks about integrating RaBitQ with graph search, but the part about reducing random access intrigues me. The author designs a way to avoid an explicit re-ranking step (and thus random access) by using only approximate distances for graph search while using exact distances (with raw vectors) for the nearest-neighbor heap. It also stores the raw vectors compactly alongside the quantized vectors to avoid the random access (we currently use seek + readFloats). I haven't looked deeply into the paper, though.
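For reference, the "seek + readFloats" pattern mentioned there, sketched against Lucene's IndexInput; the contiguous layout (exactly dim floats per ordinal, no header) is a simplifying assumption:

```java
import java.io.IOException;
import org.apache.lucene.store.IndexInput;

// One random read per re-ranked candidate: seek to the candidate's raw vector,
// then bulk-read its floats. This per-candidate random access is exactly what
// the paper tries to avoid by co-locating raw and quantized vectors.
static float[] readRawVector(IndexInput vecData, int ord, int dim) throws IOException {
  float[] v = new float[dim];
  vecData.seek((long) ord * dim * Float.BYTES);
  vecData.readFloats(v, 0, dim);
  return v;
}
```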
@dungba88 I left a comment on your PoC. It seems to me the best abstraction is a new query that doesn't inherit from AbstractKnnVectorQuery. Instead, it's just a new query that requires a KNN query as its parameter. Then you can gather the quantized-scored kNN docs during rewrite.
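A compilable skeleton of what that composition could look like; the class name, constructor shape, and placeholder rewrite are illustrative and are not the actual classes from the eventual PR:

```java
import java.io.IOException;
import java.util.Objects;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryVisitor;

// The rescoring query owns an arbitrary KNN query (dependency injection)
// instead of extending AbstractKnnVectorQuery.
public final class RescoreQuerySketch extends Query {
  private final Query innerKnn; // e.g. a KnnFloatVectorQuery asked for r*k hits
  private final float[] target; // full-precision query vector for the second pass
  private final int k;          // final number of hits to keep

  public RescoreQuerySketch(Query innerKnn, float[] target, int k) {
    this.innerKnn = Objects.requireNonNull(innerKnn);
    this.target = target;
    this.k = k;
  }

  @Override
  public Query rewrite(IndexSearcher searcher) throws IOException {
    // The inner rewrite runs the approximate (quantized) search and collapses
    // into a doc-id/score query over the oversampled hits. A real implementation
    // would then re-score those hits with full-precision vectors and return
    // another doc-id/score query over the best k; that pass is omitted here.
    return innerKnn.rewrite(searcher);
  }

  @Override
  public void visit(QueryVisitor visitor) {
    innerKnn.visit(visitor);
  }

  @Override
  public String toString(String field) {
    return "RescoreQuerySketch(" + innerKnn.toString(field) + ", k=" + k + ")";
  }

  @Override
  public boolean equals(Object o) {
    return sameClassAs(o)
        && innerKnn.equals(((RescoreQuerySketch) o).innerKnn)
        && k == ((RescoreQuerySketch) o).k;
  }

  @Override
  public int hashCode() {
    return Objects.hash(classHash(), innerKnn, k);
  }
}
```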
Thanks @benwtrent for the feedback! I managed to change the abstraction from inheritance to composition in the next rev. We need to create the DocAndScoreQuery twice (once during the original Query rewrite, and once during the re-ranking Query rewrite), but now we only need to rerank r * k documents instead of r * k * num_segments. Let me know what you think. If it (almost) looks good, I can switch to creating a PR on the main repo, probably adding more tests.
@dungba88 feel free to open a PR against the main repo. I can do a fuller review once it's against Lucene. Utilizing a query that wraps things via dependency injection is way better than inheritance and better fits the API.
Thanks @benwtrent, I have published a PR here: #14009
I think there are still 2 issues to address:
Description
With quantization techniques compressing vectors in memory further and further, recall is going to drop because of how much information is lost. However, with the current quantization support, we already store the full-precision float vector values. With this, we could create a two-phase search process that oversamples vectors from the quantized index (i.e. gathers r*k results), then lazily loads the vectors from disk for those r*k results and re-scores them. This has been discussed directly or indirectly in a couple of different places, but I figured it'd make sense to create a separate issue for it:
It has been shown to work with binary compression techniques and some PQ in a few different places (https://medium.com/qdrant/hyperfast-vector-search-with-binary-quantization-865052c5c259, https://huggingface.co/blog/embedding-quantization#binary-rescoring). We have also done some experiments in OpenSearch with IVFPQ and re-scoring, and it has shown pretty strong promise (assuming that the disk IOPS/throughput are strong enough - see opensearch-project/k-NN#1779 (comment)).
That being said, it can provide similar benefits to the DiskANN approach.
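To make the two-phase flow concrete, a minimal sketch using stock Lucene queries; the field name "vec" and the rescoreWithFullPrecision helper (standing in for the full-precision second pass, as sketched earlier in the thread) are assumptions for illustration, not existing Lucene APIs:

```java
import java.io.IOException;
import java.util.Arrays;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnFloatVectorQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

static ScoreDoc[] twoPhaseSearch(IndexSearcher searcher, float[] target, int k, int r)
    throws IOException {
  // Phase 1: approximate search over the quantized index, oversampled to r*k.
  TopDocs oversampled = searcher.search(new KnnFloatVectorQuery("vec", target, r * k), r * k);
  // Phase 2: lazily load the full-precision vectors for those r*k hits from
  // disk and re-score them (hypothetical helper, not a Lucene API).
  rescoreWithFullPrecision(searcher.getIndexReader(), oversampled, target);
  // Reduce: keep only the best k hits by refined score.
  ScoreDoc[] hits = oversampled.scoreDocs.clone();
  Arrays.sort(hits, (a, b) -> Float.compare(b.score, a.score));
  return Arrays.copyOf(hits, Math.min(k, hits.length));
}
```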
One challenge I foresee is that the approach requires the quantized ANN index to be resident in memory. However, I am unsure whether loading the full-precision vectors from disk will cause the page cache to evict the quantized ANN index, or whether the page cache will naturally adapt to the access pattern. If it is the former, we would probably need some way to pin the quantized ANN index and its vectors in memory.
In the future, it could probably be fine-tuned quite a bit with access-pattern optimizations and/or some kind of hierarchical quantization and re-scoring.