
Add refinement of quantized vector scores with fp distance calculations #13564

Open

jmazanec15 opened this issue Jul 11, 2024 · 17 comments · May be fixed by #14009

@jmazanec15
Contributor

Description

As quantization techniques compress vectors in memory further and further, recall will drop because of how much information is lost. However, with the current quantization support, we already store the full-precision float vector values. With this, we could create a two-phase search process that oversamples vectors from the quantized index (i.e. r*k results), then lazily loads the vectors for those r*k results from disk and re-scores them. This has been discussed directly or indirectly in a couple of different places, but I figured it'd make sense to create a separate issue for it.
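
To make the flow concrete, here is a minimal sketch of the two-phase idea; searchQuantized and readRawVector are illustrative stand-ins (not existing Lucene APIs) for the quantized ANN search and the lazy full-precision read:

```java
import java.io.IOException;
import java.util.Arrays;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.util.VectorUtil;

// Sketch only: oversample r*k candidates from the quantized index, then
// re-score them with full-precision vectors loaded lazily from disk.
static ScoreDoc[] twoPhaseSearch(float[] query, int k, int r) throws IOException {
  TopDocs approx = searchQuantized(query, r * k);            // phase 1: quantized, in-memory
  ScoreDoc[] refined = new ScoreDoc[approx.scoreDocs.length];
  for (int i = 0; i < refined.length; i++) {
    int doc = approx.scoreDocs[i].doc;
    float[] raw = readRawVector(doc);                        // phase 2: lazy disk read
    refined[i] = new ScoreDoc(doc, VectorUtil.dotProduct(query, raw)); // exact fp score
  }
  Arrays.sort(refined, (a, b) -> Float.compare(b.score, a.score));
  return Arrays.copyOf(refined, Math.min(k, refined.length)); // reduce r*k back to top k
}
```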

The approach has been shown to work with binary compression techniques and some PQ in a few different places (https://medium.com/qdrant/hyperfast-vector-search-with-binary-quantization-865052c5c259, https://huggingface.co/blog/embedding-quantization#binary-rescoring). We have also run some experiments in OpenSearch with IVFPQ and re-scoring, and it has shown strong promise (assuming that the disk IOPS/throughput are strong enough - see opensearch-project/k-NN#1779 (comment)).

Additionally, it can provide benefits similar to the DiskANN approach.

One challenge I foresee is that the approach requires the quantized ANN index to be resident in memory. However, I am unsure whether loading the full-precision vectors from disk will cause the page cache to evict the quantized ANN index, or whether the page cache will naturally adapt to the access pattern. If it is the former, we would probably need some way to pin the quantized ANN index and its vectors in memory.

In the future, it could probably be fine-tuned quite a bit with access-pattern optimizations and/or some kind of hierarchical quantization and re-scoring.

@benwtrent
Member

I think the API would be tricky, but I am all for this idea. The ability to "approximately score" and then do a second pass to "exact score" is useful for all sorts of levels of quantization.

Whatever the design, it would be most efficient to first gather the nearest vectors from ALL segments with an approximate score, and then do a second pass over all segments to refine the top k.

Rescoring per segment would be needlessly inefficient.
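
Something like this shape, as a sketch; approxSearchSegment and rescoreExact are illustrative stand-ins, and the per-segment hits are assumed to carry global doc IDs (shifted by ctx.docBase):

```java
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.TopDocs;

// Sketch: collect approximate (quantized) hits per segment, merge once into a
// global top r*k, then run a single full-precision refinement pass over it.
static TopDocs gatherThenRefine(IndexReader reader, float[] queryVector, int k, int r)
    throws IOException {
  TopDocs[] perSegment = new TopDocs[reader.leaves().size()];
  for (LeafReaderContext ctx : reader.leaves()) {
    // stand-in: quantized search on one segment, doc IDs shifted by ctx.docBase
    perSegment[ctx.ord] = approxSearchSegment(ctx, queryVector, r * k);
  }
  TopDocs merged = TopDocs.merge(r * k, perSegment);   // global top r*k across segments
  return rescoreExact(reader, queryVector, merged, k); // one fp pass, reduce to k
}
```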

@jmazanec15
Contributor Author

I think the API would be tricky, but I am all for this idea

Yes, agreed. I'll think on this a little bit. I'll start with a PoC and go from there.

Whatever the design, it would be most efficient to first gather the nearest vectors from ALL segments with an approximate score, and then do a second pass over all segments to refine the top k.

Rescoring per segment would be needlessly inefficient.

Yes, I agree. However, with this, I think we would need to refine not the top k but the top r*k and then reduce to k. Otherwise, I don't think that recall would actually be improved; only the ordering might get better.

@benwtrent
Member

@jmazanec15

I think we would need to refine not the top k but the top r*k and then reduce to k

100%, I wasn't clear. Yes: over all segments, we gather approximate results totaling r*k when collapsed together, and then refine & rescore that total r*k.

@jmazanec15
Contributor Author

Makes sense, thanks @benwtrent. I'm working on a PoC and some experiments. I didn't realize that the full-precision vectors for a quantized index are exposed via getFloatVectorValues. That should make things simpler.

I was also looking at this comment on DiskANN from @kevindrosendahl (#12615 (comment)) and noticed that mlock'ing parts of the index in memory became really important for performance in low-memory environments. I don't think this is currently supported in Lucene, but I will take a look.

@dungba88
Contributor

Hi all, I was looking at this idea for some experimentation (not meaning to intrude on the ongoing effort).

If the full-sized vectors are exposed through getFloatVectorValues, then it seems we can just call exactSearch over the results of approximateSearch, but this would inefficiently re-rank k * oversample * num_segments matches instead of just k * oversample (with perhaps some improvement in recall).

However, I don't know how to get the full-sized vectors without a LeafReaderContext. Do you have an idea how we can achieve that?

Or maybe we can just reduce oversample to account for the number of segments, e.g. oversample = oversample / leafReaderContexts.size() (though segments may differ in size)?

@benwtrent
Member

then it seems we can just call exactSearch over the results of approximateSearch,

Exact search will hopefully just be using the quantized values as well.

However, I don't know how to get the full-sized vectors without a LeafReaderContext. Do you have an idea how we can achieve that?

I don't see why that would be a problem. As I understand it, you would need to iterate the segments twice regardless:

  • Once to get the top k as desired, with the oversampling
  • Once to apply the brute-force reranking (bypassing the "exactSearch" path, as that uses quantized values)

However, the query shouldn't bother attempting to rerank results that are considered "accurate".

So, the collector will likely need to be told "Hey, these are estimations", and then a second pass over the top docs there would be possible via something like if (collector.scoresAreEstimates).
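
Roughly, from the query side (scoresAreEstimates and rescoreTopDocs are hypothetical names from this discussion, not existing Lucene APIs):

```java
// Hypothetical flow: the collector marks its scores as estimates, and the
// query runs a second, exact pass over the collected top docs only.
TopDocs approx = collector.topDocs();              // top r*k with estimated (quantized) scores
if (collector.scoresAreEstimates()) {              // hypothetical flag from this discussion
  approx = rescoreTopDocs(approx, queryVector, k); // second pass with fp distances
}
```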

@dungba88
Contributor

dungba88 commented Nov 14, 2024

Hi @benwtrent

Thanks for the reply. It seems I wasn't clear enough in my previous question.

Regarding exactSearch, it was based on this comment from @jmazanec15. getFloatVectorValues() is used in exactSearch(), so I thought it would use full-precision vectors, but based on your comment, it seems it also uses quantized vectors? Since the re-ranking step needs full-precision vectors, how can I access them?

I didn't realize that the full-precision vectors for a quantized index are exposed via getFloatVectorValues

Update: It seems the raw (full-precision) vectors are stored in QuantizedVectorValues along with the quantized values. The vectorValue(int ord) method will return the raw vector.

@dungba88
Contributor

Apply the brute-force reranking (bypassing the "exactSearch" path, as that uses quantized values).

Maybe I misunderstood something, but I thought the idea was to re-rank the oversampled matches with full-precision vectors?

@dungba88
Contributor

I also see where my misunderstanding came from now: getFloatVectorValues() returns QuantizedVectorValues, which has two methods: vectorValue(int ord) returns the raw vector, while scorer(float[] target) uses the quantized vector for scoring.
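
So a per-segment rescoring pass might look roughly like this; ordForDoc is an illustrative doc-to-ordinal lookup, and the raw-vector behavior of vectorValue(int ord) is assumed as described above:

```java
import java.io.IOException;
import org.apache.lucene.index.FloatVectorValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.util.VectorUtil;

// Sketch: re-score one segment's candidates with the raw vectors, assuming
// (per the observation above) that vectorValue(int ord) returns the raw
// full-precision vector while the scorer uses the quantized one.
static void rescoreSegment(LeafReader leafReader, String field, float[] queryVector,
    TopDocs segmentTopDocs) throws IOException {
  FloatVectorValues values = leafReader.getFloatVectorValues(field);
  for (ScoreDoc sd : segmentTopDocs.scoreDocs) {
    int ord = ordForDoc(values, sd.doc);                // illustrative doc -> ord lookup
    float[] raw = values.vectorValue(ord);              // raw vector, not the quantized one
    sd.score = VectorUtil.dotProduct(queryVector, raw); // exact fp similarity
  }
}
```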

@jmazanec15
Contributor Author

jmazanec15 commented Nov 15, 2024

Hi all, I was looking at this idea for some experimentation (not meaning to intrude on the ongoing effort).

Hey @dungba88 - not intrusive at all! I ended up not having more time to work more on this.

So, the collector will likely need to be told "Hey, these are estimations", and then a second pass over the top docs there would be possible via something like if(collector.scoresAreEstimates)

I like this approach. I think it is a good way to maintain the abstraction between KNN codec formats and queries, where the query does not need to know any specifics about the quantization being done.

@dungba88
Contributor

Hey @dungba88 - not intrusive at all! I ended up not having more time to work more on this.

I put together a simple PoC to demonstrate this idea (tests pass as well). I created a separate Query for the sake of simplicity, but it also shows that Lucene already enables this idea with (almost) no change needed in core. The idea is to add a float oversample parameter to the constructor, which samples more matches and runs a second pass for FP reranking; a usage sketch follows below. The second pass currently has to be done at the segment level, as I still don't know how to get the raw FP vectors without the LeafReaderContext, so I can't compute vector similarity from the TopDocs alone.
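
For reference, usage would look roughly like this; RescoredKnnVectorQuery and its constructor shape are illustrative of the PoC design, not its exact API:

```java
// Illustrative usage of the PoC design: oversample from the quantized index
// by a factor, then re-rank the oversampled matches with fp distances.
float oversample = 3f; // fetch 3*k candidates in the first, quantized pass
Query query = new RescoredKnnVectorQuery("vector", target, k, oversample); // illustrative name
TopDocs hits = searcher.search(query, k); // top-k scores come from the fp second pass
```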

@dungba88
Contributor

Relatedly, I came across this paper today. It mainly talks about integrating RaBitQ with graph search, but the part about reducing random access intrigues me. The author designs a way to avoid an explicit re-ranking step (and thus random access) by using only approximate distances for graph traversal while using exact distances (with raw vectors) for the nearest-neighbors heap. It also stores the raw vector compactly alongside the quantized vectors to avoid random access (we are currently using seek + readFloats). I haven't looked deeply into the paper though.

@benwtrent
Member

@dungba88 I left a comment on your PoC. It seems to me the best abstraction is a new query that doesn't inherit from AbstractKnnVectorQuery. Instead, it's just a new query that takes a KNN query as a parameter. Then you can gather the quantized-scored KNN docs during rewrite.

@dungba88
Contributor

Thanks @benwtrent for the feedback! I managed to change the abstraction to composition in the next rev. We need to create the DocAndScoreQuery twice (once during the original Query rewrite, and once during the re-ranking Query rewrite), but now we only need to rerank r * k documents instead of r * k * num_segments; a skeleton of the shape is below.
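
Roughly the shape being discussed, as a skeleton; extractTopDocs and rescoreWithRawVectors are illustrative helpers, and the usual Query plumbing (equals, hashCode, visit, toString) is omitted:

```java
import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

// Skeleton of the composed query: wrap an existing KNN query, let its rewrite
// produce the oversampled quantized top docs, then re-score those r*k docs
// with the raw vectors and rewrite again into a doc-and-score query.
public class RescoreKnnVectorQuery extends Query {
  private final Query innerKnnQuery; // e.g. a KnnFloatVectorQuery fetching r*k hits
  private final float[] target;
  private final int k;

  public RescoreKnnVectorQuery(Query innerKnnQuery, float[] target, int k) {
    this.innerKnnQuery = innerKnnQuery;
    this.target = target;
    this.k = k;
  }

  @Override
  public Query rewrite(IndexSearcher searcher) throws IOException {
    Query rewritten = innerKnnQuery.rewrite(searcher); // first DocAndScoreQuery (quantized)
    TopDocs approximate = extractTopDocs(rewritten);   // illustrative helper
    TopDocs exact = rescoreWithRawVectors(searcher, approximate, target, k);
    return createRewrittenQuery(exact);                // second DocAndScoreQuery (fp scores)
  }
  // equals, hashCode, visit, toString omitted for brevity
}
```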

Let me know what you think. If it looks (almost) good, I can create a PR against the main repo, probably adding more tests.

@benwtrent
Member

@dungba88 feel free to open a PR against the main repo. I can do a fuller review once it's against Lucene. Using a query that wraps things via dependency injection is way better than inheritance and better fits the API.

@dungba88
Contributor

Thanks @benwtrent, I have published a PR here: #14009

@dungba88
Contributor

I think there are still 2 issues to address:

  • Prevent quantized vectors from being swapped out: loading full-precision vectors is costly and can cause the quantized vectors to be swapped out if the OS is under memory pressure. Maybe we can use something similar to mlock if the system supports it. But I guess this can be left to developers instead of having built-in support in the re-ranking Query.
  • The latency could be better. I'm still running a thorough benchmark with KnnGraphTester, but preliminary results show the re-ranking step adds quite a bit of latency. Maybe we can execute the re-ranking per segment in parallel (see the sketch after this list), or apply some other optimization. Another thing is that we run the rewrite phase and createRewrittenQuery twice: once for the main search phase and once for the re-ranking phase. I'm not sure how much overhead that introduces.
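
On the latency point, a sketch of running the per-segment re-ranking in parallel with a plain executor; rescoreSegment is an illustrative stand-in for the per-segment second pass:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.TopDocs;

// Sketch: run the fp re-ranking for each segment concurrently, then merge.
static TopDocs parallelRerank(IndexReader reader, float[] target, int k)
    throws InterruptedException, ExecutionException {
  ExecutorService executor = Executors.newFixedThreadPool(4);
  try {
    List<Future<TopDocs>> futures = new ArrayList<>();
    for (LeafReaderContext ctx : reader.leaves()) {
      futures.add(executor.submit(() -> rescoreSegment(ctx, target))); // stand-in
    }
    TopDocs[] perSegment = new TopDocs[futures.size()];
    for (int i = 0; i < futures.size(); i++) {
      perSegment[i] = futures.get(i).get();
    }
    return TopDocs.merge(k, perSegment); // global top k after re-ranking
  } finally {
    executor.shutdown();
  }
}
```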
