Add refinement of quantized vector scores with fp distance calculations #13564
Comments
I think the API would be tricky, but I am all for this idea. The ability to "approximately score" and then do a second pass to "exact score" is useful for all sorts of levels of quantization. Whatever the design, it would be most efficient to first gather the nearest vectors from ALL segments with an approximate score, and then do a second pass over all segments to refine the top k. Rescoring per segment would be needlessly inefficient.
Yes, agree. I'll think on this a little bit. I'll start with a PoC and go from there.
Yes, I agree. However, with this, I think we would need to refine not the top k but the top r*k and then reduce to k. Otherwise, I don't think that recall would actually be improved - just the ordering might be better.
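(For concreteness, with made-up numbers: if k = 10 and the oversample factor r = 5, the first pass gathers the top 50 hits by approximate, quantized score across all segments, and the second pass re-scores those 50 with full-precision distances and keeps the best 10.)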
100%, I wasn't clear. Yes, over all segments, we gather some approximate scoring that is the top r*k, then refine and reduce to the top k.
Makes sense, thanks @benwtrent. I'm working on a PoC and some experiments. I didn't realize that the full-precision vectors for a quantized index are exposed via getFloatVectorValues. That should make things simpler. I was also looking at this comment for DiskANN from @kevindrosendahl (#12615 (comment)) and noticed that mlock'ing parts of the index in memory became really important for performance in lower-memory environments. I don't think this is currently supported in Lucene, but I will take a look.
Hi all, I was looking at this idea for some experimentation (not meant to be intrusive to the ongoing effort). The full-sized vectors being exposed through getFloatVectorValues should help here. However, I don't know how to get the full-sized vectors without a LeafReaderContext. Do you have an idea how we can achieve that? Or maybe we can just reduce the oversample to account for the number of segments.
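(One illustrative reading of that last suggestion, with made-up numbers: if r*k = 80 and the index has 8 segments, each per-segment search would be asked for only 80 / 8 = 10 oversampled candidates, so the re-ranking work would not multiply with the segment count.)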
Exact search will hopefully just be using the quantized values as well.
I don't see why that would be a problem. As I understand it, you would need to iterate the segments twice regardless.
However, the query shouldn't bother attempting to rerank results that are considered "accurate". So, the collector will likely need to be told "Hey, these are estimations", and then a second pass over the top docs there would be possible.
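A minimal sketch of what that second, full-precision pass could look like, assuming a Lucene 9.x-style FloatVectorValues iterator, a field named "vec", and dot-product similarity; all of these choices are illustrative and not from the thread's actual PoC:

```java
import java.io.IOException;
import java.util.List;
import org.apache.lucene.index.FloatVectorValues;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.ReaderUtil;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

static void rescoreWithFullPrecision(IndexReader reader, TopDocs approx, float[] target)
    throws IOException {
  List<LeafReaderContext> leaves = reader.leaves();
  for (ScoreDoc hit : approx.scoreDocs) {
    // Map the top-level doc id back to its segment.
    LeafReaderContext ctx = leaves.get(ReaderUtil.subIndex(hit.doc, leaves));
    int segmentDoc = hit.doc - ctx.docBase;
    // Full-precision vectors stay readable even when the index is quantized
    // (the getFloatVectorValues observation made earlier in the thread).
    FloatVectorValues values = ctx.reader().getFloatVectorValues("vec");
    if (values != null && values.advance(segmentDoc) == segmentDoc) {
      hit.score = VectorSimilarityFunction.DOT_PRODUCT.compare(target, values.vectorValue());
    }
  }
  // Callers would then re-sort approx.scoreDocs by the refined score and keep the top k.
}
```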
Hi @benwtrent, thanks for the reply. It seems I wasn't clear enough in my previous question.
Update: It seems the raw (full-precision) vectors are stored in QuantizedVectorValues along with the quantized values.
Maybe I misunderstood something, but I thought the idea is to re-rank with full-precision vectors over the oversampled matches?
I also see where my misunderstanding came from now.
Hey @dungba88 - not intrusive at all! I ended up not having more time to work on this.
I like this approach. I think it is a good way to maintain the abstraction between kNN codec formats and queries, where the query does not need to know any specifics of the quantization being done.
I had a simple PoC to demonstrate this idea (a test has passed as well). I created a separate Query for the sake of simplicity, but it also proves that Lucene already enables this idea with (almost) no changes needed in core.
Relatedly, I got to this paper today. It mainly talks about integrating RaBitQ with graph search, but the part about reducing random access intrigues me. The author designs a way to avoid an explicit re-ranking step (and thus random access) by using only approximate distances for graph search while using exact distances (with raw vectors) for the nearest-neighbor heap. It also stores the raw vectors compactly alongside the quantized vectors to avoid the random access (we currently use seek + readFloats). I haven't looked deeply into the paper, though.
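For reference, the "seek + readFloats" pattern mentioned there, sketched against Lucene's IndexInput; the contiguous layout (exactly dim floats per ordinal, no header) is a simplifying assumption:

```java
import java.io.IOException;
import org.apache.lucene.store.IndexInput;

// One random read per re-ranked candidate: seek to the candidate's raw vector,
// then bulk-read its floats. This per-candidate random access is exactly what
// the paper tries to avoid by co-locating raw and quantized vectors.
static float[] readRawVector(IndexInput vecData, int ord, int dim) throws IOException {
  float[] v = new float[dim];
  vecData.seek((long) ord * dim * Float.BYTES);
  vecData.readFloats(v, 0, dim);
  return v;
}
```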
@dungba88 I left a comment on your PoC. It seems to me the best abstraction is a new query that doesn't inherit from AbstractKnnVectorQuery. Instead, it's just a new query that requires a KNN query as its parameter. Then you can gather the quantized-scored kNN docs during rewrite.
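A compilable skeleton of what that composition could look like; the class name, constructor shape, and placeholder rewrite are illustrative and are not the actual classes from the eventual PR:

```java
import java.io.IOException;
import java.util.Objects;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryVisitor;

// The rescoring query owns an arbitrary KNN query (dependency injection)
// instead of extending AbstractKnnVectorQuery.
public final class RescoreQuerySketch extends Query {
  private final Query innerKnn; // e.g. a KnnFloatVectorQuery asked for r*k hits
  private final float[] target; // full-precision query vector for the second pass
  private final int k;          // final number of hits to keep

  public RescoreQuerySketch(Query innerKnn, float[] target, int k) {
    this.innerKnn = Objects.requireNonNull(innerKnn);
    this.target = target;
    this.k = k;
  }

  @Override
  public Query rewrite(IndexSearcher searcher) throws IOException {
    // The inner rewrite runs the approximate (quantized) search and collapses
    // into a doc-id/score query over the oversampled hits. A real implementation
    // would then re-score those hits with full-precision vectors and return
    // another doc-id/score query over the best k; that pass is omitted here.
    return innerKnn.rewrite(searcher);
  }

  @Override
  public void visit(QueryVisitor visitor) {
    innerKnn.visit(visitor);
  }

  @Override
  public String toString(String field) {
    return "RescoreQuerySketch(" + innerKnn.toString(field) + ", k=" + k + ")";
  }

  @Override
  public boolean equals(Object o) {
    return sameClassAs(o)
        && innerKnn.equals(((RescoreQuerySketch) o).innerKnn)
        && k == ((RescoreQuerySketch) o).k;
  }

  @Override
  public int hashCode() {
    return Objects.hash(classHash(), innerKnn, k);
  }
}
```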
Thanks @benwtrent for the feedback! I managed to change the abstraction from inheritance to composition in the next rev. We need to create the DocAndScoreQuery twice (once during the original Query rewrite, and once during the re-ranking Query rewrite), but now we only need to rerank r * k documents instead of r * k * num_segments. Let me know what you think. If it (almost) looks good, I can switch to creating a PR on the main repo, probably adding more tests.
@dungba88 feel free to open a PR against the main repo. I can do a fuller review once it's against Lucene. Utilizing a query that wraps things via dependency injection is way better than inheritance and better fits the API.
Thanks @benwtrent, I have published a PR here: #14009
I think there are still 2 issues to address:
Description
With quantization techniques compressing vectors in memory further and further, recall is going to drop because of how much information is lost. However, with the current quantization support, we already store the full-precision float vector values. With this, we could create a two-phase search process that oversamples vectors from the quantized index (i.e. gathers r*k results), then lazily loads the vectors from disk for those r*k results and re-scores them. This has been discussed directly or indirectly in a couple of different places, but I figured it'd make sense to create a separate issue for it:
It has been shown to work with binary compression techniques and some PQ in a few different places (https://medium.com/qdrant/hyperfast-vector-search-with-binary-quantization-865052c5c259, https://huggingface.co/blog/embedding-quantization#binary-rescoring). We have also done some experiments in OpenSearch with IVFPQ and re-scoring, and it has shown pretty strong promise (assuming that the disk IOPS/throughput are strong enough - see opensearch-project/k-NN#1779 (comment)).
That being said, it can provide similar benefits to the DiskANN approach.
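To make the two-phase flow concrete, a minimal sketch using stock Lucene queries; the field name "vec" and the rescoreWithFullPrecision helper (standing in for the full-precision second pass, as sketched earlier in the thread) are assumptions for illustration, not existing Lucene APIs:

```java
import java.io.IOException;
import java.util.Arrays;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnFloatVectorQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

static ScoreDoc[] twoPhaseSearch(IndexSearcher searcher, float[] target, int k, int r)
    throws IOException {
  // Phase 1: approximate search over the quantized index, oversampled to r*k.
  TopDocs oversampled = searcher.search(new KnnFloatVectorQuery("vec", target, r * k), r * k);
  // Phase 2: lazily load the full-precision vectors for those r*k hits from
  // disk and re-score them (hypothetical helper, not a Lucene API).
  rescoreWithFullPrecision(searcher.getIndexReader(), oversampled, target);
  // Reduce: keep only the best k hits by refined score.
  ScoreDoc[] hits = oversampled.scoreDocs.clone();
  Arrays.sort(hits, (a, b) -> Float.compare(b.score, a.score));
  return Arrays.copyOf(hits, Math.min(k, hits.length));
}
```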
One challenge I foresee is that the approach requires the quantized ANN index to be resident in memory. However, I am unsure whether loading the full-precision vectors from disk will cause the page cache to evict the quantized ANN index, or whether the page cache will naturally adapt to the access pattern. If it is the former, we would probably need some way to pin the quantized ANN index and its vectors in memory.
In the future, it could probably be fine-tuned quite a bit with access-pattern optimizations and/or some kind of hierarchical quantization and re-scoring.