[RFC] Lucene based kNN search support in core OpenSearch #3545
Comments
I like having choices, and appreciate having to think about memory/performance tradeoffs, so +1 to exposing a Lucene implementation. Regarding the questions, I can speak to (2). We had implemented brute-force k-nn with euclidean distance, then LSH, then nn-descent at Artsy. The distance measure works differently for different types of data (the nn-descent paper suggests using cosine similarity for text, l2 distance of color histograms for images, for example), so I definitely think the ability to choose is important and L2 is definitely not a good default metric in all cases. |
Thank you for raising the question of the default metric, @dblock. Our suggestion of L2 as the default is based mainly on the fact that it is the default in the Lucene HNSW implementation. We may need to run additional benchmarks for the different metrics supported by Lucene (in addition to L2, it supports cosine similarity and dot product) and make a more data-driven decision on the default metric. |
I think as long as users can choose the metric, the default should always match Lucene's default. 💯 . |
I'm not sure about the default:
I think it should be:
If a user wants to skip building knn specific index structures (which can be expensive to build) for cases like painless scripting, I think they should be able to set "knn.enabled = false" and have it default to true. |
Yes, the format without a knn structure is the default. In this case the system will skip creating the knn index structures. I was referring to the one with an empty knn as the minimal form that supports knn search. Let me update the doc to highlight this. |
My problem is more around having users submit an empty map for default knn:
My feeling is, use knn default when "knn" is not specified:
If a user does not want to build knn structures, do something like:
|
I have doubts about this approach. I think we're assuming that if the knn structure is not specified, then we do not want to perform knn on that data, but the example with the painless script shows that this is not always the case. Having |
@martin-gaievski we may want to consider a different name for the field type. It seems Elasticsearch has already introduced it (as an X-Pack feature) [1], and I am afraid we are asking for trouble ... :( [1] https://www.elastic.co/guide/en/elasticsearch/reference/7.10/dense-vector.html |
I see where you're coming from. We actually wanted a name that is recognizable by the community. We've reviewed multiple options and |
No objection from my side, we have a similar dilemma with [1] #1018 |
We want to step back a little and do an additional evaluation of possible solutions. In particular, the functionality of |
A new section, “Revised Approach”, has been added. Its main suggestion is to expose the Lucene HNSW implementation through the existing knn_vector field and to avoid creating a new dense_vector field.
I'm happy to see this issue. I think it's one of long-term strategic importance for OpenSearch, given the current momentum of neural search in industry and academia. I think it's perfectly acceptable to introduce the functionality as part of the k-NN plugin. |
Big thanks to all participants. We're done collecting feedback, and the release version is practically finished. Please communicate suggestions and bug reports via GitHub issues. |
@martin-gaievski It isn't mentioned in the updated proposal but based off the current documentation for the knn-plugin found here it looks like the plan to support "pre-filtering" that was in the initial proposal was removed. Is this a documentation issue or is it still the case that all filtering is only done on the top k documents after they are retrieved? |
We are still planning to add support for filtering on top of the Lucene knn engine; it is currently planned for 2.4. The way it is implemented in Lucene makes it a hybrid of pre- and post-filtering, but in any case the top k elements are guaranteed. This is due to the details of the pre-filtering implementation: Lucene keeps traversing the HNSW graph and adding vectors to the result until it finds the top k vectors. |
@martin-gaievski this is great! How are you currently benchmarking kNN search? Also, any plans for OpenSearch-Benchmark to add kNN support? |
Hi, |
This document proposes integrating Lucene 9.0’s recent addition of Approximate Nearest Neighbor search into OpenSearch.
Problem Statement
k-NN Search has gained popularity over recent years with the growing use of ML. A few applications include recommendation systems, NLP services, and computer vision.
A brute force nearest neighbor search consists of calculating the distance from a query point to every point in the index and returning the “k” index points that have the smallest distance. A point can be a dense vector, a sparse vector or a binary code; however, in this document, we are only interested in dense vectors. The major drawback of this algorithm is that it is unable to search indices with millions or billions of vectors efficiently. To address this, a new problem was formulated: Approximate Nearest Neighbor search, or ANN for short. The general idea is to sacrifice search quality and index latency to drastically improve search latency. Several algorithms have been created, including Hierarchical Navigable Small World (HNSW), Inverted File (IVF), and Locality Sensitive Hashing (LSH).
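The brute force search described above can be sketched in a few lines of Python. This is only an illustration, not code from Lucene or the k-NN plugin; the function names (`brute_force_knn`, `l2_distance`, `cosine_distance`) are made up for this example, and the cosine variant assumes non-zero vectors.

```python
import heapq
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def l2_distance(a: Vector, b: Vector) -> float:
    """Euclidean (L2) distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def brute_force_knn(
    query: Vector,
    index: Sequence[Vector],
    k: int,
    distance: Callable[[Vector, Vector], float] = l2_distance,
) -> List[Tuple[int, float]]:
    """Score every indexed vector and keep the k with the smallest distance.

    Returns (position_in_index, distance) pairs, nearest first.
    """
    scored = ((doc_id, distance(query, vec)) for doc_id, vec in enumerate(index))
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    points = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
    print(brute_force_knn([1.0, 1.0], points, k=1))
    print(brute_force_knn([1.0, 1.0], points, k=1, distance=cosine_distance))
```

Every query touches every vector, so the cost grows linearly with index size, which is the drawback ANN algorithms trade recall against. The two calls in the usage section also show that the choice of metric can change which point counts as nearest.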
Several different libraries exist that implement these algorithms, including nmslib and faiss. The libraries are similar to Lucene in that they build indices that can be searched and serialized — however, they are built specifically for vector search. The k-NN plugin integrates these libraries into OpenSearch so that OpenSearch users can perform ANN search workloads.
The approach the plugin takes has a few drawbacks:
Proposed Solution
TLDR: Build a new field type and query type in OpenSearch to support ANN search through Lucene.
Lucene 9’s introduction of Approximate Nearest Neighbor Support
In the 9.0 release, Lucene added support for building indices with dense vectors and running ANN search on them. In order to create indices with vectors, a new Codec format was introduced, KnnVectorsFormat. This format is used to encode and decode numeric vector fields and execute Nearest Neighbor search. In Lucene 9.0, the default format used is Lucene90HnswVectorsFormat. This format implements the HNSW algorithm for ANN search. During indexing, this format will create 2 separate segment files: one for the vectors and one for the HNSW graph structure, allowing the vectors to exist off heap.
Lucene’s implementation of HNSW takes two parameters at index time: max_connections and beam_width. max_connections sets a ceiling on the number of connections a node in the graph can have. beam_width controls the size of the candidate list for neighbors of a new node added to the graph. These parameters map to the m and ef_construction parameters from the original paper, respectively. They are set in the constructor of the HNSW vectors format (Lucene91HnswVectorsFormat in Lucene 9.1). The Codec supports KnnVectorsFormat at the field level, which allows setting these parameters per field.
For search, a new query was introduced, the KnnVectorQuery. This query contains the query vector as well as k, the number of results a query on the HNSW graph should return. For search, the HNSW algorithm has one parameter, ef_search. In Lucene’s implementation this value is hardcoded to k.
Integration with OpenSearch
In order to integrate this new functionality into OpenSearch, we need to introduce a new field type, dense_vector. This field type would allow users to index floating point arrays of uniform length into OpenSearch and build the index structures needed to support ANN search with Lucene.
Because we will initially support only the Lucene HNSW implementation, the format above can be simplified to the following form:
For the example request, the system will use Lucene HNSW with the L2 metric, max_connections equal to 16, and beam_width equal to 100.
The minimal default mapping for dense_vector does not include the knn structure and is shown below. For such a definition, the proposed knn based on Lucene HNSW will not be enabled and the related graph data structures will not be created; only vector values are stored.
In addition to the field type, we are adding support for a new query type named knn_query.
Lucene's implementation of HNSW and its knn vector type have certain limitations that we have to accept with this integration: the maximum number of dimensions for a knn vector is 1024, which implies the same limit for the new dense_vector field type.
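As a rough illustration of how the uniform-length requirement and the 1024-dimension limit could be checked before indexing, here is a minimal Python sketch. The helper name `validate_vector` and the constant `MAX_DIMENSION` are hypothetical and are not part of OpenSearch or Lucene; only the value 1024 comes from the text above.

```python
from typing import List, Sequence

# Limit stated in the proposal for Lucene's knn vector type.
MAX_DIMENSION = 1024


def validate_vector(vector: Sequence[float], dimension: int) -> List[float]:
    """Check a vector against the field's configured dimension.

    Raises ValueError if the configured dimension is out of range or the
    vector length does not match it; otherwise returns the values as floats.
    """
    if not 1 <= dimension <= MAX_DIMENSION:
        raise ValueError(
            f"dimension must be between 1 and {MAX_DIMENSION}, got {dimension}"
        )
    if len(vector) != dimension:
        raise ValueError(
            f"expected a vector of length {dimension}, got {len(vector)}"
        )
    return [float(value) for value in vector]


if __name__ == "__main__":
    print(validate_vector([1, 2, 3, 4], dimension=4))
```

A field configured with a dimension above the limit would be rejected up front, rather than failing later when the vector reaches Lucene.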
Performance
The team has benchmarked a prototype of the solution based on integrating Lucene 9.1 with OpenSearch 2.0.
For the benchmark, ef_construction (k-NN plugin) and beam_width (Lucene) were set to 512, and m (k-NN plugin) and max_connections (Lucene) were varied in the range from 4 to 96.
Benchmark results showed that the Lucene 9.1 solution is comparable with the existing k-NN HNSW implementation based on nmslib, with certain tradeoffs: the Lucene 9.1 solution cannot reach high recall values close to 1.0, but when recall values are comparable with the existing k-NN plugin, the Lucene 9.1 solution consumes fewer resources and has better query latencies.
We have observed some trends in the benchmark results: for the same high recall values, the k-NN index takes more memory than the Lucene implementation requires. Query latencies and memory allocated per query are also higher for the k-NN implementation.
The solution is stable under a constant query load that is typical for the k-NN plugin.
The obtained metrics are rough, as they are based on POC code integrating Lucene 9.1 with OpenSearch 2.0 without any optimizations. For the production-ready version we are planning to have more accurate metrics.
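For readers unfamiliar with the recall figure used in these comparisons, the sketch below shows one common way to compute recall@k: the fraction of the true k nearest neighbors that the approximate search returned. The function name `recall_at_k` is made up here, and this is not necessarily the exact formula the benchmark tooling used.

```python
from typing import Hashable, Sequence


def recall_at_k(
    approx_ids: Sequence[Hashable],
    exact_ids: Sequence[Hashable],
    k: int,
) -> float:
    """Fraction of the exact top-k neighbors found in the approximate top-k."""
    exact = set(exact_ids[:k])
    if not exact:
        return 0.0
    found = set(approx_ids[:k]) & exact
    return len(found) / len(exact)


if __name__ == "__main__":
    exact = [1, 2, 3, 4]   # ids from an exact (brute force) search
    approx = [1, 2, 3, 9]  # ids from an ANN search over the same data
    print(recall_at_k(approx, exact, k=4))
```

A recall of 1.0 means the approximate search returned exactly the same neighbors as a brute force search.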
Pre-filtering feature
Lucene's implementation of HNSW gives the additional benefit of pre-filtering on queries. The current implementation in k-NN provides post-filtering: documents are first selected by applying the HNSW algorithm, and then the filter is applied on top of the HNSW results. In general, the results are not exactly correct because of the order in which the processing steps are applied; for example, fewer than k documents may remain after filtering.
Lucene 9.1 applies the filter on live docs and then selects the k closest results by applying the HNSW algorithm.
At the moment, we are planning to deliver the pre-filtering feature “as is”, with no changes or additions to what is implemented in Lucene.
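The difference between the two orderings can be shown with a toy Python sketch. It uses an exact sort as a stand-in for the HNSW search, so it says nothing about how Lucene traverses the graph; the document tuples and the function names are invented for this illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple

# (doc_id, vector, tag): a toy document; the tag is what the filter checks.
Doc = Tuple[int, Sequence[float], str]


def l2(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query: Sequence[float], docs: Sequence[Doc], k: int) -> List[Doc]:
    """Exact nearest-neighbor selection, standing in for an HNSW search."""
    return sorted(docs, key=lambda doc: l2(query, doc[1]))[:k]


def post_filter(query, docs, k, keep: Callable[[Doc], bool]) -> List[int]:
    """Search first, then drop results that fail the filter."""
    return [doc[0] for doc in top_k(query, docs, k) if keep(doc)]


def pre_filter(query, docs, k, keep: Callable[[Doc], bool]) -> List[int]:
    """Filter first, then search only among the matching documents."""
    return [doc[0] for doc in top_k(query, [d for d in docs if keep(d)], k)]


if __name__ == "__main__":
    docs = [
        (0, [0.0], "red"),
        (1, [1.0], "blue"),
        (2, [2.0], "red"),
        (3, [3.0], "blue"),
        (4, [4.0], "blue"),
    ]
    is_blue = lambda doc: doc[2] == "blue"
    print(post_filter([0.0], docs, 2, is_blue))  # only one blue doc survives
    print(pre_filter([0.0], docs, 2, is_blue))   # two blue docs are returned
```

With post-filtering, the two nearest documents are picked first and one of them is then discarded, so the caller gets fewer than k hits. With pre-filtering, the search only considers matching documents and can still return k results when enough of them exist.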
Impact on k-NN plugin
The existing knn_vector field supported by the k-NN plugin will be independent of the dense_vector field introduced in core OpenSearch. This change will not break any functionality of the plugin, and customers will be able to upgrade their existing indices created in earlier k-NN versions, taking into account the limitation on maximum vector dimensions.
From a migration standpoint, users would be able to reindex their knn_vector indices into indices that use the new dense_vector field type and vice versa.
In order to share the knn query type with OpenSearch, the plugin’s Codec would be required to implement the new KnnVectorsFormat. More details on this will be available in the plugin’s repository.
Revised Approach
Overview
After some offline discussions, we have decided to propose a pivot to our plan to support Lucene k-NN.
Introducing a new data type, dense_vector, and a new query type, knn_query, when we already have the data type knn_vector and the query type knn in the plugin will be a source of confusion for users. Having two data types with very similar end user functionality will raise questions such as “Why aren’t they the same?”, “When should I use what?”, etc. Instead of introducing dense_vector, we are proposing to capture the Lucene functionality inside the existing knn_vector field type in the plugin. Then, we would use the existing knn query type as well. The interface would be consistent with our current knn_vector.
As a consequence of this change, because only one field mapping parser can be registered to a field name (as well as query parser to query name), we would need to move the implementation to the k-NN plugin. We would still target the 2.2 release.
Using existing knn_vector field type instead of new dense_vector
Original Proposal Limitations
User Experience
The main limitation to this approach is the confusing user experience that it creates. We would have 2 different data types that do very similar things: knn_vector and dense_vector. Each would support indexing dense vectors and running approximate Nearest Neighbor searches. In addition to this, we would have 2 different query types that would do a similar thing: knn_query and knn.
The decision to make a new data type and query type was originally made to separate this functionality from the architecture-specific nature of the k-NN plugin. However, on closer examination, a user could use the k-NN plugin without the native libraries on a different platform, as long as we update the plugin build infrastructure to decouple platform-dependent components from platform-independent components.
Updated Proposal
To solve the problem of user confusion, we would build Lucene ANN support into our current knn_vector interface:
The interface would stay the same except for adding lucene as a new engine to support. We have evaluated the technical feasibility of this approach and deemed it feasible.
With this, the query would stay the same as our current query for the plugin:
Note that in the future, we could update these interfaces as we see fit. For more information on the current interface, please take a look at the k-NN documentation.
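To make the revised interface more concrete, here is a hedged Python sketch that builds a mapping body and a query body as plain dictionaries. The JSON layout follows my reading of the k-NN plugin documentation (a knn_vector field with a method object, and a knn query keyed by field name) and should be checked against that documentation. The helper names, the field name my_vector, the dimension, and the default parameter values are illustrative assumptions; the defaults simply mirror the max_connections = 16 and beam_width = 100 example earlier in this document, mapped to m and ef_construction.

```python
import json
from typing import Dict, Sequence


def knn_vector_index_body(
    field: str,
    dimension: int,
    m: int = 16,
    ef_construction: int = 100,
) -> Dict:
    """Index creation body with a knn_vector field using the lucene engine.

    Layout assumed from the k-NN plugin docs; values are illustrative.
    """
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                field: {
                    "type": "knn_vector",
                    "dimension": dimension,
                    "method": {
                        "name": "hnsw",
                        "space_type": "l2",
                        "engine": "lucene",
                        "parameters": {
                            "m": m,
                            "ef_construction": ef_construction,
                        },
                    },
                }
            }
        },
    }


def knn_query_body(field: str, vector: Sequence[float], k: int) -> Dict:
    """Search body using the plugin's existing knn query type."""
    return {
        "size": k,
        "query": {"knn": {field: {"vector": list(vector), "k": k}}},
    }


if __name__ == "__main__":
    print(json.dumps(knn_vector_index_body("my_vector", 4), indent=2))
    print(json.dumps(knn_query_body("my_vector", [2.0, 3.0, 5.0, 6.0], 2), indent=2))
```

The only change from an nmslib- or faiss-backed field in this sketch is the engine value, which is the point of the revised proposal: the same field type and query type serve all engines.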
Updated Proposal Limitations
Initial Coupling to Architecture Dependent Nature of Plugin
One limitation with the updated proposal is that it couples the Lucene ANN platform agnostic functionality with the plugin’s platform dependence.
However, a user who wants to run the plugin on a platform that the plugin does not support could install the plugin zip without the libraries and use the Lucene engine. As it stands right now, if they tried to use nmslib or faiss, the process would crash.
Overall, we would solve this problem by decoupling the JNI library from the plugin and providing a graceful failure if a user tries to use the JNI components on an unsupported platform. Additionally, we would decouple the build process as well.
Difficulty integrating with other engine plugins
Another limitation lies in a more technical nuance. From the user perspective, the consequence is that they could not create a Lucene k-NN index with another Engine plugin feature like CCR.
This is because we have to pass the HNSW parameters to the Lucene KnnVectorsFormat during construction. From a plugin, we have no way to add this to the default Codec that OpenSearch will use. Therefore, we will need to use our own custom codec. However, two custom codecs cannot be used at the same time for an index.
In the short term, this will mean users cannot use the Lucene KNN with another feature like CCR. This limitation is currently present for the plugin as well.
Related github issues:
Feedback Requested
Any general comments about the overall direction are welcome. Some specific questions:
Dot Product and Cosine Similarity