HSEARCH-5052 Filter out documents with vectors below a "similarity limit" #3876
Do you think this could be something useful to add?
For casual users, I very much doubt it's useful.
For advanced users... I can't say, I'm not familiar enough with the topic. I'm not sure the "similarity" value is meaningful enough for users to even be able to provide a meaningful limit, but then... I have little experience with this.
If you can find good use cases, and it's not too hard to support on Elasticsearch/OpenSearch, well... why not. But we'd have to be extra clear this is for very specific use cases.
...g/hibernate/search/integrationtest/backend/tck/search/predicate/KnnPredicateSpecificsIT.java
.../main/java/org/hibernate/search/backend/lucene/search/predicate/impl/LuceneKnnPredicate.java
Ok, this gives a rather interesting use case: https://www.elastic.co/guide/en/elasticsearch/reference/current/knn-search.html#knn-similarity-search

Basically it's not about relevance, it's about performance. A similarity filter stops the search early if the engine can't find any vector that is reasonably similar to the provided one, avoiding unnecessary computation. In other words, we know that vectors that have a similarity lower than a given number won't affect the score that much, so we don't even look for them.

So yes, I think you're right, this is a valuable feature. Please create a Jira?
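To make the "similarity limit" idea concrete, here is a minimal brute-force sketch in plain Java. It is not the code from this PR and it does not use Lucene or Hibernate Search APIs; the class `SimilarityLimitSketch`, its methods, and the choice of cosine similarity are assumptions made purely for illustration. It only shows the result semantics (candidates below the limit are dropped before the top `k` are picked), not the early termination a real engine performs during graph traversal.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only: not the Lucene or Hibernate Search implementation.
final class SimilarityLimitSketch {

    // Cosine similarity between two non-zero vectors of equal length.
    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns the indexes of at most k documents, most similar first,
    // keeping only those whose similarity to the query is >= limit.
    static List<Integer> knn(float[][] docs, float[] query, int k, double limit) {
        List<Integer> candidates = new ArrayList<>();
        for (int i = 0; i < docs.length; i++) {
            if (cosine(docs[i], query) >= limit) {
                candidates.add(i);
            }
        }
        // Sort by descending similarity (negate the key to reverse the order).
        candidates.sort(Comparator.comparingDouble((Integer i) -> -cosine(docs[i], query)));
        return new ArrayList<>(candidates.subList(0, Math.min(k, candidates.size())));
    }

    public static void main(String[] args) {
        float[][] docs = { {1, 0}, {0, 1}, {1, 1}, {-1, 0} };
        float[] query = {1, 0};
        // Without a meaningful limit, k = 3 returns three documents.
        System.out.println(knn(docs, query, 3, -1.0)); // prints [0, 2, 1]
        // With a limit of 0.5, the orthogonal document 1 is filtered out.
        System.out.println(knn(docs, query, 3, 0.5));  // prints [0, 2]
    }
}
```

With the limit in place, a query can return fewer than `k` hits, which is the behavior the PR title describes.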
...ava/org/hibernate/search/backend/lucene/lowlevel/query/impl/VectorSimilarityFilterQuery.java
Thanks! LGTM, just one comment.
documentation/src/main/asciidoc/public/reference/_search-dsl-predicate.adoc
...ava/org/hibernate/search/backend/lucene/lowlevel/query/impl/VectorSimilarityFilterQuery.java
https://hibernate.atlassian.net/browse/HSEARCH-5052
Hey Yoann 😃
You've probably seen the discussion we had yesterday on Zulip here: https://infinispan.zulipchat.com/#narrow/stream/118645-infinispan/topic/vector.20search/near/407994976
I thought I'd give it a try and see what we can do to filter things out on our side. I've looked into creating custom queries, and that's what I've put together. Do you think this could be something useful to add?