HSEARCH-5052 Filter out documents with vectors below a "similarity limit" #3876

marko-bekhta · 2023-12-15T12:57:07Z

https://hibernate.atlassian.net/browse/HSEARCH-5052

Hey Yoann 😃
You've probably seen the discussion we had yesterday on Zulip here: https://infinispan.zulipchat.com/#narrow/stream/118645-infinispan/topic/vector.20search/near/407994976
I thought I'd give it a try and see what we can do to filter things out on our side; I've looked into creating custom queries, and this is what I've put together. Do you think this could be something useful to add?

hibernate-github-bot · 2023-12-15T12:57:11Z

Thanks for your pull request!

This pull request appears to follow the contribution rules.

› This message was automatically generated.

yrodiere

› Do you think this could be something useful to add?

For casual users, I very much doubt it's useful.

For advanced users... I can't say, I'm not familiar enough with the topic. I'm not sure the "similarity" value is meaningful enough for users to even be able to provide a meaningful limit, but then... I have little experience with this.

If you can find good use cases, and it's not too hard to support on Elasticsearch/OpenSearch, well... why not. But we'd have to be extra clear this is for very specific use cases.

...g/hibernate/search/integrationtest/backend/tck/search/predicate/KnnPredicateSpecificsIT.java

.../main/java/org/hibernate/search/backend/lucene/search/predicate/impl/LuceneKnnPredicate.java

yrodiere · 2023-12-19T09:52:46Z

› If you can find good use cases, and it's not too hard to support on Elasticsearch/OpenSearch, well... why not. But we'd have to be extra clear this is for very specific use cases.

Ok, this gives a rather interesting use case: https://www.elastic.co/guide/en/elasticsearch/reference/current/knn-search.html#knn-similarity-search

Basically it's not about relevance, it's about performance. A similarity filter stops the search early if the engine can't find any vector that is reasonably similar to the provided one, avoiding unnecessary computation. In other words, we know that vectors that have a similarity lower than a given number won't affect the score that much, so we don't even look for them.

So yes, I think you're right, this is a valuable feature. Please create a Jira?
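For readers following the thread, the semantics of a "similarity limit" can be sketched in plain Java. This is only an illustration under stated assumptions: the class and method names (`SimilarityFilterSketch`, `cosineSimilarity`, `filterBySimilarity`) are hypothetical, cosine similarity is assumed as the metric, and none of this is the code in the PR. It also scores every candidate, so it shows which documents a limit would drop, not the early termination that makes the feature attractive for performance.

```java
import java.util.ArrayList;
import java.util.List;

public class SimilarityFilterSketch {

    // Cosine similarity of two equal-length, non-zero vectors (assumed metric).
    public static double cosineSimilarity(float[] a, float[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns the indices of the candidates whose similarity to the query
    // is at least minSimilarity; everything below the limit is dropped.
    public static List<Integer> filterBySimilarity(
            float[] query, List<float[]> candidates, double minSimilarity) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < candidates.size(); i++) {
            if (cosineSimilarity(query, candidates.get(i)) >= minSimilarity) {
                kept.add(i);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        float[] query = {1f, 0f};
        List<float[]> candidates = List.of(
                new float[] {1f, 0f},   // same direction as the query
                new float[] {0f, 1f},   // orthogonal to the query
                new float[] {1f, 1f});  // 45 degrees from the query
        // prints [0, 2]
        System.out.println(filterBySimilarity(query, candidates, 0.5));
    }
}
```

A real backend would apply the limit while traversing its vector index rather than scoring every document, which is where the saving described above would come from.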

...ava/org/hibernate/search/backend/lucene/lowlevel/query/impl/VectorSimilarityFilterQuery.java

yrodiere

Thanks! LGTM, just one comment.

documentation/src/main/asciidoc/public/reference/_search-dsl-predicate.adoc

...ava/org/hibernate/search/backend/lucene/lowlevel/query/impl/VectorSimilarityFilterQuery.java

sonarqubecloud · 2024-01-22T10:05:49Z

Quality Gate failed

Failed conditions

65.6% Coverage on New Code (required ≥ 80%)

See analysis details on SonarCloud

yrodiere reviewed Dec 15, 2023

...g/hibernate/search/integrationtest/backend/tck/search/predicate/KnnPredicateSpecificsIT.java (outdated)

.../main/java/org/hibernate/search/backend/lucene/search/predicate/impl/LuceneKnnPredicate.java (outdated)

marko-bekhta force-pushed the feat/similarity-filter-knn branch from 0d41677 to 766afc9 Compare December 22, 2023 15:34

marko-bekhta commented Dec 22, 2023

...ava/org/hibernate/search/backend/lucene/lowlevel/query/impl/VectorSimilarityFilterQuery.java (outdated)

marko-bekhta force-pushed the feat/similarity-filter-knn branch from 766afc9 to 47a71c5 Compare December 22, 2023 15:53

marko-bekhta force-pushed the feat/similarity-filter-knn branch 2 times, most recently from e1f3275 to 7b234ea Compare January 8, 2024 17:02

marko-bekhta changed the title ~~WIP: filter out documents with vectors below a "similarity limit"~~ HSEARCH-5052 Filter out documents with vectors below a "similarity limit" Jan 8, 2024

marko-bekhta force-pushed the feat/similarity-filter-knn branch from 7b234ea to ada4bd0 Compare January 9, 2024 10:40

marko-bekhta force-pushed the feat/similarity-filter-knn branch 3 times, most recently from 3ba160f to 9c7958b Compare January 17, 2024 14:22

marko-bekhta marked this pull request as ready for review January 17, 2024 14:22

marko-bekhta force-pushed the feat/similarity-filter-knn branch 2 times, most recently from 728deba to 441a9d0 Compare January 17, 2024 15:30

yrodiere approved these changes Jan 17, 2024

documentation/src/main/asciidoc/public/reference/_search-dsl-predicate.adoc

marko-bekhta force-pushed the feat/similarity-filter-knn branch from 441a9d0 to 53f75d3 Compare January 18, 2024 12:38

HSEARCH-5052 Filter out documents with vectors below a "similarity limit"

d19cdc3

marko-bekhta force-pushed the feat/similarity-filter-knn branch from 53f75d3 to d19cdc3 Compare January 22, 2024 09:22

marko-bekhta commented Jan 22, 2024

...ava/org/hibernate/search/backend/lucene/lowlevel/query/impl/VectorSimilarityFilterQuery.java

yrodiere approved these changes Jan 22, 2024

yrodiere merged commit 3494304 into hibernate:main Jan 23, 2024
11 checks passed
