HSEARCH-5052 Filter out documents with vectors below a "similarity limit" #3876

Merged
merged 1 commit into from
Jan 23, 2024

Conversation

marko-bekhta
Member

@marko-bekhta marko-bekhta commented Dec 15, 2023

https://hibernate.atlassian.net/browse/HSEARCH-5052

Hey Yoann 😃
You've probably seen the discussion we had yesterday on Zulip here: https://infinispan.zulipchat.com/#narrow/stream/118645-infinispan/topic/vector.20search/near/407994976
I thought I'd give it a try and see what we can do to filter things out on our side; I've looked into creating custom queries, and this is what I've put together. Do you think this could be something useful to add?

@hibernate-github-bot

hibernate-github-bot bot commented Dec 15, 2023

Thanks for your pull request!

This pull request appears to follow the contribution rules.

› This message was automatically generated.

Member

@yrodiere yrodiere left a comment

> Do you think this could be something useful to add?

For casual users, I very much doubt it's useful.

For advanced users... I can't say, I'm not familiar enough with the topic. I'm not sure the "similarity" value is meaningful enough for users to even be able to provide a meaningful limit, but then... I have little experience with this.

If you can find good use cases, and it's not too hard to support on Elasticsearch/OpenSearch, well... why not. But we'd have to be extra clear this is for very specific use cases.

@yrodiere
Member

> If you can find good use cases, and it's not too hard to support on Elasticsearch/OpenSearch, well... why not. But we'd have to be extra clear this is for very specific use cases.

Ok, this gives a rather interesting use case: https://www.elastic.co/guide/en/elasticsearch/reference/current/knn-search.html#knn-similarity-search

Basically it's not about relevance, it's about performance. A similarity filter stops the search early if the engine can't find any vector that is reasonably similar to the provided one, avoiding unnecessary computation. In other words, we know that vectors that have a similarity lower than a given number won't affect the score that much, so we don't even look for them.

So yes, I think you're right, this is a valuable feature. Please create a Jira?
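
To make the idea of a similarity limit concrete, here is a minimal sketch in plain Java. It is not the code from this pull request and does not use the Hibernate Search, Lucene, or Elasticsearch APIs: the class `SimilarityFilterSketch`, the method `knn`, and the parameter `minSimilarity` are hypothetical names chosen for illustration, and cosine similarity is assumed as the similarity function. It only shows the filtering semantics (candidates below the limit are never returned, even if fewer than `k` results remain). A real engine would additionally use the limit to stop traversing its vector index early, which is where the performance benefit described above comes from; this brute-force loop does not model that.

```java
import java.util.ArrayList;
import java.util.List;

final class SimilarityFilterSketch {

    /** Cosine similarity of two equal-length, non-zero vectors (an assumed similarity function). */
    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /**
     * Brute-force k-nearest-neighbours with a similarity limit (hypothetical helper).
     * Returns at most k document ids, most similar first, skipping every document
     * whose similarity to the query is below minSimilarity.
     */
    static List<Integer> knn(float[][] docs, float[] query, int k, double minSimilarity) {
        double[] sim = new double[docs.length];
        List<Integer> ids = new ArrayList<>();
        for (int id = 0; id < docs.length; id++) {
            sim[id] = cosine(docs[id], query);
            if (sim[id] >= minSimilarity) {
                ids.add(id); // only candidates at or above the limit are kept
            }
        }
        // Sort the surviving ids by descending similarity, then keep the top k.
        ids.sort((x, y) -> Double.compare(sim[y], sim[x]));
        return ids.size() > k ? new ArrayList<>(ids.subList(0, k)) : ids;
    }

    public static void main(String[] args) {
        float[][] docs = { {1f, 0f}, {0.9f, 0.1f}, {0f, 1f}, {-1f, 0f} };
        float[] query = {1f, 0f};
        System.out.println(knn(docs, query, 3, -1.0)); // no effective limit
        System.out.println(knn(docs, query, 3, 0.5));  // limit drops dissimilar vectors
    }
}
```

With the limit set to 0.5, the orthogonal and opposite vectors are dropped, so the search returns two hits even though `k` is 3. That is the behavior a user opts into with such a filter: fewer results, in exchange for not spending work on vectors that are too dissimilar to matter.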

@marko-bekhta marko-bekhta force-pushed the feat/similarity-filter-knn branch from 0d41677 to 766afc9 Compare December 22, 2023 15:34
@marko-bekhta marko-bekhta force-pushed the feat/similarity-filter-knn branch from 766afc9 to 47a71c5 Compare December 22, 2023 15:53
@marko-bekhta marko-bekhta force-pushed the feat/similarity-filter-knn branch 2 times, most recently from e1f3275 to 7b234ea Compare January 8, 2024 17:02
@marko-bekhta marko-bekhta changed the title WIP: filter out documents with vectors below a "similarity limit" HSEARCH-5052 Filter out documents with vectors below a "similarity limit" Jan 8, 2024
@marko-bekhta marko-bekhta force-pushed the feat/similarity-filter-knn branch from 7b234ea to ada4bd0 Compare January 9, 2024 10:40
@marko-bekhta marko-bekhta force-pushed the feat/similarity-filter-knn branch 3 times, most recently from 3ba160f to 9c7958b Compare January 17, 2024 14:22
@marko-bekhta marko-bekhta marked this pull request as ready for review January 17, 2024 14:22
@marko-bekhta marko-bekhta force-pushed the feat/similarity-filter-knn branch 2 times, most recently from 728deba to 441a9d0 Compare January 17, 2024 15:30
Member

@yrodiere yrodiere left a comment

Thanks! LGTM, just one comment.

@marko-bekhta marko-bekhta force-pushed the feat/similarity-filter-knn branch from 441a9d0 to 53f75d3 Compare January 18, 2024 12:38
@marko-bekhta marko-bekhta force-pushed the feat/similarity-filter-knn branch from 53f75d3 to d19cdc3 Compare January 22, 2024 09:22

Quality Gate failed

Failed conditions

65.6% Coverage on New Code (required ≥ 80%)

See analysis details on SonarCloud

@yrodiere yrodiere merged commit 3494304 into hibernate:main Jan 23, 2024
11 checks passed