Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow thresholding on vector and fulltext indexes for Hybrid retrievers #239

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

willtai
Copy link
Contributor

@willtai willtai commented Jan 3, 2025

Description

Allow thresholding on vector and fulltext indexes for Hybrid retrievers. Two thresholds can be provided by the user during search to determine the importance of the search results from either vector or fulltext index.

Type of Change

  • New feature
  • Bug fix
  • Breaking change
  • Documentation update
  • Project configuration change

Complexity

Complexity: Low

How Has This Been Tested?

  • Unit tests
  • E2E tests
  • Manual tests

Checklist

The following requirements should have been met (depending on the changes in the branch):

  • Documentation has been updated
  • Unit tests have been updated
  • E2E tests have been updated
  • Examples have been updated
  • New files have copyright header
  • CLA (https://neo4j.com/developer/cla/) has been signed
  • CHANGELOG.md updated if appropriate

@willtai willtai requested a review from a team as a code owner January 3, 2025 13:55
@willtai willtai force-pushed the hybrid-retriever-weight branch from 95dd2d9 to 4c1976c Compare January 3, 2025 13:56
@@ -159,6 +161,8 @@ def get_search_results(
query_text (str): The text to get the closest neighbors of.
query_vector (Optional[list[float]], optional): The vector embeddings to get the closest neighbors of. Defaults to None.
top_k (int, optional): The number of neighbors to return. Defaults to 5.
threshold_vector_index (float, optional): The minimum normalized score from the vector index to include in the top k search.
threshold_fulltext_index (float, optional): The minimum normalized score from the fulltext index to include in the top k search.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor point but users might not know or understand the normalisation process. It could be worth adding something to the docs with an example which explains a bit about what's going on with these parameters as I think on their own these descriptions might not be enough

@@ -37,12 +37,14 @@ def test_hybrid_search_basic() -> None:
"YIELD node, score "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could it be worth adding a test to check the threshold process is working as expected? i.e. the correct scores are set to zero, etc.

Copy link
Contributor

@alexthomas93 alexthomas93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a few minor points

@stellasia
Copy link
Contributor

I'd be interested in seeing an example of this.

Here are the points bothering me:

  • Aren't these parameters too query-dependent, since the range of normalized scores can vary a lot for each query?
  • What if most of the scores end up being 0? Then we do not have any ordering?
  • Is this compatible with the effective search ratio implemented in langchain? Typically, this ratio would bring more items in this part of the query and so change the normalization.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants