Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CrateDB: Vector Store -- make it work using CrateDB's vector_similarity() #31

Merged
merged 1 commit into from
Nov 4, 2024

Conversation

amotl
Copy link

@amotl amotl commented Nov 1, 2024

About

Before, the adapter used CrateDB's built-in _score field for ranking, which was wrong. Using vector_similarity() is correct.

Observations

  • ✅ It is no longer needed to use pytest.approx(), so the computed distance values are stable within the context of execution, for this version of CrateDB.
  • ✅ In the same spirit, the constraint UserWarning: Relevance scores must be between 0 and 1 #19 seems to be resolved now.
  • ⚠️ The vector to be searched for has to be passed twice: Once to the knn_match() function, and once more to the vector_similarity() function. This detail can't probably be optimized due to how that feature is aligned with SQL's design. Still, we wanted to mention that it might be an obstacle and/or a performance hog, and it might be interesting to research if CrateDB could theoretically do it the same way, for example how pgvector is doing it.

References

Backlog

/cc @surister, @hammerhead, @ckurze

Comment on lines +330 to +341
sqlalchemy.func.vector_similarity(
self.EmbeddingStore.embedding,
# TODO: Just reference the `embedding` symbol here, don't
# serialize its value prematurely.
# https://github.com/crate/crate/issues/16912
#
# Until that got fixed, marshal the arguments to
# `vector_similarity()` manually, in order to work around
# this edge case bug. We don't need to use JSON marshalling,
# because Python's string representation of a list is just
# right.
sqlalchemy.text(str(embedding)),
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Author

@amotl amotl Nov 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose we need a fix of crate/crate#16912 before proceeding?

Not quite. The snippet above uses a workaround, by marshalling the embedding argument to vector_similarity() manually, instead of submitting it as an SQL parameter. In this spirit, we can proceed, and submit an improvement later.

If you agree with this patch, and then also with langchain-ai#27710 in its current form, we can toggle it ready to be reviewed by upstream maintainers.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are ready for a bigger review, I am fine with toggling it ready 😄

Comment on lines 470 to +475
output = docsearch.similarity_search_with_relevance_scores("foo", k=3)
# Original score values: 1.0, 0.9996744261675065, 0.9986996093328621
assert output == [
(Document(page_content="foo", metadata={"page": "0"}), pytest.approx(1.4, 0.1)),
(Document(page_content="bar", metadata={"page": "1"}), pytest.approx(1.1, 0.1)),
(Document(page_content="baz", metadata={"page": "2"}), pytest.approx(0.8, 0.1)),
(Document(page_content="foo", metadata={"page": "0"}), 0.7071067811865475),
(Document(page_content="bar", metadata={"page": "1"}), 0.35355339059327373),
(Document(page_content="baz", metadata={"page": "2"}), 0.1414213562373095),
Copy link
Author

@amotl amotl Nov 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is why we need to have a custom _euclidean_relevance_score_fn() method: Otherwise, the scores of those documents would be assigned with 1 - x values, i.e. they would be returned in reversed order.

Comment on lines -498 to +494
search_kwargs={"k": 3, "score_threshold": 0.999},
search_kwargs={"k": 3, "score_threshold": 0.35}, # Original value: 0.999
Copy link
Author

@amotl amotl Nov 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's certainly an anomaly that had to be applied to an input value now, in order to get the expected result here.

assert output == [(Document(page_content="foo", metadata={"page": "0"}), 2.0)]
assert output == [(Document(page_content="foo", metadata={"page": "0"}), 1.0)]
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This spot, and others, perfectly demonstrates that vector_similarity() produces sensible values, now ranging between 0.0 and 1.0.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, _score is unbounded, which will not work in this case

Before, the adapter used CrateDB's built-in `_score` field for ranking.
Now, it uses the dedicated `vector_similarity()` function to compute the
similarity between two vectors.
@amotl amotl force-pushed the cratedb-vector_similarity branch from 565305f to 9ee8c03 Compare November 1, 2024 22:43
Copy link

@kneth kneth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose we need a fix of crate/crate#16912 before proceeding?

assert output == [(Document(page_content="foo", metadata={"page": "0"}), 2.0)]
assert output == [(Document(page_content="foo", metadata={"page": "0"}), 1.0)]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, _score is unbounded, which will not work in this case

@amotl amotl requested a review from kneth November 4, 2024 12:53
@amotl amotl merged commit 476d718 into cratedb-up/1/vector-store Nov 4, 2024
@amotl amotl deleted the cratedb-vector_similarity branch November 4, 2024 13:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants