CrateDB: Vector Store -- make it work using CrateDB's vector_similarity()
#31
4f2a92c to 565305f
sqlalchemy.func.vector_similarity(
    self.EmbeddingStore.embedding,
    # TODO: Just reference the `embedding` symbol here, don't
    #       serialize its value prematurely.
    #       https://github.com/crate/crate/issues/16912
    #
    # Until that got fixed, marshal the arguments to
    # `vector_similarity()` manually, in order to work around
    # this edge case bug. We don't need to use JSON marshalling,
    # because Python's string representation of a list is just
    # right.
    sqlalchemy.text(str(embedding)),
I suppose we need a fix of crate/crate#16912 before proceeding?
Not quite. The snippet above uses a workaround: it marshals the embedding argument to `vector_similarity()` manually, instead of submitting it as an SQL parameter. In this spirit, we can proceed, and submit an improvement later.
If you agree with this patch, and then also with langchain-ai#27710 in its current form, we can toggle it ready to be reviewed by upstream maintainers.
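The workaround discussed above can be sketched without a database connection. The helper name `render_vector_literal` and its finite-value check are assumptions made for illustration; the patch itself simply passes `str(embedding)` to `sqlalchemy.text()`.

```python
import math


def render_vector_literal(embedding):
    """Render an embedding as a bracketed literal such as `[0.1, 0.2, 0.3]`.

    Hypothetical helper: the patch calls `str(embedding)` directly.
    """
    values = [float(value) for value in embedding]
    # Assumption: non-finite values have no useful literal form, so reject them.
    if not all(math.isfinite(value) for value in values):
        raise ValueError("embedding must contain only finite numbers")
    # Python's string representation of a list of floats already has the
    # bracketed, comma-separated shape.
    return str(values)


literal = render_vector_literal([0.1, 0.2, 0.3])
print(literal)
```

Because the value is inlined into the statement rather than bound as a parameter, coercing every element with `float()` first is one way to make sure only numbers end up in the SQL text.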
If you are ready for a bigger review, I am fine with toggling it ready 😄
  output = docsearch.similarity_search_with_relevance_scores("foo", k=3)
  # Original score values: 1.0, 0.9996744261675065, 0.9986996093328621
  assert output == [
-     (Document(page_content="foo", metadata={"page": "0"}), pytest.approx(1.4, 0.1)),
-     (Document(page_content="bar", metadata={"page": "1"}), pytest.approx(1.1, 0.1)),
-     (Document(page_content="baz", metadata={"page": "2"}), pytest.approx(0.8, 0.1)),
+     (Document(page_content="foo", metadata={"page": "0"}), 0.7071067811865475),
+     (Document(page_content="bar", metadata={"page": "1"}), 0.35355339059327373),
+     (Document(page_content="baz", metadata={"page": "2"}), 0.1414213562373095),
This is why we need a custom `_euclidean_relevance_score_fn()` method: otherwise, the scores of those documents would be mapped to `1 - x` values, i.e. the documents would be returned in reversed order.
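A minimal sketch of the effect described above. The two scoring functions and the similarity values are hypothetical stand-ins, not the adapter's actual code: one applies the `1 - x` mapping that suits distances, the other passes a similarity through unchanged.

```python
def inverted_relevance(value):
    # Suits distances (smaller is better); applied to a similarity,
    # it flips the ordering.
    return 1.0 - value


def passthrough_relevance(value):
    # Hypothetical custom function: a similarity is already "higher is better".
    return value


def rank(similarities, score_fn):
    """Return document names ordered by descending relevance score."""
    ordered = sorted(
        similarities.items(),
        key=lambda item: score_fn(item[1]),
        reverse=True,
    )
    return [name for name, _ in ordered]


# Illustrative similarity values, where "foo" is the closest match.
similarities = {"foo": 0.9, "bar": 0.5, "baz": 0.2}

ranked_inverted = rank(similarities, inverted_relevance)
ranked_passthrough = rank(similarities, passthrough_relevance)
print(ranked_inverted)
print(ranked_passthrough)
```

With the inverting function, the closest document ends up last, which is the reversal the comment refers to.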
- search_kwargs={"k": 3, "score_threshold": 0.999},
+ search_kwargs={"k": 3, "score_threshold": 0.35},  # Original value: 0.999
That's certainly an anomaly that had to be applied to an input value now, in order to get the expected result here.
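To illustrate why the threshold had to move, here is a small sketch of threshold filtering. It reuses the three score values from the test expectations earlier in this diff; pairing them with this particular retriever test is an assumption, and `filter_by_threshold` is a hypothetical helper.

```python
def filter_by_threshold(scored_documents, threshold):
    """Keep the names of documents whose score reaches the threshold."""
    return [name for name, score in scored_documents if score >= threshold]


# Score values as they appear in the similarity search test of this patch.
scored = [
    ("foo", 0.7071067811865475),
    ("bar", 0.35355339059327373),
    ("baz", 0.1414213562373095),
]

kept_with_original_threshold = filter_by_threshold(scored, 0.999)
kept_with_adjusted_threshold = filter_by_threshold(scored, 0.35)
print(kept_with_original_threshold)
print(kept_with_adjusted_threshold)
```

Under these assumed scores, a threshold of 0.999 would filter out every document, while 0.35 keeps the two closest ones.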
- assert output == [(Document(page_content="foo", metadata={"page": "0"}), 2.0)]
+ assert output == [(Document(page_content="foo", metadata={"page": "0"}), 1.0)]
This spot, and others, perfectly demonstrates that `vector_similarity()` produces sensible values, now ranging between 0.0 and 1.0.
Yes, `_score` is unbounded, which will not work in this case.
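For intuition about the bounded range, here is a sketch that assumes a Euclidean similarity of the form `1 / (1 + d²)`, where `d` is the Euclidean distance. That formula is an assumption about how the value is derived, not something stated in this thread, and `euclidean_similarity` is a hypothetical name.

```python
def euclidean_similarity(left, right):
    """Map squared Euclidean distance into the range (0, 1].

    Assumed formula: 1 / (1 + squared_distance).
    """
    if len(left) != len(right):
        raise ValueError("vectors must have the same number of dimensions")
    squared_distance = sum((a - b) ** 2 for a, b in zip(left, right))
    return 1.0 / (1.0 + squared_distance)


identical = euclidean_similarity([1.0, 1.0], [1.0, 1.0])
distant = euclidean_similarity([0.0, 0.0], [3.0, 4.0])
print(identical, distant)
```

Under this assumption, identical vectors score exactly 1.0, which would be consistent with the updated assertion above, and the score approaches 0.0 as the vectors move apart.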
Before, the adapter used CrateDB's built-in `_score` field for ranking. Now, it uses the dedicated `vector_similarity()` function to compute the similarity between two vectors.
565305f to 9ee8c03
About
Before, the adapter used CrateDB's built-in `_score` field for ranking, which was wrong. Using `vector_similarity()` is correct.

Observations
[…] `pytest.approx()`, so the computed distance values are stable within the context of execution, for this version of CrateDB.
[…] `knn_match()` function, and once more to the `vector_similarity()` function. This detail probably can't be optimized, due to how that feature is aligned with SQL's design. Still, we wanted to mention that it might be an obstacle and/or a performance hog, and it might be interesting to research whether CrateDB could theoretically do it the same way as, for example, pgvector.
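The observation about `knn_match()` and `vector_similarity()` can be made concrete with a sketch that builds such a statement as a string. It assumes that the value in question is the query embedding. The table name, column names, and the helper `build_similarity_query` are hypothetical, and the SQL actually emitted by the adapter may differ.

```python
def build_similarity_query(table, embedding, k):
    """Build a k-NN query in which the query vector appears twice."""
    vector = str([float(value) for value in embedding])
    return (
        # First occurrence: compute the similarity used for ranking.
        f"SELECT document, vector_similarity(embedding, {vector}) AS similarity "
        f"FROM {table} "
        # Second occurrence: select the k nearest neighbours.
        f"WHERE knn_match(embedding, {vector}, {int(k)}) "
        "ORDER BY similarity DESC"
    )


sql = build_similarity_query("embedding_store", [0.1, 0.2], 3)
print(sql)
```

For high-dimensional embeddings, the duplicated literal roughly doubles the size of the statement text, which is the overhead the observation points at.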
Backlog
UnsupportedOperationException: Can't handle Symbol [ParameterSymbol: $1]] when using JOINs and parameters to an aliased and sorted `vector_similarity()` together (crate/crate#16912)

/cc @surister, @hammerhead, @ckurze