CrateDB: Vector Store -- make it work using CrateDB's `vector_similarity()` #31

amotl · 2024-11-01T21:50:07Z

About

Before, the adapter used CrateDB's built-in _score field for ranking, which was wrong. Using vector_similarity() is correct.

Observations

✅ It is no longer needed to use pytest.approx(), so the computed distance values are stable within the context of execution, for this version of CrateDB.
✅ In the same spirit, the constraint UserWarning: Relevance scores must be between 0 and 1 #19 seems to be resolved now.
⚠️ The vector to be searched for has to be passed twice: Once to the knn_match() function, and once more to the vector_similarity() function. This detail can't probably be optimized due to how that feature is aligned with SQL's design. Still, we wanted to mention that it might be an obstacle and/or a performance hog, and it might be interesting to research if CrateDB could theoretically do it the same way, for example how pgvector is doing it.

References

Backlog

Vector Store: UnsupportedOperationException: Can't handle Symbol [ParameterSymbol: $1]] when using JOINs and parameters to an aliased and sorted vector_similarity() together crate/crate#16912

/cc @surister, @hammerhead, @ckurze

amotl · 2024-11-01T22:20:39Z

libs/community/langchain_community/vectorstores/cratedb/base.py

+                    sqlalchemy.func.vector_similarity(
+                        self.EmbeddingStore.embedding,
+                        # TODO: Just reference the `embedding` symbol here, don't
+                        #       serialize its value prematurely.
+                        #       https://github.com/crate/crate/issues/16912
+                        #
+                        # Until that got fixed, marshal the arguments to
+                        # `vector_similarity()` manually, in order to work around
+                        # this edge case bug. We don't need to use JSON marshalling,
+                        # because Python's string representation of a list is just
+                        # right.
+                        sqlalchemy.text(str(embedding)),


See also:

Vector Store: UnsupportedOperationException: Can't handle Symbol [ParameterSymbol: $1]] when using JOINs and parameters to an aliased and sorted vector_similarity() together crate/crate#16912

I suppose we need a fix of crate/crate#16912 before proceeding?

Not quite. The snippet above uses a workaround, by marshalling the embedding argument to vector_similarity() manually, instead of submitting it as an SQL parameter. In this spirit, we can proceed, and submit an improvement later.

If you agree with this patch, and then also with langchain-ai#27710 in its current form, we can toggle it ready to be reviewed by upstream maintainers.

If you are ready for a bigger review, I am fine with toggling it ready 😄

amotl · 2024-11-01T22:22:20Z

libs/community/tests/integration_tests/vectorstores/test_cratedb.py

    output = docsearch.similarity_search_with_relevance_scores("foo", k=3)
    # Original score values: 1.0, 0.9996744261675065, 0.9986996093328621
    assert output == [
-        (Document(page_content="foo", metadata={"page": "0"}), pytest.approx(1.4, 0.1)),
-        (Document(page_content="bar", metadata={"page": "1"}), pytest.approx(1.1, 0.1)),
-        (Document(page_content="baz", metadata={"page": "2"}), pytest.approx(0.8, 0.1)),
+        (Document(page_content="foo", metadata={"page": "0"}), 0.7071067811865475),
+        (Document(page_content="bar", metadata={"page": "1"}), 0.35355339059327373),
+        (Document(page_content="baz", metadata={"page": "2"}), 0.1414213562373095),


This is why we need to have a custom _euclidean_relevance_score_fn() method: Otherwise, the scores of those documents would be assigned with 1 - x values, i.e. they would be returned in reversed order.

See also: community: CrateDB: Vector Store langchain-ai/langchain#27710 (comment)

amotl · 2024-11-01T22:22:58Z

libs/community/tests/integration_tests/vectorstores/test_cratedb.py

-        search_kwargs={"k": 3, "score_threshold": 0.999},
+        search_kwargs={"k": 3, "score_threshold": 0.35},  # Original value: 0.999


That's certainly an anomaly that had to be applied to an input value now, in order to get the expected result here.

amotl · 2024-11-01T22:23:52Z

libs/community/tests/integration_tests/vectorstores/test_cratedb.py

-    assert output == [(Document(page_content="foo", metadata={"page": "0"}), 2.0)]
+    assert output == [(Document(page_content="foo", metadata={"page": "0"}), 1.0)]


This spot, and others, perfectly demonstrates that vector_similarity() produces sensible values, now ranging between 0.0 and 1.0.

Yes, _score is unbounded, which will not work in this case

Before, the adapter used CrateDB's built-in `_score` field for ranking. Now, it uses the dedicated `vector_similarity()` function to compute the similarity between two vectors.

kneth

I suppose we need a fix of crate/crate#16912 before proceeding?

kneth · 2024-11-04T11:04:55Z

libs/community/tests/integration_tests/vectorstores/test_cratedb.py

-    assert output == [(Document(page_content="foo", metadata={"page": "0"}), 2.0)]
+    assert output == [(Document(page_content="foo", metadata={"page": "0"}), 1.0)]


Yes, _score is unbounded, which will not work in this case

amotl requested review from kneth and matriv November 1, 2024 21:50

amotl force-pushed the cratedb-vector_similarity branch from 4f2a92c to 565305f Compare November 1, 2024 21:51

amotl mentioned this pull request Nov 1, 2024

community: CrateDB: Vector Store langchain-ai/langchain#27710

Closed

amotl commented Nov 1, 2024

View reviewed changes

CrateDB: Vector Store -- make it work using CrateDB's vector_similarity

9ee8c03

Before, the adapter used CrateDB's built-in `_score` field for ranking. Now, it uses the dedicated `vector_similarity()` function to compute the similarity between two vectors.

amotl force-pushed the cratedb-vector_similarity branch from 565305f to 9ee8c03 Compare November 1, 2024 22:43

kneth reviewed Nov 4, 2024

View reviewed changes

amotl requested a review from kneth November 4, 2024 12:53

kneth approved these changes Nov 4, 2024

View reviewed changes

amotl merged commit 476d718 into cratedb-up/1/vector-store Nov 4, 2024

amotl deleted the cratedb-vector_similarity branch November 4, 2024 13:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CrateDB: Vector Store -- make it work using CrateDB's `vector_similarity()` #31

CrateDB: Vector Store -- make it work using CrateDB's `vector_similarity()` #31

amotl commented Nov 1, 2024 •

edited

Loading

amotl Nov 1, 2024

amotl Nov 4, 2024 •

edited

Loading

kneth Nov 4, 2024

amotl Nov 1, 2024 •

edited

Loading

amotl Nov 1, 2024 •

edited

Loading

amotl Nov 1, 2024

kneth Nov 4, 2024

kneth left a comment

kneth Nov 4, 2024

		search_kwargs={"k": 3, "score_threshold": 0.999},
		search_kwargs={"k": 3, "score_threshold": 0.35}, # Original value: 0.999

		assert output == [(Document(page_content="foo", metadata={"page": "0"}), 2.0)]
		assert output == [(Document(page_content="foo", metadata={"page": "0"}), 1.0)]

CrateDB: Vector Store -- make it work using CrateDB's vector_similarity() #31

CrateDB: Vector Store -- make it work using CrateDB's vector_similarity() #31

Conversation

amotl commented Nov 1, 2024 • edited Loading

About

Observations

References

Backlog

amotl Nov 1, 2024

Choose a reason for hiding this comment

amotl Nov 4, 2024 • edited Loading

Choose a reason for hiding this comment

kneth Nov 4, 2024

Choose a reason for hiding this comment

amotl Nov 1, 2024 • edited Loading

Choose a reason for hiding this comment

amotl Nov 1, 2024 • edited Loading

Choose a reason for hiding this comment

amotl Nov 1, 2024

Choose a reason for hiding this comment

kneth Nov 4, 2024

Choose a reason for hiding this comment

kneth left a comment

Choose a reason for hiding this comment

kneth Nov 4, 2024

Choose a reason for hiding this comment

CrateDB: Vector Store -- make it work using CrateDB's `vector_similarity()` #31

CrateDB: Vector Store -- make it work using CrateDB's `vector_similarity()` #31

amotl commented Nov 1, 2024 •

edited

Loading

amotl Nov 4, 2024 •

edited

Loading

amotl Nov 1, 2024 •

edited

Loading

amotl Nov 1, 2024 •

edited

Loading