Add a Multi-Vector Similarity Function #13991

Open
wants to merge 9 commits into
base: main

Conversation

vigyasharma
Contributor

This is a small first change towards adding support for multi-vectors. We start with adding a MultiVectorSimilarityFunction that can handle (late) interaction across multiple vector values.

This is the first of a series of splits for the larger prototype PR #13525
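A minimal sketch of what late interaction across two multi-vectors can look like, assuming the per-vector maxima are summed (ColBERT-style "sum of max similarity"). The class `SumMaxSimSketch`, the `dot` helper, and the `List<float[]>` representation are hypothetical stand-ins for illustration, not the API added by this PR.

```java
import java.util.List;

public class SumMaxSimSketch {

    /** Plain dot product; stands in for whatever single-vector similarity is configured. */
    static float dot(float[] a, float[] b) {
        float sum = 0f;
        for (int k = 0; k < a.length; k++) {
            sum += a[k] * b[k];
        }
        return sum;
    }

    /**
     * For every vector in {@code outer}, find its best match in {@code inner},
     * then sum those best-match scores.
     */
    static float sumMaxSim(List<float[]> outer, List<float[]> inner) {
        float total = 0f;
        for (float[] o : outer) {
            // Seeded with NEGATIVE_INFINITY so negative raw scores are handled too;
            // Float.MIN_VALUE is the smallest positive float, not the most negative one.
            float maxSim = Float.NEGATIVE_INFINITY;
            for (float[] i : inner) {
                maxSim = Math.max(maxSim, dot(o, i));
            }
            total += maxSim;
        }
        return total;
    }
}
```

The result is asymmetric in general: swapping `outer` and `inner` can change the score, so the query side would normally be the outer list.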

@vigyasharma
Contributor Author

vigyasharma commented Nov 13, 2024

I am thinking we can leverage the NONE aggregation (in #13525) for non-ColBERT passage vector use-cases, by making each graph node correspond to a single value in the multi-vector, i.e. index-time aggregation becomes "none". The resultant graph would be similar to what we construct with parent-child docs today, while flat storage with multi-vectors could allow for aggregated similarity checks at query time. This could help with recall while making multi-vectors easier to use (no overquery or index-time joins).

It'll need some work: a mechanism to address each vector value directly, and corresponding changes in VectorValues. I'm thinking maybe an "ordinal" for the multi-vector, and a "sub-ordinal" for values within the multi-vector. Both ints can be packed into a long for node value?
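One way the "two ints in a long" idea could look, as a sketch only: the multi-vector ordinal goes in the high 32 bits and the sub-ordinal in the low 32 bits. The class and method names here are hypothetical and do not come from the PR.

```java
public class NodeIdPackingSketch {

    /** Packs the multi-vector ordinal (high 32 bits) and sub-ordinal (low 32 bits). */
    static long pack(int ordinal, int subOrdinal) {
        // Mask the low half so a negative subOrdinal does not sign-extend into the high half.
        return ((long) ordinal << 32) | (subOrdinal & 0xFFFFFFFFL);
    }

    /** Recovers the multi-vector ordinal from a packed node value. */
    static int ordinal(long node) {
        return (int) (node >>> 32);
    }

    /** Recovers the position of the value within its multi-vector. */
    static int subOrdinal(long node) {
        return (int) node;
    }
}
```

With non-negative ordinals, packed values sort by ordinal first and sub-ordinal second, so all values of one multi-vector stay contiguous in node order.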

Since I haven't worked out all the details yet, I decided to remove the NONE aggregation for now and focus first on the ColBERT use case. Will go with the isMultiVector flag in FieldInfos to identify multi-vector storage requirements (and keep the NONE aggregation free for this lever).

cc: @cpoerschke , @benwtrent

for (float[] o : outerList) {
float maxSim = Float.MIN_VALUE;
for (float[] i : innerList) {
maxSim = Float.max(maxSim, vectorSimilarityFunction.compare(o, i));
Contributor

If we add another compare method with start and end indexes for both inner and outer, I guess we won't need to copy the array?

Contributor Author

Right, but it needs to go all the way down to VectorUtilSupport, which I think should be a PR of its own.
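To make the suggestion concrete, here is a rough sketch of an offset-based comparison, assuming the values of a multi-vector are packed back to back in a single `float[]`. The signatures are hypothetical; nothing here reflects an existing `VectorUtilSupport` method.

```java
public class SliceDotSketch {

    /** Dot product of two dim-length slices, read in place without copying. */
    static float dot(float[] a, int aOffset, float[] b, int bOffset, int dim) {
        float sum = 0f;
        for (int k = 0; k < dim; k++) {
            sum += a[aOffset + k] * b[bOffset + k];
        }
        return sum;
    }

    /** Best score of one query slice against every dim-length slice in packedDoc. */
    static float maxSim(float[] query, int qOffset, float[] packedDoc, int dim) {
        float best = Float.NEGATIVE_INFINITY;
        for (int off = 0; off + dim <= packedDoc.length; off += dim) {
            best = Math.max(best, dot(query, qOffset, packedDoc, off, dim));
        }
        return best;
    }
}
```

The trade-off is that every similarity implementation, including any vectorized one, would need an offset-aware variant.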

}

@Override
public float aggregate(
Contributor

It's so unfortunate that Java doesn't support generics over primitive arrays, so we have to duplicate the code.
