Conversion between different vector formats #3

Closed
lintool opened this issue Aug 30, 2023 · 8 comments

Comments

@lintool
Member

lintool commented Aug 30, 2023

Vectors are everywhere and a core part of what our group works on - dense vectors, sparse vectors, etc.

There are a gazillion different ways of storing them: numpy, faiss, csv, serialized as jsonl, etc. And there are custom binary formats, e.g., https://www.infoq.com/articles/apache-arrow-java/

We often have a need to convert from one format to another...

It'd be awesome to build a utility that converts between different formats.

@pratyushpal started working on this: https://colab.research.google.com/drive/1kKRvC6fjY_EJSfbjFTlReYAPGa1bSIcF?usp=sharing

But it'd be great to have someone push it forward...

@lintool
Member Author

lintool commented Sep 10, 2023

Once we convert to arrow, we can potentially load up in an RDBMS and compare against: https://github.com/pgvector/pgvector

https://arxiv.org/abs/2308.14963

See section 4, alternatives.

@mchlp

mchlp commented Sep 11, 2023

Looking into this.

So far, I have replicated the progress made in the Colab notebook with a Python script running on the UWaterloo student CS servers. It converts the MS MARCO TCT_ColBERT-V2-HN+ index from FAISS to Arrow in about 25 seconds.

Next Steps:

  • Try loading arrow representation of vector into duckdb and querying it with SQL
  • Create a utility lib with a collection of classes / functions for converting between different formats for vectors

@lintool
Member Author

lintool commented Sep 11, 2023

Please also try to reproduce this: https://github.com/castorini/anserini/blob/master/docs/regressions/regressions-msmarco-passage-cos-dpr-distil.md

Here we use a custom JSON format - see if you can convert from Faiss to this custom format.
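As a rough sketch of what writing such a format could look like, the snippet below serializes (id, vector) pairs as one JSON object per line using only the standard library. The field names `docid` and `vector` are assumptions and should be checked against the regression page linked above; extracting the vectors from the Faiss index is not shown.

```python
import json


def write_vectors_jsonl(records, path):
    """Write (docid, vector) pairs as one JSON object per line."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for docid, vector in records:
            # Assumed field names; verify against the target format.
            obj = {"docid": str(docid), "vector": [float(x) for x in vector]}
            f.write(json.dumps(obj) + "\n")
            count += 1
    return count


def read_vectors_jsonl(path):
    """Read the file back into a list of (docid, vector) pairs."""
    with open(path, encoding="utf-8") as f:
        return [(obj["docid"], obj["vector"]) for obj in map(json.loads, f)]
```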

@mchlp

mchlp commented Sep 11, 2023

Created a separate repo for now to hold the utility lib code: https://github.com/mchlp/vectorutils

@lintool
Member Author

lintool commented Sep 23, 2023

Additional resource: https://arrow.apache.org/docs/python/numpy.html
PyArrow allows converting back and forth from NumPy arrays to Arrow Arrays.

@lintool
Member Author

lintool commented Mar 21, 2024

Hi @mchlp, are you still interested in working on this?

See additional discussion in castorini/anserini#1956

@mchlp

mchlp commented Mar 23, 2024

I'm pretty busy right now, but I might be able to pick it up in a month if it's still open

@lintool
Member Author

lintool commented Mar 23, 2024

Closing in favor of #31 - which is a more accurate description of what we actually need.

lintool closed this as completed Mar 23, 2024