Conversion between different vector formats #3
Comments
Once we convert to Arrow, we can potentially load the vectors into an RDBMS and compare against: https://github.com/pgvector/pgvector https://arxiv.org/abs/2308.14963 See section 4, alternatives.
Looking into this. So far, I have replicated the progress made in the Colab notebook in a Python script run on the UWaterloo student CS servers. The script converts the MS MARCO TCT_ColBERT-V2-HN+ index from FAISS to Arrow in about 25 seconds. Next Steps:
Please also try to reproduce this: https://github.com/castorini/anserini/blob/master/docs/regressions/regressions-msmarco-passage-cos-dpr-distil.md Here we use a custom JSON format - see if you can convert from Faiss to this custom format.
Created a separate repo for now to hold the utility lib code: https://github.com/mchlp/vectorutils
Additional resource: https://arrow.apache.org/docs/python/numpy.html
Hi @mchlp, are you still interested in working on this? See additional discussion in castorini/anserini#1956
I'm pretty busy right now, but I might be able to pick it up in a month if it's still open.
Closing in favor of #31 - which is a more accurate description of what we actually need.
Vectors are everywhere and a core part of what our group works on - dense vectors, sparse vectors, etc.
There are a gazillion different ways of storing them: numpy, faiss, csv, serialized as jsonl, etc. And there are custom binary formats, e.g., https://www.infoq.com/articles/apache-arrow-java/
We often have a need to convert from one format to another...
It'd be awesome to build a utility that converts between different formats.
@pratyushpal started working on this: https://colab.research.google.com/drive/1kKRvC6fjY_EJSfbjFTlReYAPGa1bSIcF?usp=sharing
But it'd be great to have someone push it forward...
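To make the idea a bit more concrete, here is a rough sketch of what the text-format side of such a utility could look like, using a NumPy matrix as the in-memory pivot between formats. Everything here is an assumption for illustration: the function names, the `id` / `vector` JSON keys, and the CSV layout (id in the first column, vector components in the rest) are not taken from the Colab notebook or from any existing format spec.

```python
import csv
import json

import numpy as np


def write_jsonl(path, ids, vectors):
    """Write one JSON object per line: {"id": ..., "vector": [...]}."""
    with open(path, "w", encoding="utf-8") as f:
        for doc_id, vec in zip(ids, vectors):
            record = {"id": doc_id, "vector": [float(x) for x in vec]}
            f.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Read the JSONL layout above back into (ids, float32 matrix)."""
    ids, rows = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            ids.append(record["id"])
            rows.append(record["vector"])
    return ids, np.asarray(rows, dtype=np.float32)


def write_csv(path, ids, vectors):
    """Write rows of the form: id, v0, v1, ..., v(dim-1)."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for doc_id, vec in zip(ids, vectors):
            writer.writerow([doc_id, *(float(x) for x in vec)])


def read_csv(path):
    """Read the CSV layout above back into (ids, float32 matrix)."""
    ids, rows = [], []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            if not row:
                continue
            ids.append(row[0])
            rows.append([float(x) for x in row[1:]])
    return ids, np.asarray(rows, dtype=np.float32)
```

With a common in-memory representation, converting between any two formats is just a read from one followed by a write to the other, e.g. `write_csv("out.csv", *read_jsonl("in.jsonl"))`. Binary formats such as FAISS indexes or Arrow files would need their own reader/writer pairs.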