Anserini: replace verbose json-based vector format with more compact binary encoding #31
Comments
Here are the steps I see to accomplishing this task:
This is the base encoding. I am trying to do all the data conversion within Python first, and then move everything over for integration testing between Java and Python, with help from @17Melissa:
@Panizghi good start. IIRC, @arjenpdevries's suggestion was one
Definitely looking into it. In that case, do we want a mapping between the tensors, or just two separate sets of tensors?
Just two separate sets for now.
Just tried saving docids and vectors from vectors.part00.jsonl into safetensors, organized in one directory, with help from @Panizghi: https://colab.research.google.com/drive/1uP5PDdplQDBp_Pd7lyh4FB5qKym-E-hR#scrollTo=x-EDkEw-yaSO
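For readers following along, here is a minimal sketch of the "two separate sets" idea, splitting JSONL records into a docid list and one flat float32 buffer aligned by position. It uses only the Python standard library and does not write safetensors itself. The field names `docid` and `vector`, the sample records, and the helper `split_records` are assumptions for illustration, not taken from the notebook above.

```python
import json
from array import array

# Hypothetical sample standing in for a few lines of a vectors JSONL file;
# the field names "docid" and "vector" are assumptions.
SAMPLE_JSONL = """\
{"docid": "d1", "vector": [0.5, -0.25, 1.0]}
{"docid": "d2", "vector": [0.0, 2.0, -1.5]}
"""


def split_records(lines):
    """Split JSONL records into a docid list and one flat float32 buffer."""
    docids = []
    flat = array("f")  # single-precision floats, stored contiguously
    dim = None
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        vector = record["vector"]
        if dim is None:
            dim = len(vector)
        elif len(vector) != dim:
            raise ValueError(f"inconsistent dimension for {record['docid']}")
        docids.append(record["docid"])
        flat.extend(vector)
    return docids, flat, dim


docids, flat, dim = split_records(SAMPLE_JSONL.splitlines())

# Row i of the vector buffer belongs to docids[i]; no explicit mapping is stored.
row1 = list(flat[dim:2 * dim])
print(docids[1], row1)
```

Keeping the two sets aligned by row index avoids storing an explicit docid-to-vector mapping, at the cost of requiring both sets to be written and read in the same order.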
I have a base draft for the DocumentGenerator & Collection: castorini/anserini@ff5aea7. I'm not sure whether using Jython is the best approach, but for testing it might not be the worst option for now :) PS: There is still a bit of a roundabout around the script before it can be tested, but I thought in the meantime it would be great if I could get some feedback!
Hi @Panizghi, thanks for pushing this forward. Can we avoid introducing Jython as a new dependency and write this in pure Java?
Hi, sure thing. Yes, absolutely, I am trying my best to write the interpreter myself and not introduce anything new :) @17Melissa gave me a new insight that I will add, and I'll keep you posted soon!
Hi! The PR is open at castorini/anserini#2515. The test we tried was with nfcorpus.
@valamuri2020 is also working on this. Should we also consider https://parquet.apache.org/ as an alternative format to safetensors?
Parquet format implementation: castorini/anserini#2582. The nfcorpus data is 17.5 MB in Parquet format versus 21.2 MB in safetensors.
That's pretty cool - is it due to much better compression happening in the parquet writer?
Okay @valamuri2020, one more wrinkle. The current jsonl files you're working with were originally created from Faiss by @MXueguang, so the full pipeline is: Faiss -> jsonl -> parquet. This might be lossy, so I'd like you to write a converter directly from Faiss, i.e., read from Faiss, write to parquet. Then feed that into the existing pipeline. The "ground truth" Faiss indexes are here: https://github.com/castorini/pyserini/blob/master/pyserini/prebuilt_index_info.py#L4437
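To make the "might be lossy" concern concrete, here is a small sketch of how one could check whether float32 values survive a trip through JSON text. It uses only the Python standard library. The helper names and the rounding precision are hypothetical, and it does not reflect how the actual jsonl files were produced. Whether the real pipeline loses anything depends on how the floats were serialized.

```python
import json
import struct


def to_float32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def float32_bits(x):
    """Return the 32-bit pattern of x after rounding to float32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def roundtrip_through_json(values, ndigits=None):
    """Serialize values to JSON text (optionally rounded first) and parse them back."""
    if ndigits is not None:
        values = [round(v, ndigits) for v in values]
    return json.loads(json.dumps(values))


def count_bit_mismatches(original, restored):
    """Count positions where the float32 bit patterns differ."""
    return sum(float32_bits(a) != float32_bits(b) for a, b in zip(original, restored))


# Stand-in for vector components read from a float32 index.
original = [to_float32(v) for v in (1 / 3, 2 / 3, 0.1234567)]

exact = roundtrip_through_json(original)               # full-precision text
rounded = roundtrip_through_json(original, ndigits=4)  # hypothetical truncated text

print("mismatches, full precision:", count_bit_mismatches(original, exact))
print("mismatches, 4 decimals:", count_bit_mismatches(original, rounded))
```

Running the same bit-level comparison between vectors read directly from Faiss and vectors read back from the jsonl files would show whether the intermediate step changed anything.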
Currently, for HNSW indexing in Anserini, we're reading a very verbose JSON text-based format, which is very inefficient. We want to replace it with a more efficient binary encoding.
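As a rough illustration of the size difference, the sketch below encodes the same vector once as a JSON line and once as packed float32 bytes. It uses only the Python standard library. The record layout, the field names `docid` and `vector`, and the dimension 768 are assumptions for illustration; this is neither Anserini's actual format nor the safetensors layout.

```python
import json
import random
import struct

DIM = 768  # assumed embedding dimensionality, for illustration only


def to_float32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def encode_json(docid, vector):
    """Encode one record as a JSON line (the verbose text format)."""
    return (json.dumps({"docid": docid, "vector": vector}) + "\n").encode("utf-8")


def encode_binary(docid, vector):
    """Hypothetical layout: docid length (uint16), docid bytes, dim (uint32), float32 values."""
    docid_bytes = docid.encode("utf-8")
    header = struct.pack("<H", len(docid_bytes)) + docid_bytes + struct.pack("<I", len(vector))
    return header + struct.pack(f"<{len(vector)}f", *vector)


def decode_binary(blob):
    """Inverse of encode_binary; returns (docid, list of floats)."""
    (docid_len,) = struct.unpack_from("<H", blob, 0)
    docid = blob[2:2 + docid_len].decode("utf-8")
    (dim,) = struct.unpack_from("<I", blob, 2 + docid_len)
    vector = list(struct.unpack_from(f"<{dim}f", blob, 2 + docid_len + 4))
    return docid, vector


rng = random.Random(0)
vector = [to_float32(rng.uniform(-1.0, 1.0)) for _ in range(DIM)]

as_json = encode_json("doc0", vector)
as_binary = encode_binary("doc0", vector)
print(f"JSON: {len(as_json)} bytes, binary: {len(as_binary)} bytes")
```

The binary record costs a fixed 4 bytes per component plus a small header, while the text form spends many characters on each decimal expansion.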
Additional background:
safetensors seems like the best bet.
If you want to work on this task, get started by doing the BEIR regressions here: https://github.com/castorini/anserini?tab=readme-ov-file#%EF%B8%8F-end-to-end-regression-experiments
In particular, do the BGE regressions on NFcorpus, which aligns with the onboarding exercise. If your personal machine isn't big enough to run the regression, the student linux environment should be sufficient.