Running on a webdataset #8
Comments
Hello, this code takes embeddings stored in a single .npy file as input, so the format of the original dataset does not matter. To run it, extract embeddings for all images in the webdataset and save them in a .npy file. Hope this answers your question!
Thanks!
Please note that the full data for a given worker gets loaded in src/distributed_kmeans_gpu.py#L96, and this is more of a bottleneck for GPU memory than for RAM. However, the load is parallelized across workers, so distributing across enough GPUs avoids GPU OOM.
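To get a rough feel for how many GPUs are needed, here is a minimal sketch that estimates the size of one worker's share of the embeddings. It assumes rows are split into contiguous, near-equal shards, which may not match how src/distributed_kmeans_gpu.py actually partitions the data; `shard_bounds` and `bytes_per_worker` are hypothetical helpers, not functions from the repository.

```python
import numpy as np


def shard_bounds(n_rows, rank, world_size):
    """Return the [start, stop) row range for one worker.

    Assumption: contiguous shards whose sizes differ by at most one row.
    """
    base, extra = divmod(n_rows, world_size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def bytes_per_worker(n_rows, dim, world_size, dtype=np.float32):
    """Size in bytes of the largest shard (rank 0 under the split above)."""
    start, stop = shard_bounds(n_rows, 0, world_size)
    return (stop - start) * dim * np.dtype(dtype).itemsize
```

Comparing `bytes_per_worker(...)` with the free memory on one GPU gives a lower bound on the number of workers; the clustering itself needs extra memory on top of the raw embeddings.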
To save embeddings into one big .npy file, you can use the "numpy.lib.format.open_memmap" functionality. Another solution is to save the embeddings into multiple files and use a class that imitates numpy.array, then replace the line at scripts/run_distributed_kmeans.py#L51 with that class's constructor.
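Both options can be sketched as follows. This is only an illustration under stated assumptions: `write_embeddings` and `ShardedEmbeddings` are hypothetical names, the total number of rows is assumed to be known up front, and the class supports only `len`, `shape`, `dtype`, integer indexing and contiguous slices. Which of these the repository's code actually needs is not stated in this thread, so check the call sites before relying on it.

```python
import numpy as np
from numpy.lib.format import open_memmap


def write_embeddings(path, batches, n_rows, dim, dtype=np.float32):
    """Stream batches of embeddings into a single .npy file on disk."""
    # open_memmap writes a valid .npy header, then exposes the data as a memmap.
    out = open_memmap(path, mode="w+", dtype=dtype, shape=(n_rows, dim))
    offset = 0
    for batch in batches:
        batch = np.asarray(batch, dtype=dtype)
        out[offset:offset + len(batch)] = batch
        offset += len(batch)
    out.flush()
    del out  # release the memory map
    if offset != n_rows:
        raise ValueError(f"expected {n_rows} rows, wrote {offset}")


class ShardedEmbeddings:
    """Read-only, array-like view over several .npy files (hypothetical)."""

    def __init__(self, paths):
        # mmap_mode="r" keeps the shards on disk until rows are accessed.
        self.shards = [np.load(p, mmap_mode="r") for p in paths]
        sizes = [len(s) for s in self.shards]
        self.offsets = [0] + [int(x) for x in np.cumsum(sizes)]
        self.shape = (self.offsets[-1], self.shards[0].shape[1])
        self.dtype = self.shards[0].dtype

    def __len__(self):
        return self.shape[0]

    def __getitem__(self, key):
        if isinstance(key, slice):
            start, stop, step = key.indices(len(self))
            if step != 1:
                raise NotImplementedError("only contiguous slices are supported")
            parts = []
            for shard, lo in zip(self.shards, self.offsets[:-1]):
                a, b = max(start - lo, 0), min(stop - lo, len(shard))
                if a < b:
                    parts.append(shard[a:b])
            if not parts:
                return np.empty((0, self.shape[1]), dtype=self.dtype)
            return np.concatenate(parts)
        idx = int(key)
        if idx < 0:
            idx += len(self)
        if not 0 <= idx < len(self):
            raise IndexError("row index out of range")
        # Find the shard whose row range contains idx.
        s = int(np.searchsorted(self.offsets, idx, side="right")) - 1
        return np.asarray(self.shards[s][idx - self.offsets[s]])
```

The single-file route keeps the downstream code unchanged, since the result loads with `np.load(path, mmap_mode="r")`. The multi-file route avoids one very large write at the cost of maintaining the wrapper class.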
Hi,
Is there a simple way to run this code on a webdataset?
Thanks!