Running on a webdataset #8

Open
nicolas-dufour opened this issue Aug 12, 2024 · 4 comments

@nicolas-dufour

Hi,
Is there a simple way to run this code on a webdataset?

Thanks!

@huyvvo
Contributor

huyvvo commented Aug 14, 2024

Hello,

This code takes as input embeddings stored in a single .npy file, so the format of the original dataset is not relevant. To run the code, you can extract embeddings for all images in the webdataset and save them in a .npy file.
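A minimal sketch of that extraction step, under stated assumptions: `embed_batch` is a hypothetical stand-in for whatever encoder produces the embeddings, `batches` stands in for batches read from a webdataset loader, and `save_embeddings` and the output file name are illustrative names that do not come from the repository.

```python
import os
import tempfile

import numpy as np


def embed_batch(images):
    """Hypothetical stand-in for a real image encoder: one vector per image."""
    return np.stack([np.asarray(img, dtype=np.float32).reshape(-1)[:4] for img in images])


def save_embeddings(batches, out_path):
    """Embed every batch, concatenate the results, and write a single .npy file."""
    chunks = [embed_batch(batch) for batch in batches]
    embeddings = np.concatenate(chunks, axis=0)
    np.save(out_path, embeddings)
    return embeddings.shape


# Toy data: 3 batches of 2 fake 2x2 "images" each (in practice, batches from the webdataset).
batches = [[np.full((2, 2), i + j, dtype=np.float32) for j in range(2)] for i in range(3)]
out_path = os.path.join(tempfile.mkdtemp(), "embeddings.npy")
shape = save_embeddings(batches, out_path)
```

This version keeps all embeddings in RAM before saving, so it only suits datasets whose embeddings fit in memory.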

Hope this answers your question!

@nicolas-dufour
Author

Thanks!
But how do you deal with memory issues when the .npy file doesn't fit into memory for big datasets?

@vkhalidov

> how do you deal with memory issues then if the npy don't fit into memory for big datasets

Please note that the embeddings are loaded with mmap_mode="r" (see scripts/run_distributed_kmeans.py#L51), so data loading is fast and the RAM footprint is low.
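For illustration, here is a small sketch of what memory-mapped loading looks like with NumPy. The file and its contents are toy values created for the example. Only the `np.load(..., mmap_mode="r")` call mirrors what is described above.

```python
import os
import tempfile

import numpy as np

# Create a small .npy file to stand in for the real embeddings file.
path = os.path.join(tempfile.mkdtemp(), "embeddings.npy")
np.save(path, np.arange(20, dtype=np.float32).reshape(5, 4))

# mmap_mode="r" maps the file read-only; array data is paged in only when indexed.
X = np.load(path, mmap_mode="r")

# Copying a slice materializes just those rows in RAM.
first_rows = np.array(X[:2])
```

The shape and dtype of `X` are available immediately from the file header, while the rows themselves stay on disk until they are accessed.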

The full data for a given worker is loaded in src/distributed_kmeans_gpu.py#L96, which makes it more of a bottleneck for GPU memory than for RAM. However, this is parallelized across workers, so distributing the data across enough GPUs avoids GPU OOM errors.
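As a rough sketch of why adding workers lowers per-GPU memory, the snippet below splits the rows into contiguous, near-equal slices, one per worker. The `worker_slice` helper and the contiguous split are assumptions made for illustration. The repository's actual partitioning may differ.

```python
import numpy as np


def worker_slice(n_rows, rank, world_size):
    """Return [start, stop) for a contiguous, near-equal split of n_rows."""
    base, extra = divmod(n_rows, world_size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


# Stand-in for the memory-mapped embeddings (10 rows, 4 dimensions).
X = np.arange(40, dtype=np.float32).reshape(10, 4)
world_size = 4

# Each worker materializes only its own rows; in a real run it would then
# move that shard to its GPU.
shards = [
    np.ascontiguousarray(X[slice(*worker_slice(len(X), rank, world_size))])
    for rank in range(world_size)
]
```

With N rows and W workers, each worker holds roughly N / W rows, so doubling the number of GPUs roughly halves the data each one has to hold.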

@huyvvo
Contributor

huyvvo commented Oct 8, 2024

> Thanks! But how do you deal with memory issues then if the npy don't fit into memory for big datasets

To save embeddings into a large .npy file, you can use numpy.lib.format.open_memmap. Another solution is to save the embeddings into multiple files and write a class that imitates numpy.array, then replace the line at scripts/run_distributed_kmeans.py#L51 with a call to that class's constructor.
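Both options can be sketched as follows. Everything here is illustrative: the shapes and file names are toy values, and `ShardedArray` is a hypothetical class that supports only length, shape, integer indexing, and step-1 slicing. Whether that is enough for the repository's code depends on which array operations it uses, which is not verified here.

```python
import os
import tempfile

import numpy as np
from numpy.lib.format import open_memmap

tmp = tempfile.mkdtemp()

# Option 1: write one large .npy file incrementally, never holding it all in RAM.
big_path = os.path.join(tmp, "embeddings.npy")
out = open_memmap(big_path, mode="w+", dtype=np.float32, shape=(6, 4))
for start in range(0, 6, 2):
    out[start:start + 2] = np.full((2, 4), start, dtype=np.float32)  # one batch
out.flush()
del out  # release the memory map


# Option 2: several .npy shards behind a minimal array-like wrapper (hypothetical).
class ShardedArray:
    def __init__(self, paths):
        self.shards = [np.load(p, mmap_mode="r") for p in paths]
        self.offsets = np.cumsum([0] + [len(s) for s in self.shards])
        self.shape = (int(self.offsets[-1]),) + self.shards[0].shape[1:]
        self.dtype = self.shards[0].dtype

    def __len__(self):
        return self.shape[0]

    def __getitem__(self, key):
        if isinstance(key, slice):
            start, stop, step = key.indices(len(self))
            if step != 1:
                raise NotImplementedError("only step-1 slices are supported")
            parts = []
            for shard, off in zip(self.shards, self.offsets[:-1]):
                lo, hi = max(start - off, 0), min(stop - off, len(shard))
                if lo < hi:
                    parts.append(shard[lo:hi])
            if not parts:
                return np.empty((0,) + self.shape[1:], dtype=self.dtype)
            return np.concatenate(parts, axis=0)
        idx = int(key)
        if idx < 0:
            idx += len(self)
        if not 0 <= idx < len(self):
            raise IndexError("index out of range")
        s = int(np.searchsorted(self.offsets, idx, side="right")) - 1
        return np.array(self.shards[s][idx - self.offsets[s]])


shard_paths = []
for i in range(2):
    p = os.path.join(tmp, f"shard_{i}.npy")
    np.save(p, np.full((3, 4), i, dtype=np.float32))
    shard_paths.append(p)

sharded = ShardedArray(shard_paths)
window = sharded[2:5]  # spans the boundary between the two shards
```

The single-file route keeps the loading code unchanged, while the multi-file route avoids producing one very large file at the cost of maintaining the wrapper class.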
