Running on a webdataset #8

nicolas-dufour · 2024-08-12T17:47:17Z

Hi,
Is there a simple way to run this code on a webdataset?

Thanks!

huyvvo · 2024-08-14T18:13:00Z

Hello,

This code takes as input embeddings stored in a single .npy file, so the format of the original dataset does not matter. To run the code, you can extract embeddings for all images in the webdataset and save them in a .npy file.

Hope this answers your question!

nicolas-dufour · 2024-08-16T13:19:44Z

Thanks!
But how do you deal with memory issues if the .npy file doesn't fit into memory for big datasets?

vkhalidov · 2024-08-26T07:51:10Z

how do you deal with memory issues if the .npy file doesn't fit into memory for big datasets?

Please note that mmap_mode="r" is used when loading the embeddings (see scripts/run_distributed_kmeans.py#L51), so data loading is fast and the RAM footprint is low.
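For readers unfamiliar with memory mapping, here is a minimal sketch of what loading with mmap_mode="r" looks like in NumPy. The file name and array sizes are made up for illustration; this is not the code from scripts/run_distributed_kmeans.py.

```python
import os
import tempfile

import numpy as np

# Hypothetical file and sizes, for illustration only.
path = os.path.join(tempfile.mkdtemp(), "embeddings.npy")
np.save(path, np.random.rand(1000, 16).astype(np.float32))

# mmap_mode="r" maps the file read-only instead of reading it into RAM.
embeddings = np.load(path, mmap_mode="r")

# Shape and dtype come from the .npy header; the array data is not read here.
print(type(embeddings).__name__, embeddings.shape, embeddings.dtype)

# np.array copies only the requested rows into an ordinary in-memory array.
batch = np.array(embeddings[100:200])
```

Only the rows that are actually indexed get read from disk, which is why the resident memory stays small even for a very large file.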

The full data for a given worker gets loaded in src/distributed_kmeans_gpu.py#L96, and this is more of a bottleneck for GPU memory than for RAM. However, this is parallelized across workers, so distributing the data across enough GPUs avoids GPU OOM.
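To get a feel for how many workers are needed, the sketch below splits the rows into contiguous blocks, one per worker, and estimates the size of the largest block. The helper names worker_bounds and per_worker_gib are hypothetical, and the even contiguous split is an assumption; the repository may partition the data differently.

```python
from typing import Tuple


def worker_bounds(n_rows: int, rank: int, world_size: int) -> Tuple[int, int]:
    """Return [start, stop) of the contiguous row block assigned to `rank`.

    Assumes an even split where the first `n_rows % world_size` workers
    each take one extra row.
    """
    base, extra = divmod(n_rows, world_size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def per_worker_gib(n_rows: int, dim: int, world_size: int, itemsize: int = 4) -> float:
    """Rough size in GiB of the largest per-worker block (float32 by default)."""
    max_rows = -(-n_rows // world_size)  # ceiling division
    return max_rows * dim * itemsize / 2**30


if __name__ == "__main__":
    for rank in range(3):
        print(rank, worker_bounds(10, rank, 3))
```

With a memory-mapped array, each worker would then read only embeddings[start:stop], so the estimate from per_worker_gib is what has to fit on one GPU, on top of whatever the k-means computation itself allocates.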

huyvvo · 2024-10-08T09:56:58Z

Thanks! But how do you deal with memory issues if the .npy file doesn't fit into memory for big datasets?

To save embeddings into one big .npy file, you can use the "numpy.lib.format.open_memmap" functionality. Another solution is to save the embeddings into multiple files and write a class that imitates numpy.array, then replace the line scripts/run_distributed_kmeans.py#L51 with a call to that class's constructor.
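Both options can be sketched roughly as follows. The names write_embeddings and ShardedEmbeddings are hypothetical, and the class only supports len(), .shape, .dtype, integer indexing and contiguous slices; whether that is enough depends on how the script uses the array, so treat it as a starting point rather than a drop-in replacement.

```python
import bisect
import os
import tempfile

import numpy as np
from numpy.lib.format import open_memmap


def write_embeddings(path, batches, n_rows, dim, dtype=np.float32):
    """Stream batches into a single .npy file without holding them all in RAM."""
    out = open_memmap(path, mode="w+", dtype=dtype, shape=(n_rows, dim))
    row = 0
    for batch in batches:
        out[row:row + len(batch)] = batch
        row += len(batch)
    out.flush()  # make sure everything is written to disk
    del out
    return row


class ShardedEmbeddings:
    """Hypothetical read-only, array-like view over several .npy shards."""

    def __init__(self, paths):
        self.shards = [np.load(p, mmap_mode="r") for p in paths]
        # offsets[i] is the global index of the first row of shard i.
        self.offsets = [0]
        for shard in self.shards:
            self.offsets.append(self.offsets[-1] + len(shard))
        self.shape = (self.offsets[-1], self.shards[0].shape[1])
        self.dtype = self.shards[0].dtype

    def __len__(self):
        return self.shape[0]

    def __getitem__(self, key):
        if isinstance(key, slice):
            start, stop, step = key.indices(len(self))
            if step != 1:
                raise IndexError("this sketch only supports contiguous slices")
            parts = []
            for shard, off in zip(self.shards, self.offsets):
                lo, hi = max(start - off, 0), min(stop - off, len(shard))
                if lo < hi:
                    parts.append(shard[lo:hi])
            if not parts:
                return np.empty((0, self.shape[1]), dtype=self.dtype)
            return np.concatenate(parts)  # copies the selected rows into RAM
        if key < 0:
            key += len(self)
        if not 0 <= key < len(self):
            raise IndexError("row index out of range")
        i = bisect.bisect_right(self.offsets, key) - 1
        return np.array(self.shards[i][key - self.offsets[i]])


if __name__ == "__main__":
    tmp = tempfile.mkdtemp()
    data = np.arange(20, dtype=np.float32).reshape(10, 2)
    paths = [os.path.join(tmp, "part0.npy"), os.path.join(tmp, "part1.npy")]
    write_embeddings(paths[0], [data[0:3], data[3:6]], n_rows=6, dim=2)
    write_embeddings(paths[1], [data[6:10]], n_rows=4, dim=2)
    view = ShardedEmbeddings(paths)
    print(view.shape, view[4:8])
```

Note that open_memmap needs the total number of rows up front, so the embedding count has to be known (or counted in a first pass) before writing.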
