Cache hf images from memory fs #430

Closed · wants to merge 5 commits
Conversation

dberenbaum
Contributor

One more attempt to explore caching images from Hugging Face datasets. In this approach, each image is saved to a MemoryFileSystem and then cached as a regular image file, which means the cache is always on for these objects.
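For illustration, here is a minimal sketch of that flow, assuming fsspec's MemoryFileSystem and PIL; the `stage_image` helper and the cache hand-off at the end are hypothetical, not this PR's code:

```python
# Minimal sketch of the approach, assuming fsspec's MemoryFileSystem;
# stage_image and the cache hand-off are hypothetical, not this PR's code.
import io

from fsspec.implementations.memory import MemoryFileSystem
from PIL import Image

fs = MemoryFileSystem()  # fsspec's in-memory FS (store is shared process-wide)


def stage_image(img: Image.Image, key: str) -> str:
    """Serialize a decoded HF image into the memory filesystem, return its path."""
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    path = f"/{key}.png"
    with fs.open(path, "wb") as f:
        f.write(buf.getvalue())
    return path


# From here the bytes can be copied into the regular file cache, e.g. by
# reading fs.cat_file(path) and writing to the cache location, so downstream
# reads treat the object like any other cached image file.
```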

Here is the performance for an example:

>>> from datachain import C, DataChain
>>> chain = DataChain.from_hf("lmms-lab/COCO-Caption2017", split="val")
>>> chain = chain.save("coco")
Processed: 1 rows [00:00, 966.21 rows/s]
Generated: 1 rows [00:00, 1028.77 rows/s]
Parsed Hugging Face dataset: 5000 rows [02:29, 33.40 rows/s]
Processed: 1 rows [02:29, 149.79s/rows]
Generated: 5000 rows [02:27, 33.94 rows/s]
Saving: 5000 rows [00:00, 40493.06 rows/s]
Cleanup: 2 tables [00:00, 124.91 tables/s]
>>> images = list(chain.select("image").to_pytorch())
Saving: 5000 rows [00:00, 49880.65 rows/s]
Parsed PyTorch dataset for rank=0 worker: 5000 rows [00:16, 295.62 rows/s]

Performance during chain.save() may be hurt by the lack of an async implementation for MemoryFileSystem. It may be possible to further optimize how the images are saved to the cache.
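For context on that async gap, one generic workaround (a sketch only, not code from this PR) is to offload the blocking MemoryFileSystem calls to a worker thread so they don't stall an event loop; a native async filesystem would avoid the thread hop entirely:

```python
# Hedged sketch: MemoryFileSystem is synchronous, so async callers can at
# least push its reads onto a worker thread. Generic workaround, not this PR.
import asyncio

from fsspec.implementations.memory import MemoryFileSystem

fs = MemoryFileSystem()


async def read_bytes(path: str) -> bytes:
    # cat_file is a blocking fsspec call; run it in a thread to avoid
    # blocking the event loop.
    return await asyncio.to_thread(fs.cat_file, path)
```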

Here is the same on the main branch for comparison:

>>> from datachain import C, DataChain
>>> chain = DataChain.from_hf("lmms-lab/COCO-Caption2017", split="val")
>>> chain = chain.save("coco")
Processed: 1 rows [00:00, 735.84 rows/s]
Generated: 1 rows [00:00, 765.80 rows/s]
Parsed Hugging Face dataset: 5000 rows [02:31, 32.97 rows/s]
Processed: 1 rows [02:32, 152.83s/rows]
Generated: 5000 rows [02:30, 33.24 rows/s]
Saving: 5000 rows [00:01, 4584.52 rows/s]
Cleanup: 2 tables [00:00,  2.88 tables/s]
>>> images = list(chain.select("image").to_pytorch())
Saving: 5000 rows [00:01, 3841.78 rows/s]
Parsed PyTorch dataset for rank=0 worker: 5000 rows [00:15, 329.91 rows/s]

Performance looks similar either way, so it comes down to a preference between storing the images in the warehouse database or in the file cache.

shcheklein
Member

Yep, thanks @dberenbaum. It's hard for me to tell, tbh, if one is better. If we don't see a difference (btw, performance might be a bit different with SaaS?), then I would keep it as is for now (data in DB), and we should probably leave data-for-training optimizations downstream (there can be a variety of those: special FS, bundling, etc.). WDYT?

dberenbaum
Contributor Author

Sounds good @shcheklein. This was more to create a record in case there is ever a desire to come back to it. Closing for now.

dberenbaum closed this Sep 12, 2024