Cache hf images from memory fs #430

Closed · wants to merge 5 commits
Conversation

dberenbaum
Contributor

One more attempt to explore caching images from Hugging Face datasets. In this approach, each image is saved to a MemoryFileSystem and then cached as a regular image file, which means the cache is always on for these objects.
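For illustration, here is a minimal sketch of that flow, assuming fsspec's MemoryFileSystem and PIL; the `stage_image` helper and the cache hand-off at the end are hypothetical, not this PR's code:

```python
# Minimal sketch of the approach, assuming fsspec's MemoryFileSystem;
# stage_image and the cache hand-off are hypothetical, not this PR's code.
import io

from fsspec.implementations.memory import MemoryFileSystem
from PIL import Image

fs = MemoryFileSystem()  # fsspec's in-memory FS (store is shared process-wide)


def stage_image(img: Image.Image, key: str) -> str:
    """Serialize a decoded HF image into the memory filesystem, return its path."""
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    path = f"/{key}.png"
    with fs.open(path, "wb") as f:
        f.write(buf.getvalue())
    return path


# From here the bytes can be copied into the regular file cache, e.g. by
# reading fs.cat_file(path) and writing to the cache location, so downstream
# reads treat the object like any other cached image file.
```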

Here is the performance for an example:

>>> from datachain import C, DataChain
>>> chain = DataChain.from_hf("lmms-lab/COCO-Caption2017", split="val")
>>> chain = chain.save("coco")
Processed: 1 rows [00:00, 966.21 rows/s]
Generated: 1 rows [00:00, 1028.77 rows/s]
Parsed Hugging Face dataset: 5000 rows [02:29, 33.40 rows/s]
Processed: 1 rows [02:29, 149.79s/rows]
Generated: 5000 rows [02:27, 33.94 rows/s]
Saving: 5000 rows [00:00, 40493.06 rows/s]
Cleanup: 2 tables [00:00, 124.91 tables/s]
>>> images = list(chain.select("image").to_pytorch())
Saving: 5000 rows [00:00, 49880.65 rows/s]
Parsed PyTorch dataset for rank=0 worker: 5000 rows [00:16, 295.62 rows/s]

Performance during chain.save() may be hurt by the lack of an async implementation for MemoryFileSystem. It may be possible to further optimize how the images are saved to the cache.
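For context on that async gap, one generic workaround (a sketch only, not code from this PR) is to offload the blocking MemoryFileSystem calls to a worker thread so they don't stall an event loop; a native async filesystem would avoid the thread hop entirely:

```python
# Hedged sketch: MemoryFileSystem is synchronous, so async callers can at
# least push its reads onto a worker thread. Generic workaround, not this PR.
import asyncio

from fsspec.implementations.memory import MemoryFileSystem

fs = MemoryFileSystem()


async def read_bytes(path: str) -> bytes:
    # cat_file is a blocking fsspec call; run it in a thread to avoid
    # blocking the event loop.
    return await asyncio.to_thread(fs.cat_file, path)
```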

Here is the same on the main branch for comparison:

>>> from datachain import C, DataChain
>>> chain = DataChain.from_hf("lmms-lab/COCO-Caption2017", split="val")
>>> chain = chain.save("coco")
Processed: 1 rows [00:00, 735.84 rows/s]
Generated: 1 rows [00:00, 765.80 rows/s]
Parsed Hugging Face dataset: 5000 rows [02:31, 32.97 rows/s]
Processed: 1 rows [02:32, 152.83s/rows]
Generated: 5000 rows [02:30, 33.24 rows/s]
Saving: 5000 rows [00:01, 4584.52 rows/s]
Cleanup: 2 tables [00:00,  2.88 tables/s]
>>> images = list(chain.select("image").to_pytorch())
Saving: 5000 rows [00:01, 3841.78 rows/s]
Parsed PyTorch dataset for rank=0 worker: 5000 rows [00:15, 329.91 rows/s]

Performance looks similar either way, so it comes down to a preference between storing the images in the warehouse database or in the file cache.

shcheklein
Member

Yep, thanks @dberenbaum. It's hard for me to tell, tbh, if one is better. If we don't see a difference (btw, performance might be a bit different with SaaS?), then I would keep it as is for now (data in DB), and we should probably leave data-for-training optimizations downstream (there can be a variety of those: special FS, bundling, etc.). WDYT?

dberenbaum
Contributor Author

Sounds good @shcheklein. This was more to create a record in case there is ever a desire to come back to it. Closing for now.

dberenbaum closed this Sep 12, 2024