Added prefetch to export files in parallel #923

Open · wants to merge 6 commits into main
Conversation

@ilongin (Contributor) commented Feb 13, 2025

Adds a num_workers argument to the DataChain.to_storage(...) function to speed up exporting, which until now was done one file at a time.

Performance results (tested on 200 images from s3://ldb-public/remote/data-lakes/dogs-and-cats/, no cache):

Num workers | 1   | 2   | 5 (default) | 10  | 15
Time        | 99s | 52s | 23s         | 13s | 10.5s
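
For context, a minimal usage sketch of the new argument (a sketch only: the top-level import and the exact call signature, including the final name of the new argument, follow this description and the discussion below and may differ):

from datachain import DataChain  # top-level import assumed

# The benchmark dataset used in the table above.
chain = DataChain.from_storage("s3://ldb-public/remote/data-lakes/dogs-and-cats/")

# num_workers is the argument added in this PR; with 10 workers the 200-image
# export dropped from ~99s (serial) to ~13s in the measurements above.
chain.to_storage("./dogs-and-cats", num_workers=10)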

@ilongin ilongin linked an issue Feb 13, 2025 that may be closed by this pull request

codecov bot commented Feb 13, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 87.74%. Comparing base (491aab4) to head (5204fa2).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #923      +/-   ##
==========================================
+ Coverage   87.68%   87.74%   +0.06%     
==========================================
  Files         130      130              
  Lines       11714    11731      +17     
  Branches     1594     1594              
==========================================
+ Hits        10271    10293      +22     
+ Misses       1043     1039       -4     
+ Partials      400      399       -1     
Flag        Coverage Δ
datachain   87.66% <100.00%> (+0.06%) ⬆️


@skshetry (Member)

I like the simplicity of the implementation, @ilongin! However, prefetching uses a temporary cache when cache=False, so this solution might not be ideal in that case.

@ilongin (Contributor, Author) commented Feb 13, 2025

> I like the simplicity of the implementation, @ilongin! However, prefetching uses a temporary cache when cache=False, so this solution might not be ideal in that case.

@skshetry can you explain why it's not ideal? I'm not sure I 100% understand the implication

@skshetry (Member) commented Feb 13, 2025

> I like the simplicity of the implementation, @ilongin! However, prefetching uses a temporary cache when cache=False, so this solution might not be ideal in that case.

> @skshetry can you explain why it's not ideal? I'm not sure I 100% understand the implication

When prefetch>0 and cache=False, prefetching uses a temporary cache.

if prefetch and not use_cache:
    return temporary_cache(tmp_dir, prefix="prefetch-")

So, all the files are prefetched to a temporary location in the background. What this PR does in that case is save files to a temporary location first, before "exporting" them back out from that cache. We set caching_enabled=True, so all the file operations will also try to use the cache. This is not ideal for performance.

self._set_stream(
    self._catalog, caching_enabled=True, download_cb=DEFAULT_CALLBACK
)

We also remove prefetched items after we run the mapper function, and we remove the cache directory later. As a result, this change also breaks export with the symlink link type.

try:
    catalog.cache.remove(obj)
except Exception as e:  # noqa: BLE001
    print(f"Failed to remove prefetched item {obj.name!r}: {e!s}")

Also, by "all the files are prefetched" above, I mean we prefetch all the File signals, which might also be suboptimal.

for obj in row:
    if isinstance(obj, File) and await obj._prefetch(download_cb):
        after_prefetch()
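
To make the cost concrete, here is a minimal sketch of the resulting data flow when cache=False (with hypothetical helper names such as temporary_prefetch_cache and export_without_cache; this is an illustration, not DataChain's actual implementation): every exported file is written twice, and anything linking into the temporary cache stops working once that cache is removed.

import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def temporary_prefetch_cache(prefix: str = "prefetch-"):
    # Throwaway cache directory, deleted on exit (mirrors the temporary cache above).
    cache_dir = Path(tempfile.mkdtemp(prefix=prefix))
    try:
        yield cache_dir
    finally:
        shutil.rmtree(cache_dir, ignore_errors=True)


def export_without_cache(remote_files, destination: Path) -> None:
    # remote_files: iterable of (name, fetch) pairs, where fetch() returns the file bytes.
    destination.mkdir(parents=True, exist_ok=True)
    with temporary_prefetch_cache() as cache_dir:
        for name, fetch in remote_files:
            cached = cache_dir / name
            cached.write_bytes(fetch())               # hop 1: prefetch into the temp cache
            shutil.copy2(cached, destination / name)  # hop 2: export back out of the cache
            cached.unlink()                           # prefetched item removed afterwards
            # A symlink created here instead of copy2() would point into cache_dir
            # and dangle as soon as the temporary cache is cleaned up.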

@dreadatour (Contributor) left a comment


Code looks good to me; a couple of comments below about naming/docstrings.
Not approving yet, since @skshetry has a reasonable comment above.

@@ -2451,6 +2451,7 @@ def export_files(
         placement: FileExportPlacement = "fullpath",
         use_cache: bool = True,
         link_type: Literal["copy", "symlink"] = "copy",
+        prefetch: Optional[int] = None,
Contributor

Not really sure prefetch is a good param name in the export_files function. "Prefetch" is usually about loading data, not saving. I know it was named to match the from_storage method, but still, should we discuss naming here? IMO something like "parallel" or "concurrent" would be a better option.

Contributor Author

Changed to num_workers, as I think that's the most accurate name.

Comment on lines +2466 to +2468
prefetch: number of workers to use for downloading files in advance.
    This is enabled by default and uses 2 workers.
    To disable prefetching, set it to 0.
Contributor

> number of workers to use for downloading files in advance

Typo here? We are exporting files, not downloading 👀

Contributor Author

Fixed


cloudflare-workers-and-pages bot commented Feb 17, 2025

Deploying datachain-documentation with Cloudflare Pages

Latest commit: 5204fa2
Status: 🚫 Build failed.

View logs

@ilongin ilongin force-pushed the ilongin/882-export-files-in-parallel branch from a92a64e to 9d301a8 on February 17, 2025 14:15
@ilongin ilongin requested a review from dreadatour February 17, 2025 14:16
Development

Successfully merging this pull request may close these issues.

Make export files async and parallel.
3 participants