-
Notifications
You must be signed in to change notification settings - Fork 105
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added prefetch
to export files in parallel
#923
base: main
Are you sure you want to change the base?
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #923 +/- ##
==========================================
+ Coverage 87.68% 87.74% +0.06%
==========================================
Files 130 130
Lines 11714 11731 +17
Branches 1594 1594
==========================================
+ Hits 10271 10293 +22
+ Misses 1043 1039 -4
+ Partials 400 399 -1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
I like the simplicity of the implementation, @ilongin! However, prefetching uses a temporary cache when |
When datachain/src/datachain/lib/udf.py Lines 348 to 349 in de171c6
So, all the files are prefetched to a temporary location in the background. So what this PR does in that case is that it saves files to a temporary location first before "exporting" it back out from that cache. We set datachain/src/datachain/lib/file.py Lines 325 to 327 in de171c6
We also remove prefetched items after we run the mapper function too. And we remove the cache directory later. This change, as a result, also breaks the export of datachain/src/datachain/lib/udf.py Lines 310 to 313 in de171c6
Also, by "all the files are prefetched" above, I mean we prefetch all the datachain/src/datachain/lib/udf.py Lines 299 to 301 in de171c6
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks good to me, couple comments below about naming/docstrings.
Not approving yet since @skshetry have a reasonable comment above.
src/datachain/lib/dc.py
Outdated
@@ -2451,6 +2451,7 @@ def export_files( | |||
placement: FileExportPlacement = "fullpath", | |||
use_cache: bool = True, | |||
link_type: Literal["copy", "symlink"] = "copy", | |||
prefetch: Optional[int] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really sure prefetch
is a good param name in export_files
function. "Prefetch" is usually about loading data, not saving. I know it was named to be the same as in "from_storage" method, but still, should we discuss naming here? IMO something like "parallel" or "concurrent" will be a better option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to num_workers
as it's the most correct name I think
prefetch: number of workers to use for downloading files in advance. | ||
This is enabled by default and uses 2 workers. | ||
To disable prefetching, set it to 0. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
number of workers to use for downloading files in advance
Mistype here? We are exporting files, not downloading 👀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
a92a64e
to
9d301a8
Compare
Adding
num_workers
argument toDataChain.to_storage(...)
function to speedup exporting which was done one by one up until now.Performance changes (tested on 200 images from
s3://ldb-public/remote/data-lakes/dogs-and-cats/
and no cache):