Refactor DataChain.from_storage() to use new listing generator #294
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #294 +/- ##
==========================================
+ Coverage 86.74% 86.79% +0.04%
==========================================
Files 92 92
Lines 10072 10124 +52
Branches 2046 2055 +9
==========================================
+ Hits 8737 8787 +50
- Misses 986 988 +2
Partials 349 349
5b5c823 to 1df4c26
It seems just a few things are left. I'm fine if we hide client_config / catalog under kwargs for now (as it was before). session in some APIs is fine as well.
Re the globs: if it is the same logic as before (only /* is supported), I'm fine with that for this PR.
There are a few questions left by Ronan, and I had a new minor one, I think.
I resolved almost everything else.
Let's get this merged soon :) Thanks for your patience @ilongin!
Co-authored-by: Ronan Lamy <[email protected]>
Looks great! Please remove the kwargs and config and it's good to merge.
Refactoring DataChain.from_storage() to use the new listing generator instead of calling the underlying DatasetQuery indexing step, which uses the deprecated listing codebase.
For each listing requested in the .from_storage() method, a new dataset will be created with the lst_ prefix and the uri, e.g. DataChain.from_storage("s3://ldb-public/dogs-cats") will create a dataset named lst_s3://ldb-public/dogs-cats which holds the listing data.
The schema of a listing dataset will always be {"file": "File@v1"}, but the user can request a specific type (e.g. ImageFile or TextFile) by adding the argument type, e.g. from_storage("s3://ldb-public", type="text"), which returns a chain whose schema uses that specific type instead of the generic File. This way we can, for example, filter out images from the chain and set the correct image type.
Partial listings are also done without the partials table, which makes that table obsolete.
The storage table is also obsolete, as all of its information now lives in the special listing dataset described above.
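The naming and type-selection rules described above can be sketched in plain Python. This is only an illustration, not the code from this PR: the helper names listing_dataset_name and file_class_for, the stand-in File / TextFile / ImageFile dataclasses, and the "binary" and "image" keys are assumptions made for the example. Only the lst_ prefix, the class names, and type="text" come from the description.

```python
from dataclasses import dataclass
from typing import Dict, Type

# Prefix stated in the PR description.
LISTING_PREFIX = "lst_"


def listing_dataset_name(uri: str) -> str:
    """Hypothetical helper: name of the dataset holding the listing for `uri`."""
    return f"{LISTING_PREFIX}{uri}"


# Stand-in classes so the sketch is self-contained; the real File,
# TextFile and ImageFile live in the library and carry more fields.
@dataclass
class File:
    path: str


@dataclass
class TextFile(File):
    pass


@dataclass
class ImageFile(File):
    pass


# Only "text" appears in the description; the other keys are assumed.
FILE_TYPES: Dict[str, Type[File]] = {
    "binary": File,
    "text": TextFile,
    "image": ImageFile,
}


def file_class_for(file_type: str = "binary") -> Type[File]:
    """Hypothetical helper: map the `type` argument to a file class."""
    try:
        return FILE_TYPES[file_type]
    except KeyError:
        raise ValueError(f"unsupported file type: {file_type!r}") from None


print(listing_dataset_name("s3://ldb-public/dogs-cats"))
print(file_class_for("text").__name__)
```

The listing dataset itself always stores the generic File; in this sketch the more specific class is chosen only when the chain is built, which matches the behaviour the description gives for the type argument.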
Next steps:
CLI to use DataChain.from_storage() to list sources