Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor DataChain.from_storage() to use new listing generator #294

Merged
merged 61 commits into from
Sep 2, 2024

Conversation

ilongin
Copy link
Contributor

@ilongin ilongin commented Aug 14, 2024

Refactoring DataChain.from_storage() to use new listing generator instead of calling underlying DatasetQuery indexing step which uses deprecated listing codebase.

For each listing requested in .from_storage() method, new dataset will be created with lst_ prefix and uri, e.g
DataChain.from_storage("s3://ldb-public/dogs-cats") -> this will create dataset with name lst_s3://ldb-public/dogs-cats which holds listing data.

Schema of listing dataset will always be {"file": "File@v1"}, but user can specify specific type (e,g ImageFile or TextFile) by adding argument type e.g .from_storage("s3://ldb-public", type="text") which would return chain with schema consisting of that specific type instead of generic File. This way we can, for example, filter out images from chain and set correct image type, e.g

# listing is done in first from_storage() call where schema is File but reading it as ImageFile
img_dc = DataChain.from_storage("gs://datachain-demo", type="image").filter(C("file.path").glob("*.jpg"))
# using cached listing from above but reading it as TextFile this time
txt_dc = DataChain.from_storage("gs://datachain-demo", type="text").filter(C("file.path").glob("*.txt"))  

Partials listing is also done without partials table which makes them obsolete.
Storage table is also obsolete as now all information is in that special listing dataset from above.

Next steps:

  • Make listing caching lazy
  • Refactor CLI to use DataChain.from_storage() to enlist sources
  • Remove obsolete codebase

@ilongin ilongin marked this pull request as draft August 14, 2024 00:26
@ilongin ilongin linked an issue Aug 14, 2024 that may be closed by this pull request
@ilongin ilongin changed the base branch from main to ilongin/244-generator-to-list-bucket August 14, 2024 00:26
Copy link

cloudflare-workers-and-pages bot commented Aug 14, 2024

Deploying datachain-documentation with  Cloudflare Pages  Cloudflare Pages

Latest commit: 62045cd
Status: ✅  Deploy successful!
Preview URL: https://d42b5923.datachain-documentation.pages.dev
Branch Preview URL: https://ilongin-266-refactor-from-st.datachain-documentation.pages.dev

View logs

Copy link

codecov bot commented Aug 14, 2024

Codecov Report

Attention: Patch coverage is 93.67089% with 5 lines in your changes missing coverage. Please review.

Project coverage is 86.79%. Comparing base (315816e) to head (62045cd).
Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
src/datachain/lib/meta_formats.py 0.00% 3 Missing ⚠️
src/datachain/lib/dc.py 91.66% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #294      +/-   ##
==========================================
+ Coverage   86.74%   86.79%   +0.04%     
==========================================
  Files          92       92              
  Lines       10072    10124      +52     
  Branches     2046     2055       +9     
==========================================
+ Hits         8737     8787      +50     
- Misses        986      988       +2     
  Partials      349      349              
Flag Coverage Δ
datachain 86.72% <93.67%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Base automatically changed from ilongin/244-generator-to-list-bucket to main August 15, 2024 22:42
@ilongin ilongin force-pushed the ilongin/266-refactor-from-storage branch from 5b5c823 to 1df4c26 Compare August 15, 2024 23:54
@ilongin ilongin requested review from shcheklein and dmpetrov August 27, 2024 12:21
src/datachain/lib/listing.py Outdated Show resolved Hide resolved
src/datachain/lib/file.py Outdated Show resolved Hide resolved
src/datachain/lib/dc.py Outdated Show resolved Hide resolved
Copy link
Member

@shcheklein shcheklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems just a few things left. I'm fine if we hide client_config / catalog under kwargs for now (as it was before). session in some APIs is fine as well.

Re the globs - if it was the same logic as before - only /* is supported - I'm fine with that for this PR

There are a few questions left by Ronan and I had a new minor one I think.

I resolved almost everything else.

Let's get this merged soon :) Thanks for your patience @ilongin !

@ilongin ilongin requested review from rlamy and shcheklein August 29, 2024 16:43
Copy link
Member

@dmpetrov dmpetrov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! PLease remove the kwargs and config and it;s good to merge.

@ilongin ilongin merged commit bd52a96 into main Sep 2, 2024
38 checks passed
@ilongin ilongin deleted the ilongin/266-refactor-from-storage branch September 2, 2024 09:21
@rlamy rlamy mentioned this pull request Sep 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Refactor .from_storage() to use listing generator
6 participants