Adding FSSpec Export for CSV and Parquet #516
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:
@@ Coverage Diff @@
## main #516 +/- ##
==========================================
+ Coverage 87.43% 87.46% +0.02%
==========================================
Files 97 97
Lines 10069 10089 +20
Branches 1374 1378 +4
==========================================
+ Hits 8804 8824 +20
Misses 908 908
Partials 357 357
Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
tests/func/test_datachain.py
Outdated
df1 = dc_from.select("first_name", "age", "city").to_pandas()
assert df1.equals(df)

# Cleanup any written files
qq - are we using real clouds? Do we care about cleanup here? If we do, should it be wrapped into a fixture (so that cleanup runs even if the test fails)?
This uses simulated clouds provided by pytest-servers, and the cleanup code is necessary to prevent test failures (the uploaded files are not cleaned up automatically). I moved the cleanup code into a fixture to keep it in one location.
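For illustration, a minimal sketch of a cleanup fixture along these lines, assuming fsspec is available in the test environment; the fixture name and structure are illustrative, not the exact code used in tests/func/test_datachain.py:

```python
import fsspec
import pytest


@pytest.fixture
def cleanup_written_files():
    """Collect URLs written by a test and delete them afterwards."""
    written: list[str] = []
    # Cleanup runs after the yield, so it happens even if the test fails.
    yield written
    for url in written:
        fs, path = fsspec.core.url_to_fs(url)
        if fs.exists(path):
            fs.rm(path, recursive=True)
```

A test would append each exported URL to the yielded list, and the files are removed once the test finishes, whether it passed or not.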
@@ -1887,6 +1889,7 @@ def to_parquet(
    path: Union[str, os.PathLike[str], BinaryIO],
    partition_cols: Optional[Sequence[str]] = None,
    chunk_size: int = DEFAULT_PARQUET_CHUNK_SIZE,
    fs_kwargs: Optional[dict[str, Any]] = None,
I have a concern here that we don't use fs_kwargs in the other places (e.g. anon=True, or from_parquet, which I think reads from a file object - file.get_fs() or something). Can we do a bit of research on that end and either unify this or get rid of the extra kwargs?
These kwargs are optional and are combined with the Catalog's client_config, which covers the unified use cases, i.e. configuration that applies to both read and write. I added this optional kwargs parameter for users who need write-only custom configuration, such as an access token that should be used on write but does not (or may not) apply on read or to the whole application / chain. For example, a token can be specified for Hugging Face filesystems on write, as described here: https://huggingface.co/docs/huggingface_hub/en/guides/hf_file_system#authentication - but users may only want to use that token when writing to Hugging Face, not for other clouds or on read. I can rename or change this as desired, but having extra write-only kwargs seems useful in some cases.
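For illustration, a hedged sketch of how this could look with a write-only Hugging Face token; the dataset path and token value are placeholders, and fs_kwargs is the parameter added in this PR:

```python
# The token is supplied only for this write via fs_kwargs; reads elsewhere
# keep using the shared client_config / environment settings.
chain.to_parquet(
    "hf://datasets/<user>/<dataset>/test.parquet",
    fs_kwargs={"token": "hf_xxx"},  # forwarded to the fsspec filesystem on write
)
```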
kk. It seems to me that this should be symmetrical (people might eventually need to provide something extra on reads as well), and then the question is: will we be able to do the same easily and without changing the logic in those methods (like from_parquet, etc.)?
Should we update the docs here, btw?
Yes, the docs should be updated - I updated them in the latest commit. Shared (read and write) kwargs can be provided in client_config on the Session or Catalog (as well as via environment variables), and those configuration settings are automatically used for both read and write. This fs_kwargs option just provides a way to supply write-only configuration, or to override the shared configuration, only when necessary.
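As a rough sketch of how the two interact (assuming a Session that accepts a client_config dict; exact import paths and signatures may differ between versions):

```python
from datachain import DataChain, Session  # import path is an assumption

# Shared settings: used for both reads and writes in this session.
session = Session("example", client_config={"anon": True})
chain = DataChain.from_parquet("s3://some-public-bucket/data.parquet", session=session)

# Write-only override: fs_kwargs applies just to this export.
chain.to_parquet("s3://my-output-bucket/out.parquet", fs_kwargs={"anon": False})
```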
The code looks good, but I think the docs should be updated to mention that fsspec URLs work as well.
Agreed, updated the docs in the latest commit.
This adds the ability to export / upload to FSSpec filesystems in to_csv and to_parquet, such as exporting directly to S3 or Hugging Face. This is done by passing the relevant URL path to the export functions, for example:

chain.to_parquet("s3://dtulga-datachain-test/test.parquet")
chain.to_csv("hf://datasets/dtulga/datachain-test/test.csv")

This has been tested manually with S3 and Hugging Face, as seen here: https://huggingface.co/datasets/dtulga/datachain-test/tree/main

This is part of #236 and #370.