Introduce @ operator for File path manipulations #299
Comments
That's an amazing idea! We definitely need to simplify the usage of vectorized operations.
There are two possible approaches:
(1) seems straightforward, and we should start with it, while (2) has bigger potential (users can define custom vectorized operations) but requires additional research.
Is there a reason some of the
If merge accepted
images.merge(meta, on=path.file_stem(C("file.path")), right_on=path.file_stem(C("meta_file.path")))
Using this approach we wouldn't have to map our custom SQL functions to text identifiers and/or search for
We could also write a wrapper function that takes a string and returns the required SQL function, i.e.
def stem(column_name):
    # Use the argument instead of a hard-coded column name
    return path.file_stem(C(column_name))
which would make the final merge syntax something like:
images.merge(meta, on=stem("file.path"), right_on=stem("meta_file.path"))
Using columns, though, would mean that users can join using any SQL function available through SQLAlchemy.
or do we want to move all functions away from the
It's the best we can do as the first step. Later, we will figure out whether anything else is needed or whether this is enough.
I've still got a few things to work through, but this is how I think the
wds_images = (
DataChain.from_storage(IMAGE_TARS)
.settings(cache=True)
.gen(laion=process_webdataset(spec=WDSLaion), params="file")
)
wds_with_pq = (
DataChain.from_parquet(PARQUET_METADATA)
.settings(cache=True)
.merge(wds_images, on="uid", right_on="laion.json.uid", inner=True)
)
wds_npz = (
DataChain.from_storage(NPZ_METADATA)
.settings(cache=True)
.gen(emd=process_laion_meta)
)
res = wds_npz.merge(
wds_with_pq,
on=[path.file_stem(wds_npz.c("emd.file.path")), "emd.index"],
right_on=[path.file_stem(wds_with_pq.c("source.file.path")), "source.index"],
inner=True,
).save("wds")
res.show(5)
Is this acceptable?
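The merge above joins on a list that mixes a computed expression (the file stem) with a plain column name (the index). As a rough, library-independent sketch of that idea, the snippet below builds a composite key from a list of specs, where each spec is either a field name or a callable. The helpers `make_key` and `inner_join`, the flat record layout, and the sample paths are all hypothetical and only for illustration; this is not how DataChain implements `merge`.

```python
from pathlib import PurePosixPath


def make_key(record: dict, specs: list) -> tuple:
    """Build a composite join key; each spec is a field name or a callable."""
    return tuple(spec(record) if callable(spec) else record[spec] for spec in specs)


def inner_join(left: list, right: list, on: list, right_on: list) -> list:
    """Hash-based inner join of two lists of dicts on composite keys."""
    index = {}
    for r in right:
        index.setdefault(make_key(r, right_on), []).append(r)
    return [{**l, **r} for l in left for r in index.get(make_key(l, on), [])]


# Hypothetical sample records standing in for the two chains
embeddings = [
    {"emd_path": "npz/part-0.npz", "emd_index": 0},
    {"emd_path": "npz/part-0.npz", "emd_index": 1},
]
sources = [
    {"source_path": "tars/part-0.tar", "source_index": 1, "uid": "abc"},
    {"source_path": "tars/part-1.tar", "source_index": 1, "uid": "xyz"},
]

rows = inner_join(
    embeddings,
    sources,
    # First key part is computed on the fly, second is a plain field name
    on=[lambda r: PurePosixPath(r["emd_path"]).stem, "emd_index"],
    right_on=[lambda r: PurePosixPath(r["source_path"]).stem, "source_index"],
)
```

Only the record whose stem and index both match survives the join, which mirrors the `inner=True` behaviour in the snippet above.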
I've updated both
Description
There are a number of common operations on File paths, e.g. {stem, dir, basename, ext}, which are poorly supported.
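For readers unfamiliar with these four components, here is a minimal sketch of what each one means, using only the Python standard library. The helper name `path_parts` is hypothetical, and the assumption that the proposed path functions follow `pathlib` semantics (for instance, stripping only the last extension) is mine, not something the issue confirms.

```python
from pathlib import PurePosixPath


def path_parts(path: str) -> dict[str, str]:
    """Split a storage-style path into stem, dir, basename, and ext."""
    p = PurePosixPath(path)
    return {
        "stem": p.stem,                # file name without its last extension
        "dir": str(p.parent),          # everything before the file name
        "basename": p.name,            # file name including the extension
        "ext": p.suffix.lstrip("."),   # last extension without the leading dot
    }


parts = path_parts("images/cats/001.jpg")
# parts == {"stem": "001", "dir": "images/cats", "basename": "001.jpg", "ext": "jpg"}
```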
For example, if we need to merge two datasets "images" and "meta" by the file stem, we currently require many manipulations to create and remove merge keys – which is not ideal:
A preferred way is to create path-related functions on the fly:
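As a library-independent sketch of what merging by file stem with on-the-fly keys could look like, the snippet below joins two plain lists of paths using key functions instead of temporary key columns. The names `stem` and `merge_by_key` and the sample paths are hypothetical; a real implementation would presumably push the key expression down to SQL rather than compute it in Python.

```python
from pathlib import PurePosixPath


def stem(path: str) -> str:
    """Return the file name without its last extension."""
    return PurePosixPath(path).stem


def merge_by_key(left: list, right: list, on, right_on) -> list:
    """Inner-join two lists of paths on keys computed on the fly."""
    index = {}
    for r in right:
        index.setdefault(right_on(r), []).append(r)
    # No merge-key column is ever stored on either side
    return [(l, r) for l in left for r in index.get(on(l), [])]


images = ["img/001.jpg", "img/002.jpg", "img/003.jpg"]
meta = ["meta/001.json", "meta/003.json"]

pairs = merge_by_key(images, meta, on=stem, right_on=stem)
# pairs == [("img/001.jpg", "meta/001.json"), ("img/003.jpg", "meta/003.json")]
```

Because the key is derived at join time, there is nothing to create beforehand or clean up afterwards, which is the manipulation the description calls out as not ideal.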