Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce @ operator for File path manipulations #299

Closed
volkfox opened this issue Aug 14, 2024 · 6 comments · Fixed by #388
Closed

Introduce @ operator for File path manipulations #299

volkfox opened this issue Aug 14, 2024 · 6 comments · Fixed by #388
Assignees
Labels
enhancement New feature or request priority-p1

Comments

@volkfox
Copy link
Contributor

volkfox commented Aug 14, 2024

Description

There is a number of common operations to File path, e.g. {stem, dir, basename, ext} which are poorly supported.

For example, if we need to merge two datasets "images" and "meta" by the file stem, we currently require many manipulations to create and remove merge keys – which is not ideal:

import datachain.sql.functions.path

images = images.mutate(images_stem = path.stem("file.path")
meta = meta.mutate(meta_stem = path.stem("file.path")

annotated_images = images.merge(meta, on="image_stem", right_on="meta_stem")

annotated_images.select_except(["image_stem", "meta_stem"])

A preferred way is to create path-related functions on the fly:

annotated = images.merge(meta, on="file.path@stem", right_on="meta_file.path@stem")
@volkfox volkfox added enhancement New feature or request priority-p2 labels Aug 14, 2024
@dmpetrov
Copy link
Member

That's amazing idea! We definitely need to simplify usage of the vectorized operations.

annotated = images.merge(meta, on="file.path@stem", right_on="meta_file.path@stem")

There are possible two approaches:

  1. Apply general functions to columns: on="file.path@stem" (applies built-in stem() to columns)
  2. Introduce special methods in data-model classes: on="file@stem" (where stem() is object method - it's a special method that does the translation). @shcheklein was sharing similar ideas in the latest night-sync-up meeting.

(1) seems straightforward and we should start with this while (2) has a bigger potential - users can define custom vectorized operation - but it requires additional research.

@mattseddon
Copy link
Member

Is there a reason some of the DataChain methods accept strings (e.g. merge) and others accept Columns (e.g. filter)?

If merge accepted Columns in the same way that filter does then the syntax could be as follows:

images.merge(meta, on=path.file_stem(C("file.path")), right_on=path.file_stem(C("meta_file.path")))

Using this approach we wouldn't have to map our custom SQL functions to text identifiers and/or search for @ symbols in the provided strings.

We could also write a wrapper function that takes a string and returns the required SQL function. I.e

def stem(column_name):
    return path.file_stem(C("file.path"))

which would make the final merge syntax something like:

images.merge(meta, on=stem("file.path"), right_on=stem("meta_file.path"))

Using columns though would mean that users can join using any SQL function available through SQLAlchemy.

@mattseddon
Copy link
Member

or do we want to move all functions away from the C syntax?

@dmpetrov
Copy link
Member

dmpetrov commented Sep 2, 2024

It's the best we can do as the 1st step. Later, we will figure out if anything else is needed or this is enough.

@mattseddon mattseddon self-assigned this Sep 3, 2024
@mattseddon
Copy link
Member

I've still got a few things to work through but this is how I think the wds example will end up looking:

wds_images = (
    DataChain.from_storage(IMAGE_TARS)
    .settings(cache=True)
    .gen(laion=process_webdataset(spec=WDSLaion), params="file")
)

wds_with_pq = (
    DataChain.from_parquet(PARQUET_METADATA)
    .settings(cache=True)
    .merge(wds_images, on="uid", right_on="laion.json.uid", inner=True)
)

wds_npz = (
    DataChain.from_storage(NPZ_METADATA)
    .settings(cache=True)
    .gen(emd=process_laion_meta)
)


res = wds_npz.merge(
    wds_with_pq,
    on=[path.file_stem(wds_npz.c("emd.file.path")), "emd.index"],
    right_on=[path.file_stem(wds_with_pq.c("source.file.path")), "source.index"],
    inner=True,
).save("wds")

res.show(5)

Is this acceptable?

@mattseddon
Copy link
Member

I've updated both examples/multimodal/clip_inference.py & examples/multimodal/wds.py in #388 to showcase the new syntax. This update also gives us a performance boost. PTAL.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request priority-p1
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants