Introduce @ operator for File path manipulations #299

volkfox · 2024-08-14T19:40:50Z

Description

There are a number of common File path operations, e.g. {stem, dir, basename, ext}, which are poorly supported.

For example, if we need to merge two datasets "images" and "meta" by the file stem, we currently require many manipulations to create and remove merge keys – which is not ideal:

from datachain.sql.functions import path

# Create temporary merge keys from the file stem on both sides
images = images.mutate(image_stem=path.stem("file.path"))
meta = meta.mutate(meta_stem=path.stem("file.path"))

annotated_images = images.merge(meta, on="image_stem", right_on="meta_stem")

# Drop the temporary merge keys again
annotated_images.select_except(["image_stem", "meta_stem"])

A preferred way is to create path-related functions on the fly:

annotated = images.merge(meta, on="file.path@stem", right_on="meta_file.path@stem")


dmpetrov · 2024-08-14T20:03:33Z

That's an amazing idea! We definitely need to simplify usage of the vectorized operations.

annotated = images.merge(meta, on="file.path@stem", right_on="meta_file.path@stem")

There are two possible approaches:

1. Apply general functions to columns: on="file.path@stem" (applies the built-in stem() to columns).
2. Introduce special methods in data-model classes: on="file@stem" (where stem() is an object method, a special method that does the translation). @shcheklein was sharing similar ideas in the latest night-sync-up meeting.

(1) seems straightforward and we should start with it, while (2) has bigger potential, since users could define custom vectorized operations, but it requires additional research.
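
As a rough illustration of approach (1), the sketch below splits a spec such as "file.path@stem" into a column name and a function name, then looks the function up in a small registry. Everything here is hypothetical: parse_spec, apply_spec and PATH_FUNCTIONS are made-up names, and the functions work on plain Python strings rather than on DataChain columns or SQL expressions.

```python
import posixpath

# Hypothetical registry mapping the name after "@" to a plain-Python path function.
PATH_FUNCTIONS = {
    "stem": lambda p: posixpath.splitext(posixpath.basename(p))[0],
    "ext": lambda p: posixpath.splitext(p)[1].lstrip("."),
    "basename": posixpath.basename,
    "dir": posixpath.dirname,
}


def parse_spec(spec):
    """Split "column@function" into (column, function_name).

    A spec without "@" yields (spec, None).
    """
    column, sep, func_name = spec.partition("@")
    if not sep:
        return spec, None
    if func_name not in PATH_FUNCTIONS:
        raise ValueError(f"unknown path function: {func_name!r}")
    return column, func_name


def apply_spec(row, spec):
    """Evaluate a spec against a row given as a dict of column name -> value."""
    column, func_name = parse_spec(spec)
    value = row[column]
    return value if func_name is None else PATH_FUNCTIONS[func_name](value)


row = {"file.path": "images/cats/0001.jpg"}
print(apply_spec(row, "file.path@stem"))  # 0001
print(apply_spec(row, "file.path@dir"))   # images/cats
```

A real implementation would presumably map each name to a SQL function instead of a Python callable, so that the operation stays vectorized.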

mattseddon · 2024-08-22T00:57:29Z

Is there a reason some of the DataChain methods accept strings (e.g. merge) and others accept Columns (e.g. filter)?

If merge accepted Columns in the same way that filter does then the syntax could be as follows:

images.merge(meta, on=path.file_stem(C("file.path")), right_on=path.file_stem(C("meta_file.path")))

Using this approach we wouldn't have to map our custom SQL functions to text identifiers and/or search for @ symbols in the provided strings.

We could also write a wrapper function that takes a string and returns the required SQL function, i.e.:

def stem(column_name):
    # Wrap the named column so callers can pass a plain string
    return path.file_stem(C(column_name))

which would make the final merge syntax something like:

images.merge(meta, on=stem("file.path"), right_on=stem("meta_file.path"))

Using columns though would mean that users can join using any SQL function available through SQLAlchemy.
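
To illustrate what joining on an expression means at the SQL level, here is a minimal sketch using Python's built-in sqlite3 module. It is not DataChain's implementation; the table names, the sample rows and the file_stem user-defined function registered here are assumptions made for the example.

```python
import posixpath
import sqlite3


def file_stem(path):
    """Return the file name without its directory or final extension."""
    return posixpath.splitext(posixpath.basename(path))[0]


conn = sqlite3.connect(":memory:")
# Register the Python function so it can be called from SQL.
conn.create_function("file_stem", 1, file_stem)

conn.execute("CREATE TABLE images (path TEXT)")
conn.execute("CREATE TABLE meta (path TEXT, label TEXT)")
conn.executemany(
    "INSERT INTO images VALUES (?)",
    [("img/0001.jpg",), ("img/0002.jpg",)],
)
conn.executemany(
    "INSERT INTO meta VALUES (?, ?)",
    [("meta/0001.json", "cat"), ("meta/0003.json", "dog")],
)

# The join condition is an expression on both sides, so no temporary
# key columns have to be created and dropped afterwards.
rows = conn.execute(
    """
    SELECT images.path, meta.path, meta.label
    FROM images
    JOIN meta ON file_stem(images.path) = file_stem(meta.path)
    """
).fetchall()

print(rows)  # [('img/0001.jpg', 'meta/0001.json', 'cat')]
```

Only the pair sharing the stem "0001" survives the inner join, which is the behaviour the merge-on-expression syntax above is after.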

mattseddon · 2024-08-22T00:59:31Z

or do we want to move all functions away from the C syntax?

dmpetrov · 2024-09-02T18:45:02Z

It's the best we can do as the 1st step. Later, we will figure out if anything else is needed or this is enough.

mattseddon · 2024-09-04T02:01:42Z

I've still got a few things to work through but this is how I think the wds example will end up looking:

wds_images = (
    DataChain.from_storage(IMAGE_TARS)
    .settings(cache=True)
    .gen(laion=process_webdataset(spec=WDSLaion), params="file")
)

wds_with_pq = (
    DataChain.from_parquet(PARQUET_METADATA)
    .settings(cache=True)
    .merge(wds_images, on="uid", right_on="laion.json.uid", inner=True)
)

wds_npz = (
    DataChain.from_storage(NPZ_METADATA)
    .settings(cache=True)
    .gen(emd=process_laion_meta)
)


res = wds_npz.merge(
    wds_with_pq,
    on=[path.file_stem(wds_npz.c("emd.file.path")), "emd.index"],
    right_on=[path.file_stem(wds_with_pq.c("source.file.path")), "source.index"],
    inner=True,
).save("wds")

res.show(5)

Is this acceptable?

mattseddon · 2024-09-04T05:07:48Z

I've updated both examples/multimodal/clip_inference.py & examples/multimodal/wds.py in #388 to showcase the new syntax. This update also gives us a performance boost. PTAL.

volkfox added the enhancement (New feature or request) and priority-p2 labels Aug 14, 2024

dmpetrov added priority-p1 and removed priority-p2 labels Aug 14, 2024

mattseddon self-assigned this Sep 3, 2024

mattseddon mentioned this issue Sep 4, 2024

allow merge on expressions #388

Merged

mattseddon closed this as completed in #388 Sep 9, 2024
