Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support "union by name" #3855

Open
gmweaver opened this issue Feb 24, 2025 · 4 comments · May be fixed by #3872
Open

Support "union by name" #3855

gmweaver opened this issue Feb 24, 2025 · 4 comments · May be fixed by #3872
Assignees
Labels
enhancement New feature or request feat p1 Important to tackle soon, but preemptable by p0

Comments

@gmweaver
Copy link
Contributor

Is your feature request related to a problem?

Enable concatenating dataframes with different schemas by applying a union based on column names, filling with nulls for missing columns in each of the respective dataframes being combined.

Describe the solution you'd like

The equivalent of Spark's unionByName in Daft

Describe alternatives you've considered

Applying custom logic on my side to replicate unionByName

Additional Context

No response

Would you like to implement a fix?

No

@gmweaver gmweaver added enhancement New feature or request needs triage labels Feb 24, 2025
@rchowell rchowell added p1 Important to tackle soon, but preemptable by p0 feat and removed needs triage labels Feb 24, 2025
@rchowell
Copy link
Contributor

rchowell commented Feb 24, 2025

Could you try using a join or outer join on all common common columns? We will also include this feature!

Helper

def union_by_name(df1: DataFrame, df2: DataFrame, allow_missing_columns: bool) -> DataFrame:
    l_cols = set(df1.column_names)
    r_cols = set(df2.column_names)
    on = [col(c) for c in l_cols.intersection(r_cols)]
    how = "outer" if allow_missing_columns else "inner"
    return df1.join(df2, left_on=on, right_on=on, how=how)

Usage

import daft

df1 = daft.from_pydict({"a": ["x", "y", "S"], "b": [1, 2, 3]})
df2 = daft.from_pydict({"a": ["x", "y", "T"]})

union_by_name(df1, df2, allow_missing_columns=True).show()
"""
╭──────┬───────╮
│ a    ┆ b     │
│ ---  ┆ ---   │
│ Utf8 ┆ Int64 │
╞══════╪═══════╡
│ x    ┆ 1     │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ y    ┆ 2     │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ S    ┆ 3     │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ T    ┆ None  │
╰──────┴───────╯

(Showing first 4 of 4 rows)
"""

@gmweaver
Copy link
Contributor Author

gmweaver commented Feb 24, 2025

to avoid a join, would something like this work? not sure if this works for all complex types?

def _concat_by_name(df1: daft.DataFrame, df2: daft.DataFrame) -> daft.DataFrame:
    df1_cols = set((x.name, x.dtype) for x in df1.schema())
    df2_cols = set((x.name, x.dtype) for x in df2.schema())
    all_cols = [name for name, _ in df1_cols.union(df2_cols)]

    return (
        df1.with_columns(
            {name: daft.lit(None).cast(dt) for name, dt in df2_cols - df1_cols}
        ).select(*all_cols)
    ).concat(
        df2.with_columns(
            {name: daft.lit(None).cast(dt) for name, dt in df1_cols - df2_cols}
        ).select(*all_cols)
    )

@rchowell
Copy link
Contributor

I am not certain that either of these are exactly equivalent, but the JOIN is like a UNION DISTINCT and the concat like a UNION ALL. You could also do a concat + filter distinct.

@rchowell rchowell self-assigned this Feb 25, 2025
@gmweaver
Copy link
Contributor Author

For my use case, I want a UNION ALL (no dedupe) and I believe this is what Spark's impl does as well.

@universalmind303 universalmind303 linked a pull request Feb 27, 2025 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request feat p1 Important to tackle soon, but preemptable by p0
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants