Support "union by name" #3855

gmweaver · 2025-02-24T22:28:01Z

Is your feature request related to a problem?

Enable concatenating dataframes with different schemas by applying a union based on column names, filling with nulls for missing columns in each of the respective dataframes being combined.

Describe the solution you'd like

The equivalent of Spark's unionByName in Daft

Describe alternatives you've considered

Applying custom logic on my side to replicate unionByName

Additional Context

No response

Would you like to implement a fix?

No

rchowell · 2025-02-24T23:05:44Z

Could you try using a join or outer join on all common common columns? We will also include this feature!

Helper

def union_by_name(df1: DataFrame, df2: DataFrame, allow_missing_columns: bool) -> DataFrame:
    l_cols = set(df1.column_names)
    r_cols = set(df2.column_names)
    on = [col(c) for c in l_cols.intersection(r_cols)]
    how = "outer" if allow_missing_columns else "inner"
    return df1.join(df2, left_on=on, right_on=on, how=how)

Usage

import daft

df1 = daft.from_pydict({"a": ["x", "y", "S"], "b": [1, 2, 3]})
df2 = daft.from_pydict({"a": ["x", "y", "T"]})

union_by_name(df1, df2, allow_missing_columns=True).show()
"""
╭──────┬───────╮
│ a    ┆ b     │
│ ---  ┆ ---   │
│ Utf8 ┆ Int64 │
╞══════╪═══════╡
│ x    ┆ 1     │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ y    ┆ 2     │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ S    ┆ 3     │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ T    ┆ None  │
╰──────┴───────╯

(Showing first 4 of 4 rows)
"""

gmweaver · 2025-02-24T23:12:34Z

to avoid a join, would something like this work? not sure if this works for all complex types?

def _concat_by_name(df1: daft.DataFrame, df2: daft.DataFrame) -> daft.DataFrame:
    df1_cols = set((x.name, x.dtype) for x in df1.schema())
    df2_cols = set((x.name, x.dtype) for x in df2.schema())
    all_cols = [name for name, _ in df1_cols.union(df2_cols)]

    return (
        df1.with_columns(
            {name: daft.lit(None).cast(dt) for name, dt in df2_cols - df1_cols}
        ).select(*all_cols)
    ).concat(
        df2.with_columns(
            {name: daft.lit(None).cast(dt) for name, dt in df1_cols - df2_cols}
        ).select(*all_cols)
    )

rchowell · 2025-02-24T23:46:40Z

I am not certain that either of these are exactly equivalent, but the JOIN is like a UNION DISTINCT and the concat like a UNION ALL. You could also do a concat + filter distinct.

gmweaver · 2025-02-25T16:24:01Z

For my use case, I want a UNION ALL (no dedupe) and I believe this is what Spark's impl does as well.

gmweaver added enhancement New feature or request needs triage labels Feb 24, 2025

rchowell added p1 Important to tackle soon, but preemptable by p0 feat and removed needs triage labels Feb 24, 2025

rchowell self-assigned this Feb 25, 2025

rchowell assigned universalmind303 and unassigned rchowell Feb 25, 2025

universalmind303 linked a pull request Feb 27, 2025 that will close this issue

feat: union ops #3872

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support "union by name" #3855

Support "union by name" #3855

gmweaver commented Feb 24, 2025

rchowell commented Feb 24, 2025 •

edited

Loading

gmweaver commented Feb 24, 2025 •

edited

Loading

rchowell commented Feb 24, 2025

gmweaver commented Feb 25, 2025

Support "union by name" #3855

Support "union by name" #3855

Comments

gmweaver commented Feb 24, 2025

Is your feature request related to a problem?

Describe the solution you'd like

Describe alternatives you've considered

Additional Context

Would you like to implement a fix?

rchowell commented Feb 24, 2025 • edited Loading

gmweaver commented Feb 24, 2025 • edited Loading

rchowell commented Feb 24, 2025

gmweaver commented Feb 25, 2025

rchowell commented Feb 24, 2025 •

edited

Loading

gmweaver commented Feb 24, 2025 •

edited

Loading