-
Notifications
You must be signed in to change notification settings - Fork 183
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support "union by name" #3855
Comments
Could you try using a join or outer join on all common common columns? We will also include this feature! Helper def union_by_name(df1: DataFrame, df2: DataFrame, allow_missing_columns: bool) -> DataFrame:
l_cols = set(df1.column_names)
r_cols = set(df2.column_names)
on = [col(c) for c in l_cols.intersection(r_cols)]
how = "outer" if allow_missing_columns else "inner"
return df1.join(df2, left_on=on, right_on=on, how=how) Usage import daft
df1 = daft.from_pydict({"a": ["x", "y", "S"], "b": [1, 2, 3]})
df2 = daft.from_pydict({"a": ["x", "y", "T"]})
union_by_name(df1, df2, allow_missing_columns=True).show()
"""
╭──────┬───────╮
│ a ┆ b │
│ --- ┆ --- │
│ Utf8 ┆ Int64 │
╞══════╪═══════╡
│ x ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ y ┆ 2 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ S ┆ 3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ T ┆ None │
╰──────┴───────╯
(Showing first 4 of 4 rows)
""" |
to avoid a join, would something like this work? not sure if this works for all complex types? def _concat_by_name(df1: daft.DataFrame, df2: daft.DataFrame) -> daft.DataFrame:
df1_cols = set((x.name, x.dtype) for x in df1.schema())
df2_cols = set((x.name, x.dtype) for x in df2.schema())
all_cols = [name for name, _ in df1_cols.union(df2_cols)]
return (
df1.with_columns(
{name: daft.lit(None).cast(dt) for name, dt in df2_cols - df1_cols}
).select(*all_cols)
).concat(
df2.with_columns(
{name: daft.lit(None).cast(dt) for name, dt in df1_cols - df2_cols}
).select(*all_cols)
) |
I am not certain that either of these are exactly equivalent, but the JOIN is like a UNION DISTINCT and the concat like a UNION ALL. You could also do a concat + filter distinct. |
For my use case, I want a UNION ALL (no dedupe) and I believe this is what Spark's impl does as well. |
Is your feature request related to a problem?
Enable concatenating dataframes with different schemas by applying a union based on column names, filling with nulls for missing columns in each of the respective dataframes being combined.
Describe the solution you'd like
The equivalent of Spark's unionByName in Daft
Describe alternatives you've considered
Applying custom logic on my side to replicate unionByName
Additional Context
No response
Would you like to implement a fix?
No
The text was updated successfully, but these errors were encountered: