Update pyarrow-related dispatch logic in dask_cudf #14069

Merged Sep 18, 2023 (98 commits)
Changes from 2 commits
19f5174
Merge pull request #4714 from rapidsai/branch-0.13
raydouglass Mar 30, 2020
a2804c3
REL v0.13.0 release
GPUtester Mar 31, 2020
fef2a2b
REL v0.13.0 CHANGELOG Updates
mike-wendt Apr 1, 2020
ab00eb0
Merge pull request #5310 from rapidsai/branch-0.14
raydouglass Jun 3, 2020
b34b838
REL v0.14.0 release
GPUtester Jun 3, 2020
9ff9cdb
update master references
ajschmidt8 Jul 14, 2020
789d19b
REL DOC Updates for main branch switch
mike-wendt Jul 16, 2020
819f514
Merge pull request #6079 from rapidsai/branch-0.15
raydouglass Aug 26, 2020
3a0f214
REL v0.15.0 release
GPUtester Aug 26, 2020
f947393
Merge pull request #6101 from rapidsai/branch-0.15
raydouglass Aug 27, 2020
71cb8c0
REL v0.15.0 release
GPUtester Aug 27, 2020
7ef8174
Merge pull request #6547 from rapidsai/branch-0.16
raydouglass Oct 21, 2020
2b8298f
REL v0.16.0 release
GPUtester Oct 21, 2020
d72b1eb
Merge pull request #6935 from rapidsai/branch-0.17
ajschmidt8 Dec 10, 2020
f56ef85
REL v0.17.0 release
GPUtester Dec 10, 2020
b7e1a85
Merge pull request #7405 from rapidsai/branch-0.18
raydouglass Feb 24, 2021
20778e5
REL v0.18.0 release
GPUtester Feb 24, 2021
042c20f
Merge pull request #7585 from rapidsai/branch-0.18
raydouglass Mar 15, 2021
999be56
REL v0.18.1 release
raydouglass Mar 15, 2021
2391864
Merge pull request #7969 from rapidsai/branch-0.18
raydouglass Apr 15, 2021
3341561
REL v0.18.2 release
raydouglass Apr 15, 2021
6573759
Merge pull request #7626 from rapidsai/branch-0.19
raydouglass Apr 21, 2021
f07b251
REL v0.19.0 release
GPUtester Apr 21, 2021
61e5a20
REL Changelog update
ajschmidt8 Apr 21, 2021
a13e8dc
Merge pull request #8037 from rapidsai/branch-0.19
raydouglass Apr 22, 2021
a9f3453
REL v0.19.1 release
GPUtester Apr 22, 2021
2089fc9
Merge pull request #8100 from rapidsai/branch-0.19
raydouglass Apr 28, 2021
ab3b3f6
REL v0.19.2 release
GPUtester Apr 28, 2021
f9d5e2e
Merge pull request #8418 from rapidsai/branch-21.06
raydouglass Jun 9, 2021
ae44046
REL v21.06.00 release
GPUtester Jun 9, 2021
3b831c3
Merge pull request #8488 from rapidsai/branch-21.06
ajschmidt8 Jun 10, 2021
d56ac1d
Merge pull request #8542 from rapidsai/branch-21.06
raydouglass Jun 17, 2021
cddc64f
REL v21.06.01 release
GPUtester Jun 17, 2021
101fc0f
REL Merge pull request #8544 from rapidsai/branch-21.06
raydouglass Jun 17, 2021
e9dabf8
Merge pull request #8840 from rapidsai/branch-21.08
raydouglass Aug 4, 2021
106039c
REL v21.08.00 release
GPUtester Aug 4, 2021
8055721
Merge pull request #8986 from rapidsai/branch-21.08
raydouglass Aug 6, 2021
e0a8114
REL v21.08.01 release
GPUtester Aug 6, 2021
a7391e6
Merge pull request #8990 from rapidsai/branch-21.08
raydouglass Aug 6, 2021
f6d31fa
REL v21.08.02 release
GPUtester Aug 6, 2021
dff45e5
Merge pull request #9116 from rapidsai/branch-21.08
ajschmidt8 Sep 16, 2021
e4313b6
REL v21.08.03 release
GPUtester Sep 16, 2021
5638329
Merge pull request #9301 from rapidsai/branch-21.10
ajschmidt8 Oct 6, 2021
072fd86
REL v21.10.00 release
GPUtester Oct 6, 2021
8cfb8e5
Merge pull request #9420 from rapidsai/branch-21.10
raydouglass Oct 12, 2021
a1d2d13
REL v21.10.01 release
GPUtester Oct 12, 2021
3ceb0c0
Merge pull request #9689 from rapidsai/branch-21.12
raydouglass Dec 3, 2021
f1ef2d2
REL v21.12.00 release
GPUtester Dec 3, 2021
fd04831
Merge pull request #9880 from rapidsai/branch-21.12
raydouglass Dec 9, 2021
a0a0a3a
REL v21.12.01 release
GPUtester Dec 9, 2021
c74e24f
Merge pull request #9924 from rapidsai/branch-21.12
raydouglass Dec 16, 2021
06540b9
REL v21.12.02 release
GPUtester Dec 16, 2021
f39f559
Merge pull request #10101 from rapidsai/branch-22.02
raydouglass Feb 2, 2022
774d859
REL v22.02.00 release
GPUtester Feb 2, 2022
803c42a
Merge pull request #10512 from rapidsai/branch-22.04
raydouglass Apr 6, 2022
8bf0520
REL v22.04.00 release
GPUtester Apr 6, 2022
0363197
REL Merge pull request #10633 from rapidsai/branch-22.04
raydouglass Apr 11, 2022
89c7736
Merge pull request #10969 from rapidsai/branch-22.06
raydouglass Jun 7, 2022
5658c5b
REL v22.06.00 release
GPUtester Jun 7, 2022
a1fe591
Merge pull request #11208 from rapidsai/branch-22.06
raydouglass Jul 6, 2022
0dab0f8
REL v22.06.01 release
GPUtester Jul 6, 2022
a7f8de5
Merge pull request #11444 from rapidsai/branch-22.08
raydouglass Aug 17, 2022
b71873c
REL v22.08.00 release
GPUtester Aug 17, 2022
aa58765
pin numpy version (#11824)
galipremsagar Sep 29, 2022
78d3655
Merge pull request #11826 from rapidsai/branch-22.08
raydouglass Sep 29, 2022
31337c9
REL v22.08.01 release
GPUtester Sep 29, 2022
b466b6a
Merge pull request #11858 from rapidsai/branch-22.10
raydouglass Oct 12, 2022
8ffe375
REL v22.10.00 release
GPUtester Oct 12, 2022
432fb37
Merge pull request #12061 from rapidsai/branch-22.10
raydouglass Nov 3, 2022
d90f7e9
REL v22.10.01 release
GPUtester Nov 3, 2022
ca9a422
REL Merge pull request #12069 from rapidsai/branch-22.10
raydouglass Nov 4, 2022
a7dcfdf
Merge pull request #12200 from rapidsai/branch-22.12
raydouglass Dec 8, 2022
baae3a6
REL v22.12.00 release
GPUtester Dec 8, 2022
b2dfcdf
Merge pull request #12346 from rapidsai/branch-22.12
raydouglass Dec 8, 2022
f700408
REL v22.12.01 release
GPUtester Dec 8, 2022
93c5b34
Merge pull request #12660 from rapidsai/branch-23.02
raydouglass Feb 9, 2023
d5b59a2
Merge pull request #12746 from rapidsai/branch-23.02
raydouglass Feb 9, 2023
5ad4a85
REL v23.02.00 release
raydouglass Feb 9, 2023
471fa64
Merge pull request #13038 from rapidsai/branch-23.04
raydouglass Apr 12, 2023
cd71208
REL v23.04.00 release
raydouglass Apr 12, 2023
4d31a6f
REL v23.04.00 release
raydouglass Apr 12, 2023
d023acc
Merge pull request #13197 from rapidsai/branch-23.04
raydouglass Apr 21, 2023
7e070fc
REL v23.04.01 release
raydouglass Apr 21, 2023
88cb6db
REL Merge pull request #13280 from rapidsai/branch-23.04
raydouglass May 3, 2023
4548010
Merge remote-tracking branch 'upstream/branch-23.06'
raydouglass Jun 7, 2023
f881d40
REL v23.06.00 release
raydouglass Jun 7, 2023
7d33d20
Merge pull request #13640 from rapidsai/branch-23.06
raydouglass Jun 29, 2023
6a548b0
REL v23.06.01 release
raydouglass Jun 29, 2023
d9589b7
Merge pull request #13781 from rapidsai/branch-23.08
raydouglass Aug 9, 2023
8150d38
REL v23.08.00 release
raydouglass Aug 9, 2023
88b07d5
add preserve_index arg and remove old try/except logic
rjzamora Sep 8, 2023
c5f1339
Merge branch 'branch-23.10' into preserve-index-dispatch
galipremsagar Sep 8, 2023
8d580d4
avoid data movement for pyarrow_schema_dispatch
rjzamora Sep 13, 2023
8cefa9c
Merge remote-tracking branch 'upstream/main' into preserve-index-disp…
rjzamora Sep 13, 2023
329be42
Merge branch 'preserve-index-dispatch' of https://github.com/rjzamora…
rjzamora Sep 13, 2023
3c9499b
Merge remote-tracking branch 'upstream/branch-23.10' into preserve-in…
rjzamora Sep 13, 2023
4e09ac3
reset changelog
rjzamora Sep 13, 2023
67a189f
Merge branch 'branch-23.10' into preserve-index-dispatch
rjzamora Sep 13, 2023
69 changes: 32 additions & 37 deletions python/dask_cudf/dask_cudf/backends.py
@@ -20,11 +20,14 @@
 from dask.dataframe.dispatch import (
     categorical_dtype_dispatch,
     concat_dispatch,
+    from_pyarrow_table_dispatch,
     group_split_dispatch,
     grouper_dispatch,
     hash_object_dispatch,
     is_categorical_dtype_dispatch,
     make_meta_dispatch,
+    pyarrow_schema_dispatch,
+    to_pyarrow_table_dispatch,
     tolist_dispatch,
     union_categoricals_dispatch,
 )
@@ -317,16 +320,6 @@ def get_grouper_cudf(obj):
     return cudf.core.groupby.Grouper


-try:
-    from dask.dataframe.dispatch import pyarrow_schema_dispatch
-
-    @pyarrow_schema_dispatch.register((cudf.DataFrame,))
-    def get_pyarrow_schema_cudf(obj):
-        return obj.to_arrow().schema
-
-except ImportError:
-    pass
-
 try:
     try:
         from dask.array.dispatch import percentile_lookup
@@ -378,35 +371,37 @@ def percentile_cudf(a, q, interpolation="linear"):
 except ImportError:
     pass

-try:
-    # Requires dask>2023.6.0
-    from dask.dataframe.dispatch import (
-        from_pyarrow_table_dispatch,
-        to_pyarrow_table_dispatch,
-    )
-
-    @to_pyarrow_table_dispatch.register(cudf.DataFrame)
-    def _cudf_to_table(obj, preserve_index=True, **kwargs):
-        if kwargs:
-            warnings.warn(
-                "Ignoring the following arguments to "
-                f"`to_pyarrow_table_dispatch`: {list(kwargs)}"
-            )
-        return obj.to_arrow(preserve_index=preserve_index)
-
-    @from_pyarrow_table_dispatch.register(cudf.DataFrame)
-    def _table_to_cudf(obj, table, self_destruct=None, **kwargs):
-        # cudf ignores self_destruct.
-        kwargs.pop("self_destruct", None)
-        if kwargs:
-            warnings.warn(
-                f"Ignoring the following arguments to "
-                f"`from_pyarrow_table_dispatch`: {list(kwargs)}"
-            )
-        return obj.from_arrow(table)
+@pyarrow_schema_dispatch.register((cudf.DataFrame,))
+def _get_pyarrow_schema_cudf(obj, preserve_index=True, **kwargs):
+    if kwargs:
+        warnings.warn(
+            "Ignoring the following arguments to "
+            f"`pyarrow_schema_dispatch`: {list(kwargs)}"
+        )
+    return obj.to_arrow(preserve_index=preserve_index).schema
Contributor:

thought (non-blocking): If dask asks for the schema separately from the table, it would be good to find a way to provide a pyarrow schema for a cudf DataFrame without copying the full frame to host. Ideally, producing this metadata should be O(1) rather than, as it is now, O(size of frame).

Member Author:

Yeah, good call. The specific logic in _get_pyarrow_schema_cudf wasn't really part of this PR, but that's an excellent point. It definitely feels like we should be able to produce a schema without moving all the data!

Member Author:

Using meta_nonempty(obj) seems to do the trick here.
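
The idea in this thread can be sketched without a GPU: a schema depends only on column names and dtypes, so a tiny frame with the same names and dtypes yields the same schema while copying a constant number of rows. The sketch below is only an illustration under stated assumptions. `ToyFrame`, `tiny_like`, `to_host`, and the tuple-based schema are hypothetical stand-ins, not cudf, dask, or pyarrow APIs; `tiny_like` plays a role analogous in spirit to `meta_nonempty`.

```python
from typing import Dict, List, Tuple

# Toy schema: ordered (column name, dtype name) pairs (an assumption of this sketch).
Schema = Tuple[Tuple[str, str], ...]

# Placeholder value per dtype, used to fabricate a one-row frame.
_PLACEHOLDERS = {"float64": 1.0, "int64": 1, "string": "a"}


class ToyFrame:
    """Hypothetical stand-in for a device-resident DataFrame (not cudf)."""

    def __init__(self, dtypes: Dict[str, str], data: Dict[str, List]):
        self.dtypes = dict(dtypes)
        self.data = data

    def __len__(self) -> int:
        first = next(iter(self.data.values()), [])
        return len(first)

    def to_host(self) -> Tuple[Schema, int]:
        """Simulate a full copy to host; returns (schema, rows copied)."""
        return tuple(self.dtypes.items()), len(self)


def tiny_like(frame: ToyFrame) -> ToyFrame:
    """Build a one-row frame with the same column names and dtypes."""
    data = {name: [_PLACEHOLDERS[dt]] for name, dt in frame.dtypes.items()}
    return ToyFrame(frame.dtypes, data)


def schema_via_full_copy(frame: ToyFrame) -> Tuple[Schema, int]:
    # Cost grows with the number of rows in the frame.
    return frame.to_host()


def schema_via_tiny_frame(frame: ToyFrame) -> Tuple[Schema, int]:
    # Cost is independent of the number of rows in the original frame.
    return tiny_like(frame).to_host()


big = ToyFrame(
    {"a": "float64", "d": "string"},
    {"a": [0.5] * 1000, "d": ["cat", "dog"] * 500},
)
full_schema, full_rows = schema_via_full_copy(big)
tiny_schema, tiny_rows = schema_via_tiny_frame(big)
```

Both paths produce the same toy schema; only the number of rows "copied" differs, which is the cost the reviewer's O(1) remark is about.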


-except ImportError:
-    pass

+@to_pyarrow_table_dispatch.register(cudf.DataFrame)
+def _cudf_to_table(obj, preserve_index=True, **kwargs):
+    if kwargs:
+        warnings.warn(
+            "Ignoring the following arguments to "
+            f"`to_pyarrow_table_dispatch`: {list(kwargs)}"
+        )
+    return obj.to_arrow(preserve_index=preserve_index)
+
+
+@from_pyarrow_table_dispatch.register(cudf.DataFrame)
+def _table_to_cudf(obj, table, self_destruct=None, **kwargs):
+    # cudf ignores self_destruct.
+    kwargs.pop("self_destruct", None)
+    if kwargs:
+        warnings.warn(
+            f"Ignoring the following arguments to "
+            f"`from_pyarrow_table_dispatch`: {list(kwargs)}"
+        )
+    return obj.from_arrow(table)


 @union_categoricals_dispatch.register((cudf.Series, cudf.BaseIndex))
21 changes: 15 additions & 6 deletions python/dask_cudf/dask_cudf/tests/test_dispatch.py
@@ -3,9 +3,7 @@
 import numpy as np
 import pandas as pd
 import pytest
-from packaging import version

-import dask
 from dask.base import tokenize
 from dask.dataframe import assert_eq
 from dask.dataframe.methods import is_categorical_dtype
@@ -24,10 +22,6 @@ def test_is_categorical_dispatch():
     assert is_categorical_dtype(cudf.Index([1, 2, 3], dtype="category"))


-@pytest.mark.skipif(
-    version.parse(dask.__version__) <= version.parse("2023.6.0"),
-    reason="Pyarrow-conversion dispatch requires dask>2023.6.0",
-)
 def test_pyarrow_conversion_dispatch():
     from dask.dataframe.dispatch import (
         from_pyarrow_table_dispatch,
@@ -79,3 +73,18 @@ def test_deterministic_tokenize(index):
     df2 = df.set_index(["B", "C"], drop=False)
     assert tokenize(df) != tokenize(df2)
     assert tokenize(df2) == tokenize(df2)
+
+
+@pytest.mark.parametrize("preserve_index", [True, False])
+def test_pyarrow_schema_dispatch(preserve_index):
+    from dask.dataframe.dispatch import (
+        pyarrow_schema_dispatch,
+        to_pyarrow_table_dispatch,
+    )
+
+    df = cudf.DataFrame(np.random.randn(10, 3), columns=list("abc"))
+    df["d"] = cudf.Series(["cat", "dog"] * 5)
+    table = to_pyarrow_table_dispatch(df, preserve_index=preserve_index)
+    schema = pyarrow_schema_dispatch(df, preserve_index=preserve_index)
+
+    assert schema.equals(table.schema)
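
The functions registered in backends.py share one pattern: accept `preserve_index`, then warn about and ignore any other keyword arguments. The sketch below reproduces that pattern with the standard library only, as an illustration rather than the PR's implementation. It swaps in `functools.singledispatch` for dask's dispatch objects, and `ToyFrame`, `to_table`, and the `__index__` key are hypothetical names invented for this example.

```python
import warnings
from functools import singledispatch


class ToyFrame:
    """Hypothetical frame type used only for this illustration."""

    def __init__(self, columns, index):
        self.columns = columns  # dict: column name -> list of values
        self.index = index      # list of row labels


@singledispatch
def to_table(obj, preserve_index=True, **kwargs):
    # Fallback when no backend has registered an implementation.
    raise TypeError(f"No to_table implementation for {type(obj).__name__}")


@to_table.register(ToyFrame)
def _toy_to_table(obj, preserve_index=True, **kwargs):
    # Unsupported options are reported and ignored, not treated as errors.
    if kwargs:
        warnings.warn(
            "Ignoring the following arguments to "
            f"`to_table`: {list(kwargs)}"
        )
    table = dict(obj.columns)
    if preserve_index:
        # Hypothetical convention: store the index as an extra column.
        table["__index__"] = list(obj.index)
    return table


frame = ToyFrame({"a": [1.0, 2.0]}, index=[10, 11])
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    with_index = to_table(frame)
    without_index = to_table(frame, preserve_index=False, self_destruct=True)
```

Warning instead of raising lets a caller pass backend-specific options, such as `self_destruct`, through a shared entry point without breaking backends that do not support them.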