Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: DataFrameTreeReduction is no longer part of dask #566

Merged
merged 6 commits into from
Jan 27, 2025

Conversation

lgray
Copy link
Collaborator

@lgray lgray commented Jan 18, 2025

This class was removed from dask itself, and dask-awkward is presently broken with latest dask because of that.

@lgray
Copy link
Collaborator Author

lgray commented Jan 18, 2025

@douglasdavis @martindurant now that the legacy dask dataframe is no longer there, this causes a bunch of tests to fail as well, and forces dependency on dask-expr.

@martindurant
Copy link
Collaborator

Is the following enough for those failures:

from dask.dataframe.core import DataFrame
-> 
from dask.dataframe import DataFrame

?

@ikrommyd
Copy link
Contributor

Is the following enough for those failures:

from dask.dataframe.core import DataFrame
-> 
from dask.dataframe import DataFrame

?

It should be. This import was removed recently and works fine for me locally

Python 3.12.7 | packaged by conda-forge | (main, Oct  4 2024, 15:57:01) [Clang 17.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from dask.dataframe.core import DataFrame
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'DataFrame' from 'dask.dataframe.core' (/Users/iason/miniforge3/envs/egamma-dev/lib/python3.12/site-packages/dask/dataframe/core.py)
>>> from dask.dataframe import DataFrame
>>> exit()

However, please make sure that from dask.dataframe import DataFrame is compatible with older versions of dask/distributed. We typically run on clusters that don't have the latest versions available.

@martindurant
Copy link
Collaborator

However, please make sure that from dask.dataframe import DataFrame is compatible with older versions of dask/distributed.

Yes, it has been this way a long time.

@martindurant
Copy link
Collaborator

from dask.dataframe import DataFrame

On further reading, this is no longer the same DataFrame as before at all ("DataFrame-like Expr Collection."). There is still a function, dask.dataframe.from_graph that supports making a thing, that might be what we need. What a mess!

(I guess akimbo.dask is broken too)

@martindurant
Copy link
Collaborator

@lgray , let me know when you are working on this. I suppose dask-histogram is broken too; but the copy of tree-reduction should live here, so I'll wait for this PR and then fix the other.

@martindurant
Copy link
Collaborator

@lgray , the following gets us some of the way there: itsuccessfully produces output, but it's totally wrong (you get a pd.Series with one row containing all the data as lists). I think it's materialising the graph to low-level in dask.dataframe.dask_expr.io.ioFromGraph._layer.

--- a/src/dask_awkward/lib/io/io.py
+++ b/src/dask_awkward/lib/io/io.py
@@ -42,7 +42,7 @@ from dask_awkward.utils import first, second
 if TYPE_CHECKING:
     from dask.array.core import Array as DaskArray
     from dask.bag.core import Bag as DaskBag
-    from dask.dataframe.core import DataFrame as DaskDataFrame
+    from dask.dataframe import DataFrame as DaskDataFrame
     from dask.delayed import Delayed
     from fsspec.spec import AbstractFileSystem

@@ -466,8 +466,8 @@ def to_dataframe(

     """
     import dask
-    from dask.dataframe.core import DataFrame as DaskDataFrame
-    from dask.dataframe.core import new_dd_object
+    from dask.dataframe import DataFrame as DaskDataFrame
+    from dask.dataframe import from_graph as new_dd_object

     if optimize_graph:
         (array,) = dask.optimize(array)
@@ -479,13 +479,15 @@ def to_dataframe(
         **kwargs,
     )
     meta = ak.to_dataframe(length_zero_if_typetracer(array._meta), **kwargs)
+    # need _parameters = ["layer", "_meta", "divisions", "keys", "name_prefix"]
     return cast(
         DaskDataFrame,
         new_dd_object(
             intermediate.dask,
-            intermediate.name,
-            meta,
-            intermediate.divisions,
+            name_prefix=intermediate.name,
+            _meta=meta,
+            divisions=intermediate.divisions,
+            keys=list(intermediate.dask)  # WHY MATERIALIZE???
         ),
     )

@lgray
Copy link
Collaborator Author

lgray commented Jan 23, 2025

Waiting on this and then fixing dask-histogram makes sense. I put an version cap on dask in coffea to keep things working for people in the mean time.

@henryiii
Copy link

henryiii commented Jan 23, 2025

Ran into this with Hist. It's quite a mess, as limiting Dask forces limiting numpy<2, which then forces Python <3.13. Finally got something working with uv run -p 3.12 --with 'numpy~=1.26.0'.

@martindurant
Copy link
Collaborator

You should only need to go back one release of dask, which definitely did work with numpy 2. Perhaps uv is not doing a good job of resolving this.

henryiii added a commit to scikit-hep/hist that referenced this pull request Jan 23, 2025
See dask-contrib/dask-awkward#566. This needs to
be fixed, limiting is temporary!
@henryiii
Copy link

Ah, okay, clearing the lock file did get it to get the latest 2024, which works.

@martindurant
Copy link
Collaborator

In the dask-awkward meeting, we decided that we should get a fix out for the core functionality here and in dask-histogram, and not worry too much about to_dataframe. In that method, we can do a version check and run the old code if it's still available, or raise NotImplemented("This only works with dask version<2025"). With a TODO to actually fix this later.

OK with you, @lgray ?

@lgray
Copy link
Collaborator Author

lgray commented Jan 24, 2025

That's fine by me - we don't do much dak < - > dd interop in the first place.

@martindurant
Copy link
Collaborator

Recommendation here to use from_map dask/dask#11678 (comment) , but I don't think that will work for us.

I pushed the changes, should pass all except the one test that imports dask-histogram. Assuming all is well, I will merge and try to get around to fixing that too and releasing by Monday.

@martindurant
Copy link
Collaborator

mindeps and runs with old enough python are passing, the remaining have only the one expected failure.

@martindurant martindurant merged commit c3c2c4e into main Jan 27, 2025
10 of 25 checks passed
@martindurant martindurant deleted the tree-reduction-fix branch January 27, 2025 14:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants