Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Abstract polars function expression nodes to ensure they are serializable #17418

Merged
merged 14 commits into from
Nov 26, 2024

Conversation

pentschev
Copy link
Member

@pentschev pentschev commented Nov 22, 2024

Description

Use Enums to define Python types as references to polars.polars._expr_nodes.*Function as to ensure cudf_polars.dsl.expressions specializations of Expr are serializable.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@pentschev pentschev added feature request New feature or request 3 - Ready for Review Ready for review by team non-breaking Non-breaking change cudf.polars Issues specific to cudf.polars labels Nov 22, 2024
@pentschev pentschev requested a review from a team as a code owner November 22, 2024 15:12
@github-actions github-actions bot added the Python Affects Python cuDF API. label Nov 22, 2024
Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two suggestions

@@ -31,22 +30,49 @@
__all__ = ["BooleanFunction"]


class BooleanFunctionName(Enum):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: use IntEnum.

Also, let's nest this class inside the BooleanFunction class below, and just call it Name.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 3f69d98 .

Comment on lines 52 to 57
@staticmethod
def get_polars_type(tp: BooleanFunctionName):
function, name = str(tp).split(".")
if function != "BooleanFunction":
raise ValueError("BooleanFunction required")
return getattr(BooleanFunctionName, name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this doesn't return a polars object, but one of ours. Also, because we're calling getattr, that suggests we should make this a classmethod

@classmethod
def from_polars(cls, obj: pl_expr.BooleanFunction) -> Self:
    function, name = str(obj).split(".", maxsplit=1)
    if function != "BooleanFunction":
        raise ValueError("BooleanFunction required")
    return getattr(cls, name)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I totally mix everything up here, thanks for catching. Done in be7fa52 .

Comment on lines 49 to 75
if self.name == pl_expr.BooleanFunction.IsIn and not all(
if self.name == BooleanFunctionName.IsIn and not all(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Because we have enums, we should now use is for comparison.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 7d009f9 .

Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Peter, I think this makes sense. Should we add a test that these things are now pickleable, or does that need other bits and pieces?

@pentschev
Copy link
Member Author

Thanks Peter, I think this makes sense. Should we add a test that these things are now pickleable, or does that need other bits and pieces?

Thanks for the review Lawrence. I agree, tests are a good idea. I've made a small improvement in 9b54437 that I discovered could be better while writing some tests in 5a04207. Those tests check that serialization of the Name classes as well as the from_polars method indeed can be done correctly, but they do not check for the full serialization pipeline with Translator. Adding standalone tests for the full Translator pipeline seems significantly more complex, the existing tests seem to all use a GPUEngine for that which is, AFAIU, the only realistic way of using Translator, such cases will then be addressed in #17364 where I'll test the entire suite with the Distributed scheduler which implies serialization of the objects handled in this PR. With that said, do you think addressing the full pipeline in #17364 via the Distributed scheduler is sufficient or do you think adding some sort of mock testing covering the full Translator pipeline with serialization is necessary?

Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some small further suggestions

Comment on lines 17 to 19
@pytest.mark.parametrize(
"function", [BooleanFunction, TemporalFunction, StringFunction]
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's abstract this into a reusable fixture:

@pytest.fixture(params=["BooleanFunction", "StringFunction", "TemporalFunction"])
def function(request):
    return request.param

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, done in fcf820f . I've kept the module references instead of using strings and resolved them in the tests with __name__, let me know if you have a strong preference for strings instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, that's fine.

python/cudf_polars/tests/dsl/test_serialization.py Outdated Show resolved Hide resolved
python/cudf_polars/tests/dsl/test_serialization.py Outdated Show resolved Hide resolved
@pentschev pentschev added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Nov 26, 2024
@wence-
Copy link
Contributor

wence- commented Nov 26, 2024

/merge

@rapids-bot rapids-bot bot merged commit b89728b into rapidsai:branch-25.02 Nov 26, 2024
105 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge cudf.polars Issues specific to cudf.polars feature request New feature or request non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

2 participants