Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move the to_timestamp* functions to datafusion-functions #9388

Merged
merged 26 commits into from
Mar 1, 2024

Conversation

Omega359
Copy link
Contributor

@Omega359 Omega359 commented Feb 28, 2024

Which issue does this PR close?

Closes #9291

Rationale for this change

#9285

What changes are included in this PR?

code, tests

Are these changes tested?

Yes

Are there any user-facing changes?

Functions moved to new crate but are enabled by default.

@github-actions github-actions bot added documentation Improvements or additions to documentation logical-expr Logical plan and expressions physical-expr Physical Expressions optimizer Optimizer rules core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Feb 28, 2024
@@ -79,6 +86,43 @@ fn schema() -> DFSchema {
.unwrap()
}

fn test_table_scan() -> LogicalPlan {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests were migrated from the optimizer/simplify_expressions file

];
b.iter(|| {
for i in inputs.iter() {
black_box(to_timestamp(vec![i.clone()]));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The benchmarks had to change to test the public api vs the previous version that tested the internal api.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think testing the public API is actually a more real-world comparison (so this seems like an improvement to me)

mod to_timestamp;

// create UDFs
make_udf_function!(to_timestamp::ToTimestampFunc, TO_TIMESTAMP, to_timestamp);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because to_timestamp functions are nary functions the export_functions! macro could not be used. Instead this is the manual equivalent of what it expands out to.

@Omega359 Omega359 marked this pull request as ready for review February 28, 2024 20:25
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Omega359 -- this is looking pretty sweet

"to_timestamp",
vec![lit("2020-09-08T12:00:00+00:00")],
)?);
let expr = col("ts").eq(to_timestamp(vec![lit("2020-09-08T12:00:00+00:00")]));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

datafusion/core/tests/parquet/page_pruning.rs Outdated Show resolved Hide resolved
datafusion/core/tests/parquet/row_group_pruning.rs Outdated Show resolved Hide resolved
datafusion/core/tests/simplification.rs Show resolved Hide resolved
datafusion/core/tests/sql/explain_analyze.rs Outdated Show resolved Hide resolved
const ERR_NANOSECONDS_NOT_SUPPORTED: &str = "The dates that can be represented as nanoseconds have to be between 1677-09-21T00:12:44.0 and 2262-04-11T23:47:16.854775804";

#[derive(Debug)]
pub(super) struct ToTimestampFunc {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are all so similar, I wonder if we could make a common class to avoid the duplication something like

struct ToTimestampImpl {
  /// target datatype
  data_type: DataType,
}

🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ScalarUDFImpl is different for each 'version' of that struct. It would be possible to have a single struct but then the impl would have to match on the data_type (really in this case I think it might be timeunit). I'll do a test of that tonight to see if it's cleaner overall.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it is required, just a suggestion I thought of while reviewing this code

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I just looked into it and it's actually more than the impl methods because it would affect the make_udf_function macro as well (or a clone) to allow args to be passed into the ::new() call. Possible for a future improvement

use datafusion_expr::ColumnarValue;

use crate::expressions::cast_column;

/// Error message if nanosecond conversion request beyond supported interval
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is great to extract all this code out of datafusion/physical-expr

@@ -483,7 +483,7 @@ statement error Did you mean 'arrow_typeof'?
SELECT arrowtypeof(v1) from test;

# Scalar function
statement error Did you mean 'to_timestamp_seconds'?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 this looks like there may be special case handling for function hinting that doesn't work for user defined functions.

I'll file a ticket to add that feature to user defined functions too

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed #9392

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in #9407

test_evaluate(
(lit("foo").not_eq(lit("foo"))).or(col("c").eq(lit(1))),
// lit(false).or(col("c").eq(lit(1))),
col("c").eq(lit(1)),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment was the version that was in expr_simplifier but that no longer passed the test once the test_evaluate_with_start_time function was changed to use ExprSimplifier from ConstEvaluator. I am unsure if this is in fact correct or not @alamb

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change looks better to me (false OR ...) is always (...).

The reason that the test changed is that previously it was only testing the constant propagation, but now it is also testing the simplifier code as well. I think this looks good

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Omega359 -- this looks just about perfect. There are a few tests I want to revert but I will handle doing so and push changes to this PR

test_evaluate(
(lit("foo").not_eq(lit("foo"))).or(col("c").eq(lit(1))),
// lit(false).or(col("c").eq(lit(1))),
col("c").eq(lit(1)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change looks better to me (false OR ...) is always (...).

The reason that the test changed is that previously it was only testing the constant propagation, but now it is also testing the simplifier code as well. I think this looks good

datafusion/core/tests/simplification.rs Outdated Show resolved Hide resolved
datafusion/core/tests/parquet/row_group_pruning.rs Outdated Show resolved Hide resolved
datafusion/core/tests/parquet/row_group_pruning.rs Outdated Show resolved Hide resolved
datafusion/core/tests/parquet/row_group_pruning.rs Outdated Show resolved Hide resolved
datafusion/core/tests/parquet/row_group_pruning.rs Outdated Show resolved Hide resolved
datafusion/core/tests/parquet/row_group_pruning.rs Outdated Show resolved Hide resolved
datafusion/core/tests/parquet/row_group_pruning.rs Outdated Show resolved Hide resolved
];
b.iter(|| {
for i in inputs.iter() {
black_box(to_timestamp(vec![i.clone()]));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think testing the public API is actually a more real-world comparison (so this seems like an improvement to me)

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am merging up from main to resolve conflicts in this PR now

@alamb
Copy link
Contributor

alamb commented Feb 29, 2024

I am merging up from main to resolve conflicts in this PR now

I tried with this for a bit but ran out of time. @Omega359 I think the issue is that the to_date function got added and needs to be ported as well as its implementation depends on the to_timestamp code.

Is this something you have time to do? If not, I will find time to do it myself.

@Omega359
Copy link
Contributor Author

I tried with this for a bit but ran out of time. @Omega359 I think the issue is that the to_date function got added and needs to be ported as well as its implementation depends on the to_timestamp code.

Is this something you have time to do? If not, I will find time to do it myself.

I can handle it, thanks for all your help. Likely tomorrow sometime.

@alamb
Copy link
Contributor

alamb commented Feb 29, 2024

BTW here is a proposed PR (targeting this PR) to avoid the new dependency: Omega359#1

Move `cast_column` to `ColumnarValue::cast_to`
# Conflicts:
#	datafusion/expr/src/expr_fn.rs
#	datafusion/functions/Cargo.toml
#	datafusion/physical-expr/src/datetime_expressions.rs
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much @Omega359 -- this is epic work

log = "0.4.20"
rand = { workspace = true }
Copy link
Contributor

@alamb alamb Mar 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think several of these dependencies are only used in the tests (so they could be dev depenencies). I'll make a follow on PR clean it up as this one is already really large.

Update I pushed a commit to move then to dev dependencies

@alamb
Copy link
Contributor

alamb commented Mar 1, 2024

I think the only thing left here is to resolve the merge conflicts (which is due to to_date). Otherwise this PR is ready to go. Thanks again @Omega359

@@ -0,0 +1,389 @@
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved functions common to to_timestamp and to_date to here

pub mod expr_fn {
use datafusion_expr::Expr;

/// ```ignore
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's unfortunately but this rustdoc won't work as is because of the SessionContext::new() call is into a non-accessible module. I think we either need to pull examples out of rustdoc or find some other solution that doesn't involve circular dependencies. This is one of the more limiting aspects with rust documentation I've encountered.

Copy link
Contributor

@alamb alamb Mar 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree we probably need to pull them out of the rust doc directly (maybe we can put them in the examples 🤔 ) and maybe link to them from here

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work @Omega359 -- thanks again. Let's merge and iterate in follow on PRs

@alamb alamb merged commit 9e39afd into apache:main Mar 1, 2024
24 checks passed
@Omega359 Omega359 deleted the feature/9291 branch March 1, 2024 18:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate documentation Improvements or additions to documentation logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Move the to_timestamp* functions to datafusion-functions
2 participants