Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

default impl for WindowUDFImpl::expressions should use all input expressions #13168

Closed
Michael-J-Ward opened this issue Oct 29, 2024 · 1 comment · Fixed by #13169
Closed
Labels
enhancement New feature or request

Comments

@Michael-J-Ward
Copy link
Contributor

Is your feature request related to a problem or challenge?

#12857 added the expressions method to the WindowUDFImpl trait.

The default impl currently only takes the first input_expr and discards the rest.

fn expressions(&self, expr_args: ExpressionArgs) -> Vec<Arc<dyn PhysicalExpr>> {
expr_args
.input_exprs()
.first()
.map_or(vec![], |expr| vec![Arc::clone(expr)])
}

This altered behavior caused a regression in a two column Window UDF used in datafusion-python tests. See this comment detailing the error and this commit that fixes it.

Describe the solution you'd like

The default behavior should just return all the input expressions.

Describe alternatives you've considered

I didn't find any justification for only taking the first input_expr as the default behavior, so probably I'm missing something.

@jcsherin - please let me know if there's a reason that I'm missing for this behavior.

Additional context

No response

@jcsherin
Copy link
Contributor

In lead/lag user-defined window functions which accepts 1-3 arguments, the shift_offset (2nd argument) and default_value (3rd argument) are saved as fields of WindowShiftEvaluator struct which implements PartitionEvaluator.

struct WindowShiftEvaluator {
shift_offset: i64,
default_value: ScalarValue,
ignore_nulls: bool,
// VecDeque contains offset values that between non-null entries
non_null_offsets: VecDeque<usize>,
}

The arguments are parsed once and cached when the partition evaluator executes. So this allows correct operation even though expressions() returns only the first input expression to the partition evaluator.

Now based on your example downstream I see that this API is overly restrictive. I agree that the default behavior needs to be where expressions() returns all the input expressions.

Apologies for the time you spent investigating this issue 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants