Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Epic] Port BuiltInFunctons to datafusion-functions-* crates #9285

Closed
17 of 43 tasks
alamb opened this issue Feb 20, 2024 · 41 comments · Fixed by #10098
Closed
17 of 43 tasks

[Epic] Port BuiltInFunctons to datafusion-functions-* crates #9285

alamb opened this issue Feb 20, 2024 · 41 comments · Fixed by #10098
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Feb 20, 2024

Is your feature request related to a problem or challenge?

As part of making DataFusion even more customizable (#8045), it is valuable to let system designers mix and match different packages of functions to get the precise behavior they want (e.g. postgres style to_date or spark style to_date).

To support this functionality as well as to ensure the ScalarUDF API exposes the full power of DataFusion, we are in the process of extracting the "built in" functions out of the core and into separate crates.

This epic tracks the work to actually move the functions out of the core datafusion crate (spread through datafusion_expr and datafusion-physical-expr and into the new datafusion-functions / datafusion-functions-array crates

Tasks:

Here is list of many of the items necessary to complete this transition. Eventually there should be tickets for all tasks, and there are tickets for some already, but I don't want to make 100s of tickets all at once. I plan to make more as we make it through more of this project.

Anyone should feel free to make other tickets if they want to help with items below.

math_expressions

These should be located in the datafusion-functions crate (source link)
Code location: https://github.com/apache/arrow-datafusion/blob/main/datafusion/functions/src/math/mod.rs

array_expressions

Note that given the size and specialization of these functions are put in their own subcrate, datafusion-functions-array

Core functions

These should be located in the datafusion-functions crate (source link)
Code location: https://github.com/apache/arrow-datafusion/blob/main/datafusion/functions/src/core/mod.rs

crypto_expressions

These should be located in the datafusion-functions crate (source link)
Code location: https://github.com/apache/arrow-datafusion/blob/main/datafusion/functions/src/crypto/mod.rs

  • Create crypto module in datafusion/functions/src/crypto and crypto_expressions feature flag, move digest function
  • Digest, MD5, SHA224, SHA256, SHA384, SHA512

string_expressions

These should be located in the datafusion-functions crate (source link)
Code location: https://github.com/apache/arrow-datafusion/blob/main/datafusion/functions/src/string/mod.rs

  • concat, concat_ws, ends_with, initcap Move take concat, concat_ws, ends_with, initcap, to datafusion-functions #9540
  • Create string module in datafusion/functions/src/string and string_expressions feature flag, move ascii function
  • ascii, bit_length, btrim, chr,
  • instr, lower, ltrim, octet_length,
  • repeat, replace, rtrim, split_part,
  • starts_with, to_hex, trim, upper,
  • levenshtein, uuid, overlay

unicode_expressions

These should be located in the datafusion-functions crate (source link)
Code location: https://github.com/apache/arrow-datafusion/blob/main/datafusion/functions/src/unicode/mod.rs

  • Create unicode module in datafusion/functions/src/unicode and unicode_expressions feature flag, move charlength function
  • CharLength,
  • Left, Lpad, Reverse, Right, Rpad,
  • Strpos, Substr,
  • Translate, SubstrIndex, FindInSet

regex_expressions

These should be located in the datafusion-functions crate (source link)
Code location: https://github.com/apache/arrow-datafusion/blob/main/datafusion/functions/src/regexp/mod.rs

datetime_expressions

These should be located in the datafusion-functions crate (source link)
Code location: https://github.com/apache/arrow-datafusion/blob/main/datafusion/functions/src/datetime/mod.rs

  • Create datetime module in datafusion/functions/src/datetime and datetime_expressions feature flag, move date_part
  • Move the to_timestamp* functions to datafusion-functions #9291
  • port benchmarks to datafusion-functions crate
  • date_part, date_trunc, date_bin,
  • to_timestamp, to_timestamp_millis, to_timestamp_micros, to_timestamp_nanos, to_timestamp_seconds,
  • from_unixtime, now, current_date, current_time

Infrastructure

Describe alternatives you've considered

No response

Additional context

The organization was discussed in #9100

@SteveLauC
Copy link
Contributor

SteveLauC commented Feb 21, 2024

Let me give this a try:

math_expressions

[ ] Abs, Acos, Asin,

Update: Abs has been taken in #9286, I will handle the remaining 2 fns

@Lordworms
Copy link
Contributor

I'll try to do regex_expressions parts.

@guojidan
Copy link
Contributor

I'll try to do array_expressions parts.

@yyy1000
Copy link
Contributor

yyy1000 commented Feb 24, 2024

I filed #9336 which I think may be necessary before we port functions.

@jayzhan211
Copy link
Contributor

work on array_concat

@guojidan
Copy link
Contributor

work on array_concat

hi @jayzhan211 I already work on array_concat #9343

@jayzhan211
Copy link
Contributor

Oh, sorry. I will work on other

@jayzhan211
Copy link
Contributor

Take ArrayHas, ArrayHasAll, ArrayHasAny

@SteveLauC
Copy link
Contributor

Take Atan, Atan2, Acosh.

@Weijun-H
Copy link
Member

Weijun-H commented Feb 29, 2024

Take ArrayPopFront, ArrayPopBack, ArrayDistinct, ArrayElement

@Weijun-H
Copy link
Member

Weijun-H commented Mar 2, 2024

Take ArrayDims, ArrayNdims, Cardinality, ArrayNdims
#9425

@jayzhan211
Copy link
Contributor

jayzhan211 commented Mar 4, 2024

Take ArrayHas, ArrayHasAll, ArrayHasAny

I think this needs to wait until make_array is done and also move to rewrite to the optimizer.

Take ArrayIntersect, ArrayUnion, ArrayExcept

Edit: general_set_op needs make_array too

Take: ArraySlice + ArrayElement

@Weijun-H
Copy link
Member

Weijun-H commented Mar 30, 2024

take Atan, Acosh, Asinh, Atanh, #9872

Michael-J-Ward added a commit to Michael-J-Ward/datafusion-python that referenced this issue May 8, 2024
`datafusion` completed an Epic that ported many of the `BuiltInFunctions` enum to `SclarUDF`.

I created new macros to simplify the port, and used these macros to refactor a few existing functions.

Ref: apache/datafusion#9285
andygrove pushed a commit to apache/datafusion-python that referenced this issue May 8, 2024
* deps: upgrade datafusion to 37.1.0

* feat: re-implement SessionContext::tables

The method was removed upstream but is used in many tests for `datafusion-python`.

Ref: apache/datafusion#9627

* feat: upgrade dataframe write_parquet and write_json

The options to write_parquet changed.

write_json has a new argument that I defaulted to None. We can expose that config later.

Ref: apache/datafusion#9382

* feat: impl new ExecutionPlanProperties for DatasetExec

Ref: apache/datafusion#9346

* feat: add upstream variant and method params

- `WindowFunction` and `AggregateFunction` have `null_treatment` options.
- `ScalarValue` and `DataType` have new variants
- `SchemaProvider::table` now returns a `Result`

* lint: allow(deprecated) for make_scalar_function

* feat: migrate functions.rs

`datafusion` completed an Epic that ported many of the `BuiltInFunctions` enum to `SclarUDF`.

I created new macros to simplify the port, and used these macros to refactor a few existing functions.

Ref: apache/datafusion#9285

* fixme: commented out last failing test

This is a bug upstream in datafusion

FAILED datafusion/tests/test_functions.py::test_array_functions - pyo3_runtime.PanicException: range end index 9 out of range for slice of length 8

* chore: update Cargo.toml package info
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.