Dev/xinli/arrow udf poc #2

xinlifoobar · 2024-07-15T13:36:56Z

Which issue does this PR close?

Closes #.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

* Port `bool_and` and `bool_or` to `AggregateUDFImpl` * Remove trait methods with default implementation * Add `bool_or_udaf` * Register `bool_and` and `bool_or` * Remove from `physical-expr` * Add expressions to logical plan roundtrip test * minor: remove methods with default implementation * Removes redundant tests * Removes hard-coded function names

* feat: propagate empty for more join types * feat: update subquery de-correlation test * tests: simplify tests * refactor: better name * style: clippy * refactor: update tests * refactor: rename * refactor: fix spellings * add slt tests

* push down non-unnest only Signed-off-by: jayzhan211 <[email protected]> * add doc Signed-off-by: jayzhan211 <[email protected]> * add doc Signed-off-by: jayzhan211 <[email protected]> * cleanup Signed-off-by: jayzhan211 <[email protected]> * rewrite unnest push donw filter Signed-off-by: jayzhan211 <[email protected]> * remove comment Signed-off-by: jayzhan211 <[email protected]> * avoid double recurisve Signed-off-by: jayzhan211 <[email protected]> --------- Signed-off-by: jayzhan211 <[email protected]>

* test and implement boolean data page statistics * left out a collect & forgot to change the Check to Both * Update datafusion/core/src/datasource/physical_plan/parquet/statistics.rs --------- Co-authored-by: Andrew Lamb <[email protected]>

* push down non-unnest only Signed-off-by: jayzhan211 <[email protected]> * add doc Signed-off-by: jayzhan211 <[email protected]> * to lowercase Signed-off-by: jayzhan211 <[email protected]> * fix tpch Signed-off-by: jayzhan211 <[email protected]> * Update test * fix test Signed-off-by: jayzhan211 <[email protected]> --------- Signed-off-by: jayzhan211 <[email protected]> Co-authored-by: Andrew Lamb <[email protected]>

…ical-expr dependency for `datafusion-function` crate (apache#11061) * mv to expr Signed-off-by: jayzhan211 <[email protected]> * upd lock Signed-off-by: jayzhan211 <[email protected]> --------- Signed-off-by: jayzhan211 <[email protected]>

* feat: Add method to add analyzer rules to SessionContext Signed-off-by: Kevin Su <[email protected]> * Add a test Signed-off-by: Kevin Su <[email protected]> * Add analyze_plan Signed-off-by: Kevin Su <[email protected]> * update test Signed-off-by: Kevin Su <[email protected]> --------- Signed-off-by: Kevin Su <[email protected]> Co-authored-by: Andrew Lamb <[email protected]>

…pache#11041) * Fix: Sort Merge Join crashes on TPCH Q21 * Fix LeftAnti SMJ join when the join filter is set * rm dbg * Minor: disable fuzz test to avoid CI spontaneous failures * Minor: disable fuzz test to avoid CI spontaneous failures * Fix: Sort Merge Join crashes on TPCH Q21 * Fix LeftAnti SMJ join when the join filter is set * rm dbg * Minor: disable fuzz test to avoid CI spontaneous failures * Minor: disable fuzz test to avoid CI spontaneous failures * Minor: Add routine to debug join fuzz tests * Minor: Add routine to debug join fuzz tests * Minor: Add routine to debug join fuzz tests * Minor: Add routine to debug join fuzz tests * Minor: Add routine to debug join fuzz tests * SMJ: fix streaming row concurrency issue for LEFT SEMI filtered join * SMJ: fix streaming row concurrency issue for LEFT SEMI filtered join * SMJ: fix streaming row concurrency issue for LEFT SEMI filtered join

@Weijun-H

apache#10701) * Add `advanced_parquet_index.rs` example of indexing into parquet files * pre-load page index * fix comment * Apply suggestions from code review Thank you @Weijun-H Co-authored-by: Alex Huang <[email protected]> * Add ASCII ART * Update datafusion-examples/README.md Co-authored-by: Alex Huang <[email protected]> * Update datafusion-examples/examples/advanced_parquet_index.rs Co-authored-by: Alex Huang <[email protected]> * Improve / clarify comments based on review * Add page index caveat --------- Co-authored-by: Alex Huang <[email protected]>

* Fix sink output schema being passed in to `FileSinkExec` where input schema was expected * Propagate CSV options (quote, double quote, and escape) through protos * Add test for double quotes * Test quote escape when double quotes are disabled * regen --------- Co-authored-by: svranesevic <[email protected]> Co-authored-by: Andrew Lamb <[email protected]>

* Support dictionary data type in array_to_string * Fix import * Some tests * Update datafusion/functions-array/src/string.rs Co-authored-by: Alex Huang <[email protected]> * Add some tests showing incorrect results * Get logical array * apply rust fmt * Simplify implementation, avoid panics --------- Co-authored-by: Alex Huang <[email protected]> Co-authored-by: Andrew Lamb <[email protected]>

* add avg udaf * remove avg from expr * add test stub * migrate avg udaf * change avg udaf signature remove avg phy expr * fix tests * fix state_fields fn * fix ut in phy-plan aggr * refactor Average to Avg * refactor Average to Avg * fix type coercion tests * fix example and logic tests * fix py expr failing ut * update docs * fix failing tests * formatting examples * remove duplicate code and fix uts * addressing PR comments * add ut for logical avg window * fix physical plan roundtrip_window test case

* feat(11344): track memory used for non-parallel writes * feat(11344): track memory usage during parallel writes * test(11344): create bounded stream for testing * test(11344): test ParquetSink memory reservation * feat(11344): track bytes in file writer * refactor(11344): tweak the ordering to add col bytes to rg_reservation, before selecting shrinking for data bytes flushed * refactor: move each col_reservation and rg_reservation to match the parallelized call stack for col vs rg * test(11344): add memory_limit enforcement test for parquet sink * chore: cleanup to remove unnecessary reservation management steps * fix: fix CI test failure due to file extension rename

apache#11299) * change array agg semantic for empty result Signed-off-by: jayzhan211 <[email protected]> * return null Signed-off-by: jayzhan211 <[email protected]> * fix test Signed-off-by: jayzhan211 <[email protected]> * fix order sensitive Signed-off-by: jayzhan211 <[email protected]> * fix test Signed-off-by: jayzhan211 <[email protected]> * add more test Signed-off-by: jayzhan211 <[email protected]> * fix null Signed-off-by: jayzhan211 <[email protected]> * fix multi-phase case Signed-off-by: jayzhan211 <[email protected]> * add comment Signed-off-by: jayzhan211 <[email protected]> * cleanup Signed-off-by: jayzhan211 <[email protected]> * fix clone Signed-off-by: jayzhan211 <[email protected]> --------- Signed-off-by: jayzhan211 <[email protected]>

Amends apache#11394 (sorry, I should have reviewed that). While reporting "not implemented" for "multiple statements" seems reasonable, I think the user should get a plan error (which roughly translates to "invalid argument") if they don't provide any statement. I don't see any reasonable way to support "no statement" ever, hence "not implemented" seems like a wrong promise.
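The distinction drawn above can be sketched in Rust. This is a minimal illustration, not DataFusion's code: the `SketchError` enum, the `single_statement` helper, and the error message strings are all hypothetical names chosen for this example. It shows an empty input being rejected as a plan error (an invalid argument), while multiple statements are reported as not implemented.

```rust
/// Hypothetical error type for this sketch (not DataFusion's error enum).
#[derive(Debug, PartialEq)]
enum SketchError {
    /// The caller passed something that can never be planned.
    Plan(String),
    /// The request is valid in principle but not supported yet.
    NotImplemented(String),
}

/// Hypothetical helper: expects exactly one parsed statement.
fn single_statement(mut statements: Vec<String>) -> Result<String, SketchError> {
    match statements.len() {
        // Nothing to plan: an invalid argument, so a plan error.
        0 => Err(SketchError::Plan("no SQL statement was provided".to_string())),
        1 => Ok(statements.remove(0)),
        // Could be supported later, so "not implemented" is an honest answer.
        _ => Err(SketchError::NotImplemented(
            "multiple statements are not supported".to_string(),
        )),
    }
}
```

With this shape, callers can tell "you gave me nothing" apart from "this may work in a future version" by matching on the error variant.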

* feat: add UDF `to_local_time()` * chore: support column value in array * chore: lint * chore: fix conversion for us, ms, and s * chore: add more tests for daylight savings time * chore: add function description * refactor: update tests and add examples in description * chore: add description and example * chore: doc chore: doc chore: doc chore: doc chore: doc * chore: stop copying * chore: fix typo * chore: mention that the offset varies based on daylight savings time * refactor: parse timezone once and update examples in description * refactor: replace map..concat with flat_map * chore: add hard code timestamp value in test chore: doc chore: doc * chore: handle errors and remove panics * chore: move some test to slt * chore: clone time_value * chore: typo --------- Co-authored-by: Andrew Lamb <[email protected]>

* initial prettier unparse * bug fix * handling minus and divide * cleaning references and comments * moved tests * Update precedence of BETWEEN * rerun CI * Change precedence to match PGSQLs * more pretty unparser tests * Update operator precedence to match latest PGSQL * directly prettify expr_to_sql * handle IS operator * correct IS precedence * update unparser tests * update unparser example * update more unparser examples * add with_pretty builder to unparser

* tmp * opt * modify test * add another version * implement make_map function * implement make_map function * implement map function * format and modify the doc * add benchmark for map function * add empty end-line * fix cargo check * update lock * upate lock * fix clippy * fmt and clippy * support FixedSizeList and LargeList * check type and handle null array in coerce_types * make array value throw todo error * fix clippy * simpify the error tests

…valuated stats (apache#11357) * Improve `CommonSubexprEliminate` rule with surely and conditionally evaluated stats * remove expression tree hashing as no longer needed * address review comments * add negative tests

jcsherin and others added 30 commits June 20, 2024 06:59

feat: support uint data page extraction (apache#11018)

58d23c5

propagate error instead of panicking on out of bounds in physical-exp…

5316278

…r/src/analysis.rs (apache#10992) * propogate error instead of panicking * use macro for creating internal df error

Minor: Add more docs and examples for Transformed and `TransformedR…

1155b0b

…esult` (apache#11003)

doc: Update links in the documentation (apache#11044)

1f3ba11

Add drop_columns to dataframe api (apache#11010)

5498a02

* Add drop_columns to dataframe api apache#11007 * Prettier cleanup * Added additional drop_columns tests and fixed issue with nonexistent columns.

Consider timezones with UTC and +00:00 to be the same (apache#10960)

4a0c7f3

* feat: add temporal_coercion check * fix: add return stmt * chore: add slts * fix: remove println * Update datafusion/expr/src/type_coercion/binary.rs --------- Co-authored-by: Andrew Lamb <[email protected]>

Deprecate OptimizerRule::try_optimize (apache#11022)

6dffc53

* Deprecate OptimizerRule::try_optimize * optimize_children * Apply review suggestions * Fix clippy lint

Relax combine partial final rule (apache#10913)

098ba30

* Minor changes * Minor changes * Re-introduce group by expression check

Compute gcd with u64 instead of i64 because of overflows (apache#11036)

8aad936

* compute gcd with unsigned ints * add test for the i64::MAX cases * move unsigned_abs below zero test to remove unnecessary casts * add slt test for gcd on max values instead of unit tests
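The idea behind this commit can be sketched as follows. This is an illustration under assumptions, not the code from the commit: `gcd_u64` and `gcd_i64` are hypothetical names, and returning `None` on overflow is a choice made for this example. Taking `unsigned_abs()` first avoids the overflow that `abs()` hits on `i64::MIN`, and the only result that cannot be converted back to `i64` is a magnitude of 2^63.

```rust
/// Euclid's algorithm on unsigned magnitudes.
fn gcd_u64(mut a: u64, mut b: u64) -> u64 {
    while b != 0 {
        let r = a % b;
        a = b;
        b = r;
    }
    a
}

/// Hypothetical helper: gcd of two signed values, computed on u64 magnitudes.
/// Returns `None` when the result does not fit in an i64.
fn gcd_i64(x: i64, y: i64) -> Option<i64> {
    // `unsigned_abs` is well defined for i64::MIN, unlike `abs`.
    let g = gcd_u64(x.unsigned_abs(), y.unsigned_abs());
    i64::try_from(g).ok()
}
```

For example, `gcd_i64(i64::MIN, 2)` yields `Some(2)`, whereas `gcd_i64(i64::MIN, 0)` yields `None` because 2^63 is not representable as an `i64`.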

Add distinct_on to dataframe api (apache#11012)

30a6ed5

* Add distinct_on to dataframe api apache#11011 * cargo fmt * Update datafusion/core/src/dataframe/mod.rs as per reviewer feedback Co-authored-by: Andrew Lamb <[email protected]> --------- Co-authored-by: Andrew Lamb <[email protected]>

chore: add test to show current behavior of string to timezone vs. ti…

ce4940d

…mestamp to timezone (apache#11056)

Using display_name for Expr::Aggregation (apache#11020)

a4799c0

Support to unparse ScalarValue::TimestampMillisecond to String (apach…

81611ad

…e#11046) * wip Signed-off-by: Kevin Su <[email protected]> * add a test Signed-off-by: Kevin Su <[email protected]> --------- Signed-off-by: Kevin Su <[email protected]>

support to unparse interval to string (apache#11065)

8a98307

Add Expr::column_refs to find column references without copying (apac…

98373ab

…he#10948) * Add Expr::column_refs to find column references without copying migrate some uses of to_column * Simplify condition

Give OptimizerRule::try_optimize default implementation and cleanup…

9f8b731

… duplicated custom implementations (apache#11059)

Support parsing SQL strings to Exprs (apache#10995)

6f10dbc

* Draft parse_sql * Allow stirng pass * Complete sql to expr support * Add examples * Add unit tests * Fix format * Remove async for trival operation and add parquet demo * Fix comments * fix comments * fix comments * Fix doc link

Implement min/max for interval types (apache#11015)

c2ea6b3

* Implement min/max for interval types * Add sqllogictests for min/max intervals * Add tests for interval min/max * update sql logic tests --------- Co-authored-by: Andrew Lamb <[email protected]>

lewiszlw and others added 22 commits July 10, 2024 18:58

Enable clone_on_ref_ptr clippy lint on common (apache#11384)

585504a

Minor: remove clones and unnecessary Arcs in from_substrait_rex (ap…

32cb3c5

…ache#11337)

Minor: Change no-statement error message to be clearer (apache#11394)

cc7484e

* Change no-statement error message to be clearer and add tests for said change * Run fmt to pass CI

Minor: return "not supported" for COUNT DISTINCT with multiple argu…

7a23ea9

…ments (apache#11391) * Minor: return "not supported" for COUNT DISTINCT with multiple arguments * update condition

feat: Add fail_on_overflow option to BinaryExpr (apache#11400)

2413155

* update tests * update tests * add rustdoc * update PartialEq impl * fix * address feedback about improving api
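A rough sketch of what a fail-on-overflow switch means for a single addition, assuming the option selects between checked and wrapping arithmetic. The function `add_i32` and its error string are hypothetical and are not part of the `BinaryExpr` API; the commit itself should be consulted for the real behavior.

```rust
/// Hypothetical helper: add two i32 values, either failing or wrapping on overflow.
fn add_i32(lhs: i32, rhs: i32, fail_on_overflow: bool) -> Result<i32, String> {
    if fail_on_overflow {
        // Checked arithmetic: overflow becomes an error the caller can surface.
        lhs.checked_add(rhs)
            .ok_or_else(|| format!("arithmetic overflow: {lhs} + {rhs}"))
    } else {
        // Wrapping arithmetic: overflow silently wraps around.
        Ok(lhs.wrapping_add(rhs))
    }
}
```

With `fail_on_overflow` set to `false`, `add_i32(i32::MAX, 1, false)` wraps to `i32::MIN`; with it set to `true`, the same call returns an error.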

Enable clone_on_ref_ptr clippy lint on sql (apache#11380)

ed65c11

Move configuration information out of example usage page (apache#11300)

0b2eb50

reuse a single function to create the tpch test contexts (apache#11396)

faa1e98

Add to_local_time() in function reference docs (apache#11401)

e19dd2d

* chore: add document for `to_local_time()` * chore: feedback Co-authored-by: Andrew Lamb <[email protected]> --------- Co-authored-by: Andrew Lamb <[email protected]>

Move overlay planning to ExprPlanner (apache#11398)

4402a1a

* move overlay to expr planner * typo

Coerce types for all union children plans when eliminating nesting (a…

d314ced

…pache#11386)

Add customizable equality and hash functions to UDFs (apache#11392)

4bed04e

* Add customizable equality and hash functions to UDFs * Improve equals and hash_value documentation * Add tests for parameterized UDFs

fix(11397): surface proper errors in ParquetSink (apache#11399)

1dfac86

* fix(11397): do not surface errors for closed channels, and instead let the task join errors be surfaced * fix(11397): terminate early on channel send failure

stage progress

983664a

stage progress

1676e93

github-actions bot added sql logical-expr physical-expr optimizer core substrait sqllogictest labels Jul 15, 2024

xinlifoobar closed this Jul 15, 2024
