Dev/xinli/arrow udf poc #2

Closed
wants to merge 209 commits into from

Conversation

xinlifoobar
Owner

Which issue does this PR close?

Closes #.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

jcsherin and others added 30 commits June 20, 2024 06:59
* Port `bool_and` and `bool_or` to `AggregateUDFImpl`

* Remove trait methods with default implementation

* Add `bool_or_udaf`

* Register `bool_and` and `bool_or`

* Remove from `physical-expr`

* Add expressions to logical plan roundtrip test

* minor: remove methods with default implementation

* Removes redundant tests

* Removes hard-coded function names
…r/src/analysis.rs (apache#10992)

* propagate error instead of panicking

* use macro for creating internal df error
* feat: propagate empty for more join types

* feat: update subquery de-correlation test

* tests: simplify tests

* refactor: better name

* style: clippy

* refactor: update tests

* refactor: rename

* refactor: fix spellings

* add slt tests
* Add drop_columns to dataframe api apache#11007

* Prettier cleanup

* Added additional drop_columns tests and fixed issue with nonexistent columns.
* push down non-unnest only

Signed-off-by: jayzhan211 <[email protected]>

* add doc

Signed-off-by: jayzhan211 <[email protected]>

* add doc

Signed-off-by: jayzhan211 <[email protected]>

* cleanup

Signed-off-by: jayzhan211 <[email protected]>

* rewrite unnest push down filter

Signed-off-by: jayzhan211 <[email protected]>

* remove comment

Signed-off-by: jayzhan211 <[email protected]>

* avoid double recursive

Signed-off-by: jayzhan211 <[email protected]>

---------

Signed-off-by: jayzhan211 <[email protected]>
* feat: add temporal_coercion check

* fix: add return stmt

* chore: add slts

* fix: remove println

* Update datafusion/expr/src/type_coercion/binary.rs

---------

Co-authored-by: Andrew Lamb <[email protected]>
* Deprecate OptimizerRule::try_optimize

* optimize_children

* Apply review suggestions

* Fix clippy lint
* Minor changes

* Minor changes

* Re-introduce group by expression check
* compute gcd with unsigned ints

* add test for the i64::MAX cases

* move unsigned_abs below zero test to remove unnecessary casts

* add slt test for gcd on max values instead of unit tests
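
The gcd commits above mention unsigned integers, `unsigned_abs`, and the `i64::MAX` cases. A minimal sketch of why that combination helps, assuming plain Euclid on `u64` magnitudes: `i64::MIN.abs()` overflows, while `i64::MIN.unsigned_abs()` is well defined. The names `gcd_unsigned` and `gcd_i64` are hypothetical and are not taken from the DataFusion source; returning `None` on overflow is also an assumption made for illustration.

```rust
/// Euclid's algorithm on unsigned magnitudes (hypothetical helper).
pub fn gcd_unsigned(mut a: u64, mut b: u64) -> u64 {
    while b != 0 {
        let r = a % b;
        a = b;
        b = r;
    }
    a
}

/// Hypothetical signed wrapper: takes magnitudes with `unsigned_abs`, so
/// `i64::MIN` never overflows on the way in. Returns `None` when the result
/// (2^63) cannot be represented as an `i64`.
pub fn gcd_i64(x: i64, y: i64) -> Option<i64> {
    let g = gcd_unsigned(x.unsigned_abs(), y.unsigned_abs());
    if g <= i64::MAX as u64 {
        Some(g as i64)
    } else {
        None
    }
}
```

With this sketch, `gcd_i64(i64::MIN, 6)` is computed without any signed overflow, and only the pair whose gcd is exactly 2^63 is reported as unrepresentable.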
* Add distinct_on to dataframe api apache#11011

* cargo fmt

* Update datafusion/core/src/dataframe/mod.rs as per reviewer feedback

Co-authored-by: Andrew Lamb <[email protected]>

---------

Co-authored-by: Andrew Lamb <[email protected]>
* test and implement boolean data page statistics

* left out a collect & forgot to change the Check to Both

* Update datafusion/core/src/datasource/physical_plan/parquet/statistics.rs

---------

Co-authored-by: Andrew Lamb <[email protected]>
* push down non-unnest only

Signed-off-by: jayzhan211 <[email protected]>

* add doc

Signed-off-by: jayzhan211 <[email protected]>

* to lowercase

Signed-off-by: jayzhan211 <[email protected]>

* fix tpch

Signed-off-by: jayzhan211 <[email protected]>

* Update test

* fix test

Signed-off-by: jayzhan211 <[email protected]>

---------

Signed-off-by: jayzhan211 <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
…ical-expr dependency for `datafusion-function` crate (apache#11061)

* mv to expr

Signed-off-by: jayzhan211 <[email protected]>

* upd lock

Signed-off-by: jayzhan211 <[email protected]>

---------

Signed-off-by: jayzhan211 <[email protected]>
…e#11046)

* wip

Signed-off-by: Kevin Su <[email protected]>

* add a test

Signed-off-by: Kevin Su <[email protected]>

---------

Signed-off-by: Kevin Su <[email protected]>
* feat: Add method to add analyzer rules to SessionContext

Signed-off-by: Kevin Su <[email protected]>

* Add a test

Signed-off-by: Kevin Su <[email protected]>

* Add analyze_plan

Signed-off-by: Kevin Su <[email protected]>

* update test

Signed-off-by: Kevin Su <[email protected]>

---------

Signed-off-by: Kevin Su <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
…pache#11041)

* Fix: Sort Merge Join crashes on TPCH Q21

* Fix LeftAnti SMJ join when the join filter is set

* rm dbg

* Minor: disable fuzz test to avoid CI spontaneous failures

* Minor: disable fuzz test to avoid CI spontaneous failures

* Minor: Add routine to debug join fuzz tests

* SMJ: fix streaming row concurrency issue for LEFT SEMI filtered join
apache#10701)

* Add `advanced_parquet_index.rs` example of indexing into parquet files

* pre-load page index

* fix comment

* Apply suggestions from code review

Thank you @Weijun-H

Co-authored-by: Alex Huang <[email protected]>

* Add ASCII ART

* Update datafusion-examples/README.md

Co-authored-by: Alex Huang <[email protected]>

* Update datafusion-examples/examples/advanced_parquet_index.rs

Co-authored-by: Alex Huang <[email protected]>

* Improve / clarify comments based on review

* Add page index caveat

---------

Co-authored-by: Alex Huang <[email protected]>
…he#10948)

* Add Expr::column_refs to find column references without copying

migrate some uses of to_column

* Simplify condition
* Fix sink output schema being passed in to `FileSinkExec` where input schema was expected

* Propagate CSV options (quote, double quote, and escape) through protos

* Add test for double quotes

* Test quote escape when double quotes are disabled

* regen

---------

Co-authored-by: svranesevic <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
* Draft parse_sql

* Allow string pass

* Complete sql to expr support

* Add examples

* Add unit tests

* Fix format

* Remove async for trivial operation and add parquet demo

* Fix comments

* fix comments

* fix comments

* Fix doc link
* Support dictionary data type in array_to_string

* Fix import

* Some tests

* Update datafusion/functions-array/src/string.rs

Co-authored-by: Alex Huang <[email protected]>

* Add some tests showing incorrect results

* Get logical array

* apply rust fmt

* Simplify implementation, avoid panics

---------

Co-authored-by: Alex Huang <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
* Implement min/max for interval types

* Add sqllogictests for min/max intervals

* Add tests for interval min/max

* update sql logic tests

---------

Co-authored-by: Andrew Lamb <[email protected]>
* add avg udaf

* remove avg from expr

* add test stub

* migrate avg udaf

* change avg udaf signature
remove avg phy expr

* fix tests

* fix state_fields fn

* fix ut in phy-plan aggr

* refactor Average to Avg

* refactor Average to Avg

* fix type coercion tests

* fix example and logic tests

* fix py expr failing ut

* update docs

* fix failing tests

* formatting examples

* remove duplicate code and fix uts

* addressing PR comments

* add ut for logical avg window

* fix physical plan roundtrip_window test case
lewiszlw and others added 22 commits July 10, 2024 18:58
* feat(11344): track memory used for non-parallel writes

* feat(11344): track memory usage during parallel writes

* test(11344): create bounded stream for testing

* test(11344): test ParquetSink memory reservation

* feat(11344): track bytes in file writer

* refactor(11344): tweak the ordering to add col bytes to rg_reservation, before selecting shrinking for data bytes flushed

* refactor: move each col_reservation and rg_reservation to match the parallelized call stack for col vs rg

* test(11344): add memory_limit enforcement test for parquet sink

* chore: cleanup to remove unnecessary reservation management steps

* fix: fix CI test failure due to file extension rename
* Change no-statement error message to be clearer and add tests for said change

* Run fmt to pass CI
apache#11299)

* change array agg semantic for empty result

Signed-off-by: jayzhan211 <[email protected]>

* return null

Signed-off-by: jayzhan211 <[email protected]>

* fix test

Signed-off-by: jayzhan211 <[email protected]>

* fix order sensitive

Signed-off-by: jayzhan211 <[email protected]>

* fix test

Signed-off-by: jayzhan211 <[email protected]>

* add more test

Signed-off-by: jayzhan211 <[email protected]>

* fix null

Signed-off-by: jayzhan211 <[email protected]>

* fix multi-phase case

Signed-off-by: jayzhan211 <[email protected]>

* add comment

Signed-off-by: jayzhan211 <[email protected]>

* cleanup

Signed-off-by: jayzhan211 <[email protected]>

* fix clone

Signed-off-by: jayzhan211 <[email protected]>

---------

Signed-off-by: jayzhan211 <[email protected]>
…ments (apache#11391)

* Minor: return "not supported" for COUNT DISTINCT with multiple arguments

* update condition
* update tests

* update tests

* add rustdoc

* update PartialEq impl

* fix

* address feedback about improving api
Amends apache#11394 (sorry, I should have reviewed that).

While reporting "not implemented" for "multiple statements" seems
reasonable, I think the user should get a plan error (which roughly
translates to "invalid argument") if they don't provide any statement. I
don't see any reasonable way to support "no statement" ever, hence "not
implemented" seems like a wrong promise.
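
The distinction argued above can be sketched in a few lines. This is an illustration only, assuming a hypothetical `SketchError` enum that stands in for the two error categories, and a hypothetical `single_statement` helper; neither name comes from the DataFusion code base, and the error messages are invented.

```rust
/// Hypothetical stand-in for the two error categories discussed above.
#[derive(Debug, PartialEq)]
pub enum SketchError {
    /// The request itself is invalid (roughly "invalid argument").
    Plan(String),
    /// The request is valid in principle but not supported yet.
    NotImplemented(String),
}

/// Returns the only statement, or classifies why there is not exactly one.
pub fn single_statement(mut statements: Vec<String>) -> Result<String, SketchError> {
    match statements.len() {
        // Nothing to run: no future version could support this, so it is a plan error.
        0 => Err(SketchError::Plan(
            "no SQL statement was provided".to_string(),
        )),
        1 => Ok(statements.remove(0)),
        // Several statements could plausibly be supported later.
        n => Err(SketchError::NotImplemented(format!(
            "{} statements in one query string are not supported",
            n
        ))),
    }
}
```

The point of the split is that a caller can treat `Plan` as a problem with its own input and `NotImplemented` as a limitation that may be lifted later.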
* feat: add UDF `to_local_time()`

* chore: support column value in array

* chore: lint

* chore: fix conversion for us, ms, and s

* chore: add more tests for daylight savings time

* chore: add function description

* refactor: update tests and add examples in description

* chore: add description and example

* chore: doc

chore: doc

chore: doc

chore: doc

chore: doc

* chore: stop copying

* chore: fix typo

* chore: mention that the offset varies based on daylight savings time

* refactor: parse timezone once and update examples in description

* refactor: replace map..concat with flat_map

* chore: add hard code timestamp value in test

chore: doc

chore: doc

* chore: handle errors and remove panics

* chore: move some test to slt

* chore: clone time_value

* chore: typo

---------

Co-authored-by: Andrew Lamb <[email protected]>
* initial prettier unparse

* bug fix

* handling minus and divide

* cleaning references and comments

* moved tests

* Update precedence of BETWEEN

* rerun CI

* Change precedence to match PGSQLs

* more pretty unparser tests

* Update operator precedence to match latest PGSQL

* directly prettify expr_to_sql

* handle IS operator

* correct IS precedence

* update unparser tests

* update unparser example

* update more unparser examples

* add with_pretty builder to unparser
* chore: add document for `to_local_time()`

* chore: feedback

Co-authored-by: Andrew Lamb <[email protected]>

---------

Co-authored-by: Andrew Lamb <[email protected]>
* move overlay to expr planner

* typo
* Add customizable equality and hash functions to UDFs

* Improve equals and hash_value documentation

* Add tests for parameterized UDFs
* tmp

* opt

* modify test

* add another version

* implement make_map function

* implement make_map function

* implement map function

* format and modify the doc

* add benchmark for map function

* add empty end-line

* fix cargo check

* update lock

* update lock

* fix clippy

* fmt and clippy

* support FixedSizeList and LargeList

* check type and handle null array in coerce_types

* make array value throw todo error

* fix clippy

* simplify the error tests
…valuated stats (apache#11357)

* Improve `CommonSubexprEliminate` rule with surely and conditionally evaluated stats

* remove expression tree hashing as no longer needed

* address review comments

* add negative tests
* fix(11397): do not surface errors for closed channels, and instead let the task join errors be surfaced

* fix(11397): terminate early on channel send failure