15.0.0 (2022-12-01)
Breaking changes:
- Expose remaining parquet config options into ConfigOptions (try 2) #4427 (alamb)
- Config Cleanup: Remove TaskProperties and KV structure, keep key=value serialization #4382 (alamb)
- add
{TDigest,ScalarValue,Accumulator}::size
#4342 (crepererum) - API-break: Support
SubqueryAlias
and removeAlias in Projection
#4333 [sql] (jackwener) - split
try_new_with_schema_alias
from original code #4284 (jackwener) - Collapse statistics in normal explain plan #4157 (alamb)
- Linearize binary expressions to reduce proto tree complexity #4115 (isidentical)
- support
SET Timezone
#4107 [sql] (waitingkuo)
Implemented enhancements:
- Refactor Built-in, Aggregate window functions to increase code reuse. #4440
- Helper to get "root" error #4435
- Do NOT convert intermediate/source errors to strings. #4434
- Estimate the
total_byte_size
of the filter expression's result when selectivity is available #4374 - refactor the code of the
HashJoin
#4356 CoalesceBatchesExec
reports no ordering #4331- Introduce tournament tree to achieve better k-way sort-merging #4300
- Add a checker to confirm physical optimizer rules will keep the physical plan schema immutable #4299
- Remove the macro rule
unary_scalar_expr
fromexpr_fn.rs
#4298 - Remove Alias-in-Projection, replace it with
SubqueryAlias
#4291 - reimplement
reduce_outer_join
#4270 - Reimplement
filter_push_down
#4266 - Reimplement
eliminate_limit
#4264 - Reimplement
limit_push_down
#4263 - Make a data driven SQL testing tool (so we can reuse duckdb test suite, example) #4248
- upgrade chrono to 0.4.23 #4224
- support scan non-string columns partitioned parquet files #4218
- Allow optimizer rules to skip optimizing plans #4209
- Supporting specifying schema when create tables #4183
- Improve ergonomics of creating
ListingOptions
#4178 - Add ability to specify external sort information for ParquetExec #4169
- Add another method to collect referenced columns from an expression #4152
- Improve
EXPLAIN ANALYZE
output for parquet exec #4144 TableProviderFactory::create
should haveOptional<DFSchemaRef>
parameter #4142- Support more expressions in equality join #4140
- JoinSelection Rule to choose physical join implementation: HashJoin(Partitioned or CollectLeft) or SortMergeJoin base on Stats #4139
- Allow TPCH tooling to create a combined result for easier processing by outside tools #4127
- Allow additional options when creating an external table #4125
- reuse code utils::optimize_children instead of redundant implementation #4120
- Add test field to PR template #4113
- Allow for automatic registration of
ListingTables
#4111 - Add CI check that configs.md is up-to-date #4108
- Support
SET
timezone to non-UTC time zone #4106 - Parquet predicates contains
and true
expressions #4091 - Replace RwLock<HashMap> and Mutex<HashMap> by using DashMap #4077
- add support for
.xz
compressed files #4074 - add a feature gate to make support for compressed files optional #4073
- Support serializing more deeply nested AND / OR expressions #4066
- Use f64::total_cmp instead of OrderedFloat #4051
- Add documentation to make it clear that decimal support is still experimental #4036
- Simplify Pushed Down Predicates #4020
- Improve HashJoinExec metrics #4009
- Move physical plan serde from Ballista to DataFusion #3949
- Support
SubqueryAlias
better in planner #3927 - A framework for expression boundary analysis (and statistics) #3898
- Replace
Filter: Boolean(false)
withEmptyRelation
#3864 - Implement statistics estimation for
FilterExec
#3845 - Support parquet page filtering for more types: String, Binary(Decimal), Int96 #3833
- Allow configuring parquet filter pushdown dynamically #3821
- Unable to register tables in non-cloud S3 servers #3640
- support more data type in prune for cast/try_cast #3442
- Disable spill to disk globally #3264
- Consider to categorize Operator #3216
- Replace Projection.alias with SubqueryAlias #2212
- [Optimizer] Eliminate the distinct #2045
- beautify datafusion's site: https://datafusion.apache.org/ #1819
- split datafusion-logical-plan sub-module #1755
- convert
outer join
toinner join
to improve performance #1585 - Add sqllogictest for datafusion #1453
- Add additional simplification rules #1406
- support more subqueries #1209
- Add baseline metrics for remaining execution plan nodes #1019
- Make
ExecutionPlan
implementations immutable #987 - Architecture overview may be insufficient in README #980
- Add a separate configuration setting for parallelism of scanning parquet files #924
- Support hash repartion elimination #41
Fixed bugs:
pyarrow
CI failed #4448UnwrapCastInComparison
exist bug #4430- The CLI panics when passing an invalid
explain
query #4378 - HashJoin should return Err when the right side input stream produce Err #4362
- Optimizer check errors if resulting schema has different metadata #4346
- Panic with function
to_hex
#4339 LimitPushDown
pushdown into limit, result is wrong #4308- DESCRIBE statement issue with qualified table references #4303
- Panic with window function LAST_VALUE #4297
- CI failed in
Compare to postgres
#4294 - Field alias can't work in where clause #4288
- Some valid filters are not pushed down to parquet scan #4282
- The type renaming
pub type NullColumnarValue = ColumnarValue
makes no sense #4271 - Current
limit_push_down
can't support cross_join #4256 - Cargo test fail #4253
- RightSemi/RightAnti HashJoin has bug, the left_indices is never populated, causing failure to apply join filters. #4247
- Clippy failures #4245
- Cannot query s3 data from datafusion-cli #4239
- Bug parsing interval with negative values #4237
cargo test
reports errors on the master branch. #4236- Doc of the expression function
log2
is incorrect #4231 - HashJoin with mode PartitionMode:CollectLeft has bug and can produce wrong result #4230
- Add ambiguous check when generate projection plan #4210
- What happened for NDJSON support on CLI? #4198
- Add ambiguous check when generate join plan #4197
- Clippy failing on master : error: use of deprecated associated function
chrono::NaiveDate::from_ymd
: usefrom_ymd_opt()
instead #4187 - Reimplement the
eliminate_cross_join
#4176 - Incorrect handling of column names #4166
- Update release scripts to support datafusion-benchmarks #4134
- Bug in interpreting correctly parsed SQL with aliases #4123
- The percentile argument for ApproxPercentileCont must be Float64, not Decimal128(2, 1) #4103
- Panic when using array_agg #4080
- Wrong result for FIRST_VALUE AND LAST_VALUE window functions #4076
- Round error when casting float to decimal #4071
- Predicate still has cast when comparing Timestamp(Nano, None) to a timestamp literal, so can't be pushed down or used for pruning #3938
- Revisit required_child_distribution(), output_partitioning(), output_ordering() implementations in ExecutionPlan's implementations #3653
- Can't push down projection after do type coercion #3583
- In some circumstances cast expression is not working #3499
- output_partitioning() and output_ordering() implementations are wrong in some physical plan implementations with alias #3400
- Interval Literal doesn't work for timeunit less than millisecond #3204
INTERVAL
literal with duplicated interval types should raise error #3183- Error occurs when only using partition columns in query #1999
- regex_match does not compile using the
g
flag #1429 between
with NULL literals does not work: can't be evaluated because there isn't a common type to coerce the types to #1193- [Datafusion] Error with CAST: Unsupported SQL type Time #193
Closed issues:
- SQL level coverage for when memory limit is exceeded #4404
- Throw error (not
panic
) if a listing table specifies an missing partition column #4350 - Page index pruning fail on complex_expr #4317
- optimize
limit-full join
in the limit push down rule #4275 infer_schema
function is not working with s3 Urls or http endpoints #4269- Add support binary boolean operators with nulls #4241
- Add additional testing to parquet predicate pushdown integration tests #4087
- Add metrics for parquet page level skipping #4086
- Add parquet page index pushdown metrics #4058
- Throw a runtime error if the memory allocated to GroupByHash exceeds a limit #3940
- support unsigned numeric data type in UnwrapCastInBinaryComparison rule #3702
- Support type cast in union #2125
- [EPIC] Memory Limited Sort (Externalized / Spill) #1568
- Maintain partition information in Union #189
- Add coercion support for
NULL
literals #185
Merged pull requests:
- Make
datafusion-sql
depend onarrow-schema
instead ofarrow
#4456 [sql] (mbrobbel) - replace the comparator for
decimal array op scalar
using arrow kernel #4453 (liukun4515) - Fix pyarrow test #4450 (mvanschellebeeck)
- Replace
&Option<T>
withOption<&T>
#4446 [sql] (askoa) - Improve error handling for array downcasting #4445 (retikulum)
- Refactor Builtin Window Function Implementation #4441 (mustafasrepo)
- feat:
DataFusionError::find_root
#4437 (crepererum) - fix: do NOT convert errors to strings but keep the type #4436 (crepererum)
- The CLI panics when passing an invalid explain query #4429 (comphead)
- [minor] use arrow kernel concat_batches instead combine_batches #4423 (Ted-Jiang)
- fix panic on to_hex function for negative numbers #4422 (retikulum)
- Optimize filter executor in pull-based executor #4421 (xudong963)
- optimize limit push for join case #4411 (liukun4515)
- Add integration test for erroring when memory limits are hit #4406 (alamb)
- feat:
ResourceExhausted
for memory limit inAggregateStream
#4405 (crepererum) - Update to arrow 28 #4400 [sql] (tustvold)
- Update rstest requirement from 0.15.0 to 0.16.0 #4399 (dependabot[bot])
- Add sqllogictests (v0) #4395 (mvanschellebeeck)
- improve hashjoin execution metrics #4394 (AssHero)
- Add
with_new_inputs
for LogicalPlan #4393 (jackwener) - Clean the code in
limit.rs
. #4391 (HaoYang670) - Move physical plan serde from Ballista to DataFusion #4390 (Kikkon)
- Fix page index pruning fail on complex_expr #4387 (Ted-Jiang)
- Add check for nested types in equivalent names and types #4380 (alamb)
- refine the code of build schema for ambiguous check, factor this out into a function #4379 [sql] (AssHero)
- Refactor the Hash Join #4377 (liukun4515)
- Minor: Fix typos in the documentation #4376 (martin-g)
- Include byte size estimates in the filter statistics #4375 (isidentical)
- HashJoin should return Err when the right side input stream produce Err, add more join UTs to cover different join types #4373 [sql] (mingmwang)
- feat:
ResourceExhausted
for memory limit inGroupedHashAggregateStream
#4371 (crepererum) - Use limit() function instead of show_limit() in the first example #4369 (martin-g)
- Update env_logger requirement from 0.9 to 0.10 #4367 (dependabot[bot])
- reimplement
push_down_filter
to remove global-state #4365 (jackwener) - Support to use Schedular in tpch benchmark #4361 (xudong963)
- Adding more dataframe example to read csv files #4360 (DataPsycho)
- minor: correct name and typo #4359 (jackwener)
- Do not log error if page index can not be evaluated #4358 (alamb)
- Clean the
expr_fn
- usescalar_expr
to create unary scalar expr functions, remove macrounary_scalar_functions
#4357 (HaoYang670) - Throw error (not
panic
) if a listing table specifies an missing partition column #4354 (doki23) - Improve error handling and add some more types for proper downcasting #4352 (retikulum)
- Add check to avoid underflow in memory manager #4351 (askoa)
- Improve error messages when memory is exhausted while sorting #4348 (alamb)
- Do not error in optimizer if resulting schema has different metadata #4347 (alamb)
- minor: improve optimizer logging and do not repeat rule name #4345 (alamb)
- minor: fix typos in test names #4344 [sql] (alamb)
- Minor: Add docstrings to
EliminateOuterJoins
optimizer pass #4343 (alamb) - Minor: refactor: isolate common memory accounting utils #4341 (crepererum)
- minor: make
plan_from_tables
return one plan instead ofVec
#4336 [sql] (jackwener) - enhancement: when fetch == 0, pushdown limit 0 instead skip+fetch. #4334 (jackwener)
- Teach optimizer that
CoalesceBatchesExec
does not destroy output order #4332 (alamb) - Add ability to disable DiskManager #4330 (tustvold)
- Update cli.md #4329 (psvri)
- fix bug: right semi join can't support the filter #4327 (liukun4515)
- reimplment
eliminate_limit
to removeglobal-state
. #4324 (jackwener) - Refine Err propagation and avoid unwrap in transform closures #4318 (mingmwang)
- Add a checker to confirm physical optimizer rules will keep the physical plan schema immutable #4316 (mingmwang)
- Refactor downcasting functions with downcastvalue macro and improve error handling of
ListArray
downcasting #4313 (retikulum) - minor: add another test case to cover join ambiguous check #4305 [sql] (ygf11)
- Fix DESCRIBE statement qualified table issue #4304 [sql] (gruuya)
- Use tournament loser tree for k-way sort-merging, increase merge speed by 50% #4301 (richox)
- Pin Python
setuptools
in the CI to fix integration tests #4296 (isidentical) - Support
SubqueryAlias
in optimizer, physcial planner. #4293 (jackwener) - minor: avoid a clone into string when checking ambiguous #4292 [sql] (ygf11)
- replace the comparison op for decimal array op using the arrow-rs kernel #4290 (liukun4515)
- MINOR: replace
{..}
with(_)
, typo, remove outdated TODO #4286 (jackwener) - Reduce Expr copies in
ParquetExec
#4283 (alamb) - Fix issue in filter pushdown with overloaded projection index #4281 (thinkharderdev)
- Skip useless pruning predicates in
ParquetExec
#4280 (alamb) - Push down more predicates into
ParquetExec
#4279 (alamb) - Fix EXPLAIN plan for ParquetExec to show pruning_predicate #4278 (alamb)
- reimplement
limit_push_down
to remove global-state, enhance optimize and simplify code. #4276 (jackwener) - Bump actions/labeler from 4.0.2 to 4.1.0 #4274 (dependabot[bot])
- Remove the type alias
NullColumnarValue
#4273 (HaoYang670) - reimplement
eliminate_outer_join
#4272 (jackwener) - Fix bugs in parsing
with header row
andpartitioned by
#4268 [sql] (HaoYang670) - improve error messages while downcasting
UInt32Array
,UInt64Array
andBooleanArray
#4261 (retikulum) - add ambiguous check for projection #4260 [sql] (AssHero)
- Add ambiguous check for join #4258 [sql] (ygf11)
- support cross_join in
limit_push_down
#4257 (jackwener) - Support parquet page filtering on min_max for
decimal128
andstring
columns #4255 (Ted-Jiang) - fix conflict and UT, cleanup redundant legacy code #4252 (jackwener)
- Minor: remove unecessary clone() in planner #4249 [sql] (alamb)
- Fix nightly clippy failures #4246 (mvanschellebeeck)
- Improve Error Handling and Readibility for downcasting
Float32Array
,Float64Array
,StringArray
#4244 (retikulum) - Use defaults for ListingOptions builder #4243 (mvanschellebeeck)
- Support binary boolean operators with nulls #4242 (Ted-Jiang)
- Fixing doc of the expression #4240 (Creampanda)
- Fix negative interval parsing bug #4238 (Jefffrey)
- remove duplicate or redundant code #4235 (jackwener)
- add a checker to confirm optimizer can keep plan schema immutable. #4233 (jackwener)
- Fix the percentile argument for ApproxPercentileCont must be Float64, not Decimal128(2, 1) #4228 (comphead)
- refactor how we create listing tables #4227 (timvw)
- Update sqlparser requirement from 0.26 to 0.27 #4226 [sql] (alamb)
- upgrade required chrono version to 0.4.23 #4225 (waitingkuo)
- Support types other than String for partition columns on ListingTables #4221 (doki23)
- [CBO] JoinSelection Rule, select HashJoin Partition Mode based on the Join Type and available statistics, option for SortMergeJoin #4219 (mingmwang)
- Remove alias in Union #4212 (jackwener)
- Add try_optimize method #4208 (andygrove)
- Provide a builder for ListingOptions with fixups #4207 (alamb)
- Avoid error with empty iterators used for
ScalarValue::iter_to_array
#4206 (GrandChaman) - Improve error message for regexp_match 'g' flag #4203 (Jefffrey)
- Return
ResourceExhausted
errors when memory limit is exceed inGroupedHashAggregateStreamV2
(Row Hash) #4202 (crepererum) - Add additional expr boolean simplification rules #4200 (Jefffrey)
- Update to arrow and parquet 27.0.0 #4199 [sql] (tustvold)
- Support
create table
with explicit column definitions #4194 [sql] (doki23) - Support all equality predicates in equality join #4193 [sql] (ygf11)
- add
propagate_empty_relation
optimizer rule #4192 (jackwener) - fix clippy #4190 [sql] (jackwener)
- Fix clippy by avoiding deprecated functions in chrono #4189 (alamb)
- Disallow duplicate interval types during parsing #4188 (Jefffrey)
- Parse nanoseconds for intervals #4186 (Jefffrey)
- Add rule to reimplement
Eliminate cross join
and remove it in planner #4185 [sql] (jackwener) - [FOLLOWUP] Enforcement Rule: resolve review comments, refactor adjust_input_keys_ordering() #4184 (mingmwang)
- Simplify boolean parquet pushdown predicate #4182 (Jefffrey)
- Minor: consolidate parquet
custom_reader
integration test into parquet_exec #4175 (alamb) - minor: remove redundant println and cleanup #4173 (jackwener)
- Add ability to specify external sort information for ListingTables #4170 (alamb)
- Improve Error Handling and Readibility for downcasting
Decimal128Array
#4168 (retikulum) - Minor: Remove completed comment on parquet row group pruning #4167 (alamb)
- Update hashbrown requirement from 0.12 to 0.13 #4164 (dependabot[bot])
- MINOR: enable
dyn_cmp_dict
feature on arrow for physical expr crate #4163 (isidentical) - Derive filter statistic estimates from the predicate expression #4162 (isidentical)
- Minor: pass
ParquetFileMetrics
tobuild_row_filter
in parquet #4161 (alamb) - Minor: Extract parquet row group pruning code into its own module #4160 (alamb)
- Full support for time32 and time64 literal values (
ScalarValue
) #4156 (andre-cc-natzka) - Window frame GROUPS mode support #4155 (zembunia)
- Improve error messages while downcasting Int64Array #4154 (retikulum)
- Add another method to collect referenced columns from an expression #4153 [sql] (ygf11)
- Remove BoxedAsyncFileReader #4150 (tustvold)
- Support unsigned integers in
unwrap_cast_in_comparison
Optimizer rule #4149 (alamb) - Add support for
DataType::Timestamp
casts inunwrap_cast_in_comparison
optimizer pass #4148 (alamb) - Add additional testing for
unwrap_cast_in_comparison
#4147 (alamb) - improve error messages while downcasting Int32Array #4146 (retikulum)
- Minor: Update docstring on unwrap_cast_in_comparison #4145 (alamb)
- add schema parameter to table provider factory create method #4143 (milenkovicm)
- fix: shouldn't pass alias through into subquery. #4141 [sql] (jackwener)
- Preserve the
Cast
expression incolumnize_expr
#4137 [sql] (HaoYang670) - Set versions to dependencies with path in benchmarks Cargo.toml file #4136 (ArkashaJavelin)
- Fix links #4135 (mvanschellebeeck)
- Use f64::total_cmp instead of OrderedFloat #4133 (comphead)
- Add parquet integration tests for explicitly smaller page sizes, page pruning #4131 (alamb)
- Consolidate
ParquetExec
tests inparquet_exec
integration test #4130 (alamb) - Minor: Use upstream
BooleanArray::true_count
#4129 (alamb) - Combined TPCH runs & uniformed summaries for benchmarks #4128 (isidentical)
- Enable TableProviderFactories to receive additional options when creating an external table #4126 [sql] (timvw)
- Add CI check that configs.md is up-to-date #4124 (mvanschellebeeck)
- [Part3] Partition and Sort Enforcement, Enforcement rule implementation #4122 (mingmwang)
- reuse code
utils::optimize_children
but affect inline. #4121 (jackwener) - reuse code
utils::optimize_children
instead of redundant implementation #4119 (jackwener) - Allow listing tables to be created via TableFactories #4112 (avantgardnerio)
- Update SQL reference to state that decimal support is currently experimental #4109 (andygrove)
- Add metrics for parquet page level skipping #4105 (Ted-Jiang)
- Add parser option for parsing SQL numeric literals as decimal #4102 [sql] (andygrove)
- Replace RwLock<HashMap> and Mutex<HashMap> by using DashMap #4079 (yahoNanJing)
- Custom window frame support extended to built-in window functions #4078 (mustafasrepo)
- Enable tests for page index filtering in parquet filter pushdown test #4062 (alamb)
- [Part2] Partition and Sort Enforcement, ExecutionPlan enhancement #4043 (mingmwang)
- add support for xz file compression and
compression
feature #3993 [sql] (Jimexist) - Expression boundary analysis framework #3912 (isidentical)