Change log

Generated on 2024-07-18

Release 24.06

Features


#10850	[FEA] Refine the test framework introduced in #10745
#6969	[FEA] Support parse_url
#10496	[FEA] Drop support for CentOS7
#10760	[FEA] Support ArrayFilter
#10721	[FEA] Dump the complete set of build-info properties to the Spark eventLog
#10666	[FEA] Create Spark 3.4.3 shim

Performance


#8963	[FEA] Use custom kernel for parse_url
#10817	[FOLLOW ON] Combining regex parsing in transpiling and regex rewrite in `rlike`
#10821	Rewrite `pattern[A-B]{X,Y}` (a pattern string followed by X to Y chars in range A - B) in `RLIKE` to a custom kernel
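
As a rough illustration of the rewrite described in #10821 (not the plugin's actual kernel), the sketch below checks `pattern[A-B]{X,Y}` without a regex engine. It assumes `RLIKE` uses unanchored "find" semantics, in which case the upper bound Y does not change the boolean result: a match exists as soon as the literal is followed by at least X characters in the range. The function name and parameters are hypothetical.

```python
import re


def prefix_range_match(text, prefix, lo, hi, min_count):
    """Return True if `prefix` occurs in `text` followed immediately by at
    least `min_count` characters in the inclusive range [lo, hi].

    Hypothetical helper for illustration only; it is not part of any library.
    """
    start = text.find(prefix)
    while start != -1:
        begin = start + len(prefix)
        tail = text[begin:begin + min_count]
        # Need a full window of min_count characters, all inside the range.
        if len(tail) == min_count and all(lo <= ch <= hi for ch in tail):
            return True
        # Try the next occurrence of the literal prefix.
        start = text.find(prefix, start + 1)
    return False


# Compare against the regex engine on a few inputs.
for sample in ["xxabc12", "abc1", "abcabc99", "zabc1234567"]:
    expected = bool(re.search(r"abc[0-9]{2,4}", sample))
    assert prefix_range_match(sample, "abc", "0", "9", 2) == expected
```

The comparison loop only demonstrates agreement with Python's `re.search` on these inputs; null handling and non-ASCII ranges are out of scope for the sketch.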

Bugs Fixed


#10928	[BUG] 24.06 test_conditional_with_side_effects_case_when test failed on Scala 2.13 with DATAGEN_SEED=1716656294
#10941	[BUG] Failed to build on databricks due to GpuOverrides.scala:4264: not found: type GpuSubqueryBroadcastMeta
#10902	Spark UT failed: SPARK-37360: Timestamp type inference for a mix of TIMESTAMP_NTZ and TIMESTAMP_LTZ
#10899	[BUG] format_number Spark UT failed because Type conversion is not allowed
#10913	[BUG] rlike with empty pattern failed with 'NoSuchElementException' when enabling regex rewrite
#10774	[BUG] Issues found by Spark UT Framework on RapidsRegexpExpressionsSuite
#10606	[BUG] Update Plugin to use the new `getPartitionedFile` method
#10806	[BUG] orc_write_test.py::test_write_round_trip_corner failed with DATAGEN_SEED=1715517863
#10831	[BUG] Failed to read data from iceberg
#10810	[BUG] NPE when running `ParseUrl` tests in `RapidsStringExpressionsSuite`
#10797	[BUG] udf_test test_single_aggregate_udf, test_group_aggregate_udf and test_group_apply_udf_more_types failed on DB 13.3
#10719	[BUG] test_exact_percentile_groupby FAILED: hash_aggregate_test.py::test_exact_percentile_groupby with DATAGEN seed 1713362217
#10738	[BUG] test_exact_percentile_groupby_partial_fallback_to_cpu failed with DATAGEN_SEED=1713928179
#10768	[DOC] Dead links with tools pages
#10751	[BUG] Cascaded Pandas UDFs not working as expected on Databricks when plugin is enabled
#10318	[BUG] `fs.azure.account.keyInvalid` configuration issue while reading from Unity Catalog Tables on Azure DB
#10722	[BUG] "Could not find any rapids-4-spark jars in classpath" error when debugging UT in IDEA
#10724	[BUG] Failed to convert string with invisible characters to float
#10633	[BUG] ScanJson and JsonToStructs can give almost random errors
#10659	[BUG] from_json ArrayIndexOutOfBoundsException in 24.02
#10656	[BUG] Databricks cache tests failing with host memory OOM

PRs


#11221	Change cudf version back to 24.06.0-SNAPSHOT [skip ci]
#11217	Update latest changelog [skip ci]
#11211	Use fixed seed for test_from_json_struct_decimal
#11203	Update version to 24.06.1-SNAPSHOT
#11205	Update docs for 24.06.1 release [skip ci]
#11056	Update latest changelog [skip ci]
#11052	Add spark343 shim for scala2.13 dist jar
#10981	Update latest changelog [skip ci]
#10984	[DOC] Update docs for 24.06.0 release [skip ci]
#10974	Update rapids JNI and private dependency to 24.06.0
#10947	Prevent contains-PrefixRange optimization if not preceded by wildcards
#10934	Revert "Add Support for Multiple Filtering Keys for Subquery Broadcast"
#10870	Add support for self-contained profiling
#10903	Use upper case for LEGACY_TIME_PARSER_POLICY to fix a spark UT
#10900	Fix type convert error in format_number scalar input
#10868	Disable default cuDF pinned pool
#10914	Fix NoSuchElementException when rlike with empty pattern
#10858	Add Support for Multiple Filtering Keys for Subquery Broadcast
#10861	refine ut framework including Part 1 and Part 2
#10872	[DOC] ignore released plugin links to reduce the bother info [skip ci]
#10839	Replace anonymous classes for SortOrder and FilterExec overrides
#10873	Auto merge PRs to branch-24.08 from branch-24.06 [skip ci]
#10860	[Spark 4.0] Account for `PartitionedFileUtil.getPartitionedFile` signature change.
#10822	Rewrite regex pattern `literal[a-b]{x}` to custom kernel in rlike
#10833	Filter out unused json_path tokens
#10855	Fix auto merge conflict 10845 [skip ci]
#10826	Add NVTX ranges to identify Spark stages and tasks
#10846	Update latest changelog [skip ci]
#10836	Catch exceptions when trying to examine Iceberg scan for metadata queries
#10824	Support zstd for GPU shuffle compression
#10828	Added DateTimeUtilsShims [Databricks]
#10829	Fix `Inheritance Shadowing` to add support for Spark 4.0.0
#10811	Fix NPE in GpuParseUrl for null keys.
#10723	Implement chunked ORC reader
#10715	Rewrite some rlike expression to StartsWith/Contains
#10820	Work around #10801 temporarily
#10812	Replace ThreadPoolExecutor creation with ThreadUtils API
#10816	Fix a test error for DB13.3
#10813	Fix the errors for Pandas UDF tests on DB13.3
#10795	Remove fixed seed for exact `percentile` integration tests
#10805	Drop Support for CentOS 7
#10800	Add number normalization test and address followup for getJsonObject
#10796	fixing build break on DBR
#10791	Fix auto merge conflict 10779 [skip ci]
#10636	Update actions version [skip ci]
#10743	initial PR for the framework reusing Vanilla Spark's unit tests
#10767	Add rows-only batches support to RebatchingRoundoffIterator
#10763	Add in the GpuArrayFilter command
#10766	Fix dead links related to tools documentation [skip ci]
#10644	Add logging to Integration test runs in local and local-cluster mode
#10756	Fix Authorization Failure While Reading Tables From Unity Catalog
#10752	Add SparkRapidsBuildInfoEvent to the event log
#10754	Substitute whoami for $USER
#10755	[DOC] Update README for prioritize-commits script [skip ci]
#10728	Let big data gen set nullability recursively
#10740	Use parse_url kernel for PATH parsing
#10734	Add short circuit path for get-json-object when there is separate wildcard path
#10725	Initial definition for Spark 4.0.0 shim
#10635	Use new getJsonObject kernel for json_tuple
#10739	Use fixed seed for some random failed tests
#10720	Add Shims for Spark 3.4.3
#10716	Remove the mixedType config for JSON as it has no downsides any longer
#10733	Fix "Could not find any rapids-4-spark jars in classpath" error when debugging UT in IDEA
#10718	Change parameters for memory limit in Parquet chunked reader
#10292	Upgrade to UCX 1.16.0
#10709	Removing some authorizations for departed users [skip ci]
#10726	Append new authorized user to blossom-ci whitelist [skip ci]
#10708	Updated dump tool to verify get_json_object
#10706	Fix auto merge conflict 10704 [skip ci]
#10675	Fix merge conflict with branch-24.04 [skip ci]
#10678	Append new authorized user to blossom-ci whitelist [skip ci]
#10662	Audit script - Check commits from shuffle and storage directories [skip ci]
#10655	Update rapids jni/private dependency to 24.06
#10652	Substitute murmurHash32 for spark32BitMurmurHash3

Release 24.04

Features


#10263	[FEA] Add support for reading JSON containing structs where rows are not consistent
#10436	[FEA] Move Spark 3.5.1 out of snapshot once released
#10430	[FEA] Error out when running on an unsupported GPU architecture
#9750	[FEA] Review `JsonToStruct` and `JsonScan` and consolidate some testing and implementation
#8680	[AUDIT][SPARK-42779][SQL] Allow V2 writes to indicate advisory shuffle partition size
#10429	[FEA] Drop support for Databricks 10.4 ML LTS
#10334	[FEA] Turn on memory limits for parquet reader
#10344	[FEA] support barrier mode for mapInPandas/mapInArrow

Performance


#10578	[FEA] Support project expression rewrite for the case `stringinstr(str_col, substr) > 0` to `contains(str_col, substr)`
#10570	[FEA] See if we can optimize sort for a single batch
#10531	[FEA] Support "WindowGroupLimit" optimization on GPU for Databricks 13.3 ML LTS+
#5553	[FEA][Audit] - Push down StringEndsWith/Contains to Parquet
#8208	[FEA][AUDIT][SPARK-37099][SQL] Introduce the group limit of Window for rank-based filter to optimize top-k computation
#10249	[FEA] Support common subexpression elimination for expand operator
#10301	[FEA] Improve performance of from_json
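
The project-expression rewrite in #10578 relies on a simple equivalence: a 1-based `instr` result greater than zero means the substring is present. The sketch below shows that equivalence in plain Python. It assumes `instr` returns 0 when the substring is absent, ignores null handling, and uses hypothetical helper names rather than any Spark or plugin API.

```python
def instr(text, sub):
    """1-based position of the first occurrence of `sub` in `text`, 0 if absent.

    Hypothetical stand-in for a SQL-style instr; not a library function.
    """
    return text.find(sub) + 1


def contains(text, sub):
    """True if `sub` occurs anywhere in `text`."""
    return sub in text


# The predicate `instr(...) > 0` and `contains(...)` agree on every pair.
pairs = [("spark-rapids", "rapids"), ("spark-rapids", "gpu"), ("", "a"), ("abc", "")]
for text, sub in pairs:
    assert (instr(text, sub) > 0) == contains(text, sub)
```

Because only the boolean outcome is needed, the position computed by `instr` can be skipped entirely, which is the point of the rewrite.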

Bugs Fixed


#10700	[BUG] get_json_object cannot handle ints or boolean values
#10645	[BUG] java.lang.IllegalStateException: Expected to only receive a single batch
#10665	[BUG] Need to update private jar's version to v24.04.1 for spark-rapids v24.04.0 release
#10589	[BUG] ZSTD version mismatch in integration tests
#10255	[BUG] parquet_tests are skipped on Dataproc CI
#10624	[BUG] Deploy script "gpg:sign-and-deploy-file failed: 401 Unauthorized"
#10631	[BUG] pending `BlockState` leaks blocks if the shuffle read doesn't finish successfully
#10349	[BUG] Test in json_test.py failed: test_from_json_struct_decimal
#9033	[BUG] GpuGetJsonObject does not expand escaped characters
#10216	[BUG] GetJsonObject fails at spark unit test $.store.book[*].reader
#10217	[BUG] GetJsonObject fails at spark unit test $.store.basket[0][*].b
#10537	[BUG] GetJsonObject throws exception when json path contains a name starting with `'`
#10194	[BUG] GetJsonObject does not validate the input is JSON in the same way as Spark
#10196	[BUG] GetJsonObject does not process escape sequences in returned strings or queries
#10212	[BUG] GetJsonObject should return null for invalid query instead of throwing an exception
#10218	[BUG] GetJsonObject does not normalize non-string output
#10591	[BUG] `test_column_add_after_partition` failed on EGX Standalone cluster
#10277	Add monitoring for GH action deprecations
#10627	[BUG] Integration tests FAILED on: "nvCOMP 2.3/2.4 or newer is required for Zstandard compression"
#10585	[BUG] Test simple pinned blocking alloc Failed nightly tests
#10586	[BUG] YARN EGX IT build failing parquet_testing_test can't find file
#10133	[BUG] test_hash_reduction_collect_set_on_nested_array_type failed in a distributed environment
#10378	[BUG] `test_range_running_window_float_decimal_sum_runs_batched` fails intermittently
#10486	[BUG] StructsToJson does not fall back to the CPU for unsupported timeZone options
#10484	[BUG] JsonToStructs does not fallback when columnNameOfCorruptRecord is set
#10460	[BUG] JsonToStructs should reject float numbers for integer types
#10468	[BUG] JsonToStructs and ScanJson should not treat quoted strings as valid integers
#10470	[BUG] ScanJson and JsonToStructs should support parsing quoted decimal strings that are formatted by locale (at least for en-US)
#10494	[BUG] JsonToStructs parses INF wrong when nonNumericNumbers is enabled
#10456	[BUG] allowNonNumericNumbers OFF supported for JSON Scan, but not JsonToStructs
#10467	[BUG] JsonToStructs should reject 1. as a valid number
#10469	[BUG] ScanJson should accept "1." as a valid Decimal
#10559	[BUG] test_spark_from_json_date_with_format FAILED on : Part of the plan is not columnar class org.apache.spark.sql.execution.ProjectExec
#10209	[BUG] Test failure hash_aggregate_test.py::test_hash_reduction_collect_set_on_nested_array_type DATAGEN_SEED=1705515231
#10319	[BUG] Shuffled join OOM with 4GB of GPU memory
#10507	[BUG] regexp_test.py FAILED test_regexp_extract_all_idx_positive[DATAGEN_SEED=1709054829, INJECT_OOM]
#10527	[BUG] Build on Databricks failed with GpuGetJsonObject.scala:19: object parsing is not a member of package util
#10509	[BUG] scalar leaks when running nds query51
#10214	[BUG] GetJsonObject does not support unquoted array like notation
#10215	[BUG] GetJsonObject removes leading space characters
#10213	[BUG] GetJsonObject supports array index notation without a root
#10452	[BUG] JsonScan and from_json share fallback checks, but have hard coded names in the results
#10455	[BUG] JsonToStructs and ScanJson do not fall back/support it properly if single quotes are disabled
#10219	[BUG] GetJsonObject sees a double quote in a single quoted string as invalid
#10431	[BUG] test_casting_from_overflow_double_to_timestamp `DID NOT RAISE <class 'Exception'>`
#10499	[BUG] Unit tests core dump as below
#9325	[BUG] test_csv_infer_schema_timestamp_ntz fails
#10422	[BUG] test_get_json_object_single_quotes failure
#10411	[BUG] Some fast parquet tests fail if the time zone is not UTC
#10410	[BUG] delta_lake_update_test.py::test_delta_update_partitions[['a', 'b']-False] failed by DATAGEN_SEED=1707683137
#10404	[BUG] GpuJsonTuple memory leak
#10382	[BUG] Compile failed on branch-24.04 : literals.scala:32: object codec is not a member of package org.apache.commons

PRs


#10844	Update rapids private dependency to 24.04.3
#10788	[DOC] Update archive page for v24.04.1 [skip ci]
#10784	Update latest changelog [skip ci]
#10782	Update latest changelog [skip ci]
#10780	[DOC] Update download page for v24.04.1 [skip ci]
#10778	Update version to 24.04.1-SNAPSHOT
#10777	Update rapids JNI dependency: private to 24.04.2
#10683	Update latest changelog [skip ci]
#10681	Update rapids JNI dependency to 24.04.0, private to 24.04.1
#10660	Ensure an executor broadcast is in a single batch
#10676	[DOC] Update docs for 24.04.0 release [skip ci]
#10654	Add a config to switch back to old impl for getJsonObject
#10667	Update rapids private dependency to 24.04.1
#10664	Remove build link from the premerge-CI workflow
#10657	Revert "Host Memory OOM handling for RowToColumnarIterator (#10617)"
#10625	Pin to 3.1.0 maven-gpg-plugin in deploy script [skip ci]
#10637	Cleanup async state when multi-threaded shuffle readers fail
#10617	Host Memory OOM handling for RowToColumnarIterator
#10614	Use random seed for `test_from_json_struct_decimal`
#10581	Use new jni kernel for getJsonObject
#10630	Fix removal of internal metadata information in 350 shim
#10623	Auto merge PRs to branch-24.06 from branch-24.04 [skip ci]
#10616	Pass metadata extractors to FileScanRDD
#10620	Remove unused shared lib in Jenkins files
#10615	Turn off state logging in HostAllocSuite
#10610	Do not replace TableCacheQueryStageExec
#10599	Call globStatus directly via PY4J in hdfs_glob to avoid calling hadoop command
#10602	Remove InMemoryTableScanExec support for Spark 3.5+
#10608	Update perfio.s3.enabled doc to fix build failure [skip ci]
#10598	Update CI script to build and deploy using the same CUDA classifier [skip ci]
#10575	Update JsonToStructs and ScanJson to have white space normalization
#10597	add guardword to hide cloud info
#10540	Handle minimum GPU architecture supported
#10584	Add in small optimization for instr comparison
#10590	Turn on transition logging in HostAllocSuite
#10572	Improve performance of Sort for the common single batch use case
#10568	Add configuration to share JNI pinned pool with cuIO
#10550	Enable window-group-limit optimization on
#10542	Make JSON parsing common between JsonToStructs and ScanJson
#10562	Fix test_spark_from_json_date_with_format when run in a non-UTC TZ
#10564	Enable specifying specific integration test methods via TESTS environment
#10563	Append new authorized user to blossom-ci safelist [skip ci]
#10520	Distinct left join
#10538	Move K8s cloud name into common lib for Jenkins CI
#10552	Fix issues when no value can be extracted from a regular expression
#10522	Fix missing scala-parser-combinators dependency on Databricks
#10549	Update to latest branch-24.02 [skip ci]
#10544	Fix merge conflict from branch-24.02
#10503	Distinct inner join
#10512	Move to parsing from_json input preserving quoted strings.
#10528	Fix auto merge conflict 10523
#10519	Replicate HostColumnVector.ColumnBuilder in plugin to enable host memory oom work
#10521	Fix Spark 3.5.1 build
#10516	One more metric for expand
#10500	Support "WindowGroupLimit" optimization on GPU
#10508	Move 351 shims into noSnapshot buildvers
#10510	Fix scalar leak in SumBinaryFixer
#10466	Use parser from spark to normalize json path in GetJsonObject
#10490	Start working on a more complete json test matrix json
#10497	Add minValue overflow check in ORC double-to-timestamp cast
#10501	Fix scalar leak in WindowRetrySuite
#10474	Remove Support for Databricks 10.4
#10418	Enable GpuShuffledSymmetricHashJoin by default
#10450	Improve internal row to columnar host memory by using a combined spillable buffer
#10440	Generate CSV data per Spark version for tools
#10449	[DOC] Fix table rendering issue in github.io download UI page [skip ci]
#10438	Integrate perfio.s3 reader
#10423	Disable Integration Test:`test_get_json_object_single_quotes` on DB 10.4
#10419	Export TZ in tests when default TZ is used
#10426	Fix auto merge conflict 10425 [skip ci]
#10427	Update test doc for 24.04 [skip ci]
#10396	Remove inactive user from github workflow [skip ci]
#10421	Use withRetry when manifesting spillable batch in GpuShuffledHashJoinExec
#10420	Disable JsonTuple by default
#10407	Enable Single Quote Support in getJSONObject API with GetJsonObjectOptions
#10415	Avoid comparing Delta logs when writing partitioned tables
#10247	Improve `GpuExpand` by pre-projecting some columns
#10248	Group-by aggregation based optimization for UNBOUNDED `collect_set` window function
#10406	Enabled subPage chunking by default
#10361	Add in basic support for JSON generation in BigDataGen and improve performance of from_json
#10158	Add in framework for unbounded to unbounded window agg optimization
#10394	Fix auto merge conflict 10393 [skip ci]
#10375	Support barrier mode for mapInPandas/mapInArrow
#10356	Update locate_parquet_testing_files function to support hdfs input path for dataproc CI
#10369	Revert "Support barrier mode for mapInPandas/mapInArrow (#10364)"
#10358	Disable Spark UI by default for integration tests
#10360	Fix a memory leak in json tuple
#10364	Support barrier mode for mapInPandas/mapInArrow
#10348	Remove redundant joinOutputRows metric
#10321	Bump up dependency version to 24.04.0-SNAPSHOT
#10330	Add tryAcquire to GpuSemaphore
#10258	Init project version 24.04.0-SNAPSHOT

Older Releases

Changelog of older releases can be found at docs/archives
