diff --git a/_posts/2024-07-16-17.0.0-release.md b/_posts/2024-07-16-17.0.0-release.md new file mode 100644 index 000000000000..ffc29694b7f8 --- /dev/null +++ b/_posts/2024-07-16-17.0.0-release.md @@ -0,0 +1,248 @@ +--- +layout: post +title: "Apache Arrow 17.0.0 Release" +date: "2024-07-16 00:00:00" +author: pmc +categories: [release] +--- + + + +The Apache Arrow team is pleased to announce the 17.0.0 release. This covers +over 3 months of development work and includes [**331 resolved issues**][1] +on [**529 distinct commits**][2] from [**92 distinct contributors**][2]. +See the [Install Page](https://arrow.apache.org/install/) +to learn how to get the libraries for your platform. + +The release notes below are not exhaustive and only expose selected highlights +of the release. Many other bugfixes and improvements have been made: we refer +you to the [complete changelog][3]. + +## Community + +Since the 16.0.0 release, Dane Pitkin has been invited to be committer. +No new members have joined the Project Management Committee (PMC). + +Thanks for your contributions and participation in the project! + +## Linux packages notes + +- We dropped support for Debian GNU/Linux bullseye + +## C Data Interface notes + +- `ArrowDeviceArrayStream` can now be imported and exported (GH-40078) + +## Arrow Flight RPC notes + +- Flight SQL was formally stabilized (GH-39204). +- Flight SQL added a bulk ingestion command (GH-38255). +- The JDBC Flight SQL driver now accepts "catalog" as a connection parameter (GH-41947). +- "Stateless" prepared statements are now supported (GH-37220, GH-41262). +- Java added `FlightStatusCode.RESOURCE_EXHAUSTED` (GH-35888). +- C++ has some basic support for logging with OpenTelemetry (GH-39898). + +## C++ notes + +For C++ notes refer to the full changelog. + +### Highlights + +- Half-float values can now be parsed and formatted correctly (GH-41089). +- Record batches can now be converted to row-major tensors, not only column-major (GH-40866). +- The CSV writer is now able to write large string arrays that are larger than + 2 GiB (GH-40270). +- A possible invalid memory access in `BooleanArray.true_count()` has been fixed (GH-41016). +- A new method `FlattenRecursively` allows recursive nesting of list and + fixed-size list arrays (GH-41055). +- The scratch space in some `Scalar` subclasses is now immutable. This is required + for proper concurrent access to `Scalar` instances (GH-40069). +- Calling the `bit_width` or `byte_width` method of an extension type now defers + to the underlying storage type (GH-41353). +- Fixed a bug where `MapArray::FromArrays` would behave incorrectly if the given + offsets array has a non-zero offset (GH-40750). +- `MapArray::FromArrays` now accepts an optional null bitmap argument + (GH-41684). +- The `ARROW_NO_DEPRECATED_API` macro was unused and has been removed (GH-41343). + +### Acero + +- The left anti join filter no longer crashes when the filter rows are empty (GH-41121). +- A race condition was fixed in the asof join (GH-41149). +- A potential stack overflow has been fixed (GH-41334, GH-41738). +- A potential crash on very large data has been fixed (GH-41813). +- Asof join and sort merge join now support single threaded mode (GH-41190). + +### Compute + +- List views and maps are now supported by the `if_else`, `case_when` and + `coalesce` functions (GH-41418). +- List views are now supported by the functions `list_slice` (GH-42065), + `list_parent_indices` (GH-42235), `take` and `filter` (GH-42116). +- `list_flatten` can now be recursive based on new optional argument + (GH-41183, GH-41055) +- The `take` and `filter` functions have been made significantly faster on fixed-width + types, including fixed-size lists of fixed-width types (GH-39798). + +### Dataset + +- Repeated scanning of an encrypted Parquet dataset now works correctly (GH-41431). + +### Filesystems + +- Standard filesystem implementations are now tracked in a global registry which + also allows loading third-party filesystem implementations, for example from + runtime-loaded DLLs (GH-40342, +- Directory metadata operations on Azure filesystems are now more aligned with + the common expectations for filesystems (GH-41034). +- `CopyFile` is now supported for Azure filesystems with hierarchical namespace + enabled (GH-41095). +- Azure credentials can now be loaded explicitly from the environment (GH-39345), + or using the Azure CLI (GH-39344). +- A potential deadlock was fixed when closing an S3 output stream (GH-41862). + +### GPU + +- Non-CPU data can now be pretty-printed (GH-41664). +- Non-CPU data with offsets, such as list and binary data, can now be properly + sent over IPC (GH-42198). + +### IPC + +- Flatbuffers serialization is now more deterministic (GH-40361). + +### Parquet + +- A crash was fixed when reading an invalid Parquet file where columns claim to + be of different lengths (GH-41317). +- Definition and repetition levels are now more strictly checked, avoiding later + crashes when reading an invalid Parquet file (GH-41321). +- A crash was fixed when reading an invalid encrypted Parquet file (GH-43070). +- Fixed a bug where `DeltaLengthByteArrayEncoder::EstimatedDataEncodedSize` could + return an invalid estimate in some situations (GH-41545). +- Delimiting records is now faster for columns with nested repeating (GH-41361). + +### Substrait + +- Support for more Arrow data types was added: some temporal types, half floats, + large string and large binary (GH-40695). + +## C# notes + +- The performance of building Decimal arrays using SqlDecimal values was improved for .NET 7+ (GH-41349) +- Scalar arrays now implement `ICollection` (GH-38692) +- Concatenating arrays with a non-zero offset with ArrowArrayConcatenator was fixed (GH-41164) +- Concatenating union arrays with ArrowArrayConcatenator was fixed (GH-41198) +- Accessing values of decimal arrays with a non-zero offset was fixed (GH-41199) + +## Go Notes + +### Bug Fixes + +#### Arrow + +- Prevent exposure of invalid Go pointers in CGO code ([GH-43062](https://github.com/apache/arrow/issues/43062)) +- Fix memory leak for 0-length C array imports ([GH=41534](https://github.com/apache/arrow/issues/41534)) +- Ensure statement handle is updated so stateless prepared statements work properly ([GH-41427](https://github.com/apache/arrow/issues/41427)) + +#### Parquet + +- Fix memory leak in BufferedPageWriter ([GH-41697](https://github.com/apache/arrow/issues/41697)) +- Fix performance regression in PooledBufferWriter ([GH-41541](https://github.com/apache/arrow/issues/41541)) + +### Enhancements + +#### Arrow + +- Arrow Schemas and Records can now be created from Protobuf messages ([GH-40494](https://github.com/apache/arrow/issues/40494)) + +#### Parquet + +- Performance improvement for BitWriter VlqInt ([GH-41160](https://github.com/apache/arrow/pull/41160)) + +## Java notes + +**Some changes are coming up in the next version, Arrow 18. Java 8 support will be removed. The version of the flight-core artifact with shaded gRPC will no longer be distributed.** + +- Basic support for ListView (GH-41287) and StringView (GH-40339) has been added. These types should still be considered experimental. + +## JavaScript notes + +- General maintenance. Clean up packaging ([GH-39722](https://github.com/apache/arrow/issues/39722)), update dependencies ([GH-41905](https://github.com/apache/arrow/issues/41905)). + +## Python notes + +### Compatibility notes: +* To ensure Python 3.13 compatibility, _Py_IsFinalizing has been replaced with a public API (GH-41475). +* The C Data Interface now supports CUDA devices (GH-40384). + +### New features: +* Added support for Emscripten via Pyodide (GH-41910). + +### Other improvements: +* The ParquetWriter added the store_decimal_as_integer option (GH-42168). +* The Float16 logical type is supported in Parquet (GH-42016). +* Exposed bit_width and byte_width to extension types (GH-41389). +* Added bindings for Device and MemoryManager classes (GH-41126). +* The PyCapsule interface now exposes the device interface (GH-38325). +* Various PyArrow APIs have been updated to work with non-CPU architectures gracefully. (GH-42112, GH-41664, GH-41662, + +### Relevant bug fixes: +* Fixed Numpy 2.0 compatibility issues (GH-42170, GH-41924). +* Fixed sporadic as_of join failures (GH-41149). +* Fixed a bug in RecordBatch.filter() when passing a ChunkedArray, which would cause a segfault (GH-38770). +* Fixed a bug in RecordBatch.from_arrays() when passing a storage array, which would cause a segfault (GH-37669). +* Fixed a bug where constructing a MapArray from Array could drop nulls (GH-41684). +* FIxed a bug where RunEndEncodedArray.from_arrays fails if run_ends are pyarrow.Array (GH-40560). +* FIxed a regression introduced in PyArrow v16 in RecordBatchReader.cast() (GH-41884). + +## R notes + +* R functions that users write that use functions that Arrow supports in dataset queries now can be used in queries too. Previously, only functions that used arithmetic operators worked. For example, `time_hours <- function(mins) mins / 60` worked, but `time_hours_rounded <- function(mins) round(mins / 60)` did not; now both work. These are automatic translations rather than true user-defined functions (UDFs); for UDFs, see `register_scalar_function()`. [GH-41223](https://github.com/apache/arrow/issues/41223) +* `mutate()` expressions can now include aggregations, such as `x - mean(x)`. [GH-41350](https://github.com/apache/arrow/pull/41350) +* `summarize()` supports more complex expressions, and correctly handles cases where column names are reused in expressions. [GH-41323](https://github.com/apache/arrow/issues/41323) +* The `na_matches` argument to the `dplyr::*_join()` functions is now supported. This argument controls whether `NA` values are considered equal when joining. [GH-41223](https://github.com/apache/arrow/issues/41358) +* R metadata, stored in the Arrow schema to support round-tripping data between R and Arrow/Parquet, is now serialized and deserialized more strictly. This makes it safer to load data from files from unknown sources into R data.frames. [GH-41223](https://github.com/apache/arrow/issues/41969) + +For more on what’s in the 17.0.0 R package, see the [R changelog][4]. + +## Ruby and C GLib notes + +### Ruby + +- Improved `Arrow::Table#to_s` format + - This is a breaking change + +### C GLib + +- Added support for Microsoft Visual C++ +- Added `gadataset_dataset_to_record_batch_reader()` + +## Rust notes + +The Rust projects have moved to separate repositories outside the +main Arrow monorepo. For notes on the latest release of the Rust +implementation, see the latest [Arrow Rust changelog][5]. + +[1]: https://github.com/apache/arrow/milestone/62?closed=1 +[2]: {{ site.baseurl }}/release/17.0.0.html#contributors +[3]: {{ site.baseurl }}/release/17.0.0.html#changelog +[4]: {{ site.baseurl }}/docs/r/news/ +[5]: https://github.com/apache/arrow-rs/tags