diff --git a/_posts/2024-10-07-nanoarrow-0.6.0-release.md b/_posts/2024-10-07-nanoarrow-0.6.0-release.md new file mode 100644 index 000000000000..3496763d5f21 --- /dev/null +++ b/_posts/2024-10-07-nanoarrow-0.6.0-release.md @@ -0,0 +1,289 @@ +--- +layout: post +title: "Apache Arrow nanoarrow 0.6.0 Release" +date: "2024-10-07 00:00:00" +author: pmc +categories: [release] +--- + + +The Apache Arrow team is pleased to announce the 0.6.0 release of +Apache Arrow nanoarrow. This release covers 114 resolved issues from +10 contributors. + +## Release Highlights + +- Run End Encoding support +- StringView support +- IPC Write support +- DLPack/device support +- IPC/Device available from CMake/Meson as feature flags + +See the +[Changelog](https://github.com/apache/arrow-nanoarrow/blob/apache-arrow-nanoarrow-0.6.0/CHANGELOG.md) +for a detailed list of contributions to this release. + +## Breaking Changes + +Most changes included in the nanoarrow 0.6.0 release will not break downstream +code; however, two changes with respect to packaging and distribution may require +users to update the code used to bring nanoarrow in as a dependency. + +In nanoarrow 0.5.0 and earlier, the bundled single-file amalgamation was included in +the `dist/` subdirectory or could be generated using a specially-crafted CMake +command. The nanoarrow 0.6.0 release removes the pre-compiled includes and migrates +the code used to generate it to Python. This setup is less confusing for contributors +(whose editors would frequently jump into the wrong `nanoarrow.h`) and is a less confusing +use of CMake. Users can generate the `dist/` subdirectory as it previously existed +with: + +``` shell +python ci/scripts/bundle.py \ + --source-output-dir=dist \ + --include-output-dir=dist \ + --header-namespace= \ + --with-device \ + --with-ipc \ + --with-testing \ + --with-flatcc +``` + +Second, the Arrow IPC and ArrowDeviceArray implementations previously lived in the `extensions/` +subdirectory of the repository. This was helpful during the initial development of these +features; however, the nanoarrow 0.6.0 release added the requisite feature coverage and testing +such that the appropriate home for them is now the main `src/` directory. As such, one +can now build nanoarrow with IPC and/or device support using: + +``` shell +cmake -S . -B build -DNANOARROW_IPC=ON -DNANOARROW_DEVICE=ON +``` + +## Features + +### Float16, StringView, and REE Support + +The nanoarrow 0.6.0 release adds support for Arrow's float16 (half float), string view, +and run-end encoding support. The C library supports building float16 arrays using +`ArrowArrayAppendDouble()` and supports building string view and binary view arrays +using `ArrowArrayAppendString()` and/or `ArrowArrayAppendBytes()` and supports consuming +using `ArrowArrayViewGetStringUnsafe()` and/or `ArrowArrayViewGetBytesUnsafe()`. R and +Python users can request a string view or float16 type when building an array, and +conversion back to R/Python strings is suppored. + +``` python +# pip install nanoarrow +# conda install nanoarrow -c conda-forge +import nanoarrow as na + +na.Array(["abc", "def", None], na.string_view()) +#> nanoarrow.Array[3] +#> 'abc' +#> 'def' +#> None +na.Array([1, 2, 3], na.float16()) +#> nanoarrow.Array[3] +#> 1.0 +#> 2.0 +#> 3.0 +``` + +``` r +# install.packages("nanoarrow") +library(nanoarrow) + +as_nanoarrow_array(c("abc", "def", NA), schema = na_string_view()) |> + convert_array() +#> [1] "abc" "def" NA +as_nanoarrow_array(c(1, 2, 3), schema = na_half_float()) |> + convert_array() +#> [1] 1 2 3 +``` + +Creating/consuming run-end encoding arrays by element is not yet +supported in C, R, or Python; however, arrays can be built or consumed by assembling +the correct array/buffer structure in C. + +Thank you to [cocoa-xu](https://github.com/cocoa-xu) for adding float16 and run-end encoding +support and thank you to [WillAyd](https://github.com/WillAyd) for adding string view support! + +### IPC Write Support + +The nanoarrow library has supported reading +[Arrow IPC streams](https://arrow.apache.org/docs/format/Columnar.html) +since 0.4.0; however, could not write streams of its own. The nanoarrow 0.6.0 release adds +support for stream writing from C using the `ArrowIpcWriter` and stream writing +from R and Python: + +```python +import io +import nanoarrow as na +from nanoarrow import ipc + +out = io.BytesIO() +with ipc.StreamWriter.from_writable(out) as writer: + writer.write_stream(ipc.InputStream.example()) + +out.seek(0) +na.ArrayStream.from_readable(out).read_all() +#> nanoarrow.Array>[3] +#> {'some_col': 1} +#> {'some_col': 2} +#> {'some_col': 3} +``` + +``` r +library(nanoarrow) + +tf <- tempfile() +nycflights13::flights |> write_nanoarrow(tf) + +read_nanoarrow(tf) |> tibble::as_tibble() +#> # A tibble: 336,776 × 19 +#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time +#> +#> 1 2013 1 1 517 515 2 830 819 +#> 2 2013 1 1 533 529 4 850 830 +#> 3 2013 1 1 542 540 2 923 850 +#> 4 2013 1 1 544 545 -1 1004 1022 +#> 5 2013 1 1 554 600 -6 812 837 +#> 6 2013 1 1 554 558 -4 740 728 +#> 7 2013 1 1 555 600 -5 913 854 +#> 8 2013 1 1 557 600 -3 709 723 +#> 9 2013 1 1 557 600 -3 838 846 +#> 10 2013 1 1 558 600 -2 753 745 +#> # ℹ 336,766 more rows +#> # ℹ 11 more variables: arr_delay , carrier , flight , +#> # tailnum , origin , dest , air_time , distance , +#> # hour , minute , time_hour +``` + +As a result of the IPC write support, nanoarrow now joins the Arrow IPC integration tests +to ensure compatability across implementations. With the exception of +[arrow-rs due to a bug in the Rust flatbuffers implementation](https://github.com/apache/arrow-rs/issues/5052), +nanoarrow is now tested against all participating Arrow implementations with every commit. + +A huge thank you to [bkietz](https://github.com/bkietz) for implementing this support and +the tests (which included multiple bugfixes and identification of inconsistencies of +flatbuffer verification in C, Rust, and C++!). + +### DLPack/CUDA Support + +The nanoarrow 0.6.0 release includes improved support for the +[Arrow C Device data interface](https://arrow.apache.org/docs/format/CDeviceDataInterface.html). +In particular, the CUDA device implementation was improved to more efficiently coordinate +synchronization when copying arrays to/from the GPU and migrated to use the driver API +for wider compatibility. The nanoarrow Python bindings have limited support for creating +`ArrowDeviceArray` wrappers that implement the +[`__arrow_c_device_array__` protocol](https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html#export-protocol) +from anything that implements DLPack: + +``` python +# Currently requires: +# export NANOARROW_PYTHON_CUDA=/usr/local/cuda +# pip install --force-reinstall --no-binary=":all:" nanoarrow +import nanoarrow as na +from nanoarrow import device +import cupy as cp + +device.c_device_array(cp.array([1, 2, 3])) +#> +#> - device_type: CUDA <2> +#> - device_id: 0 +#> - array: +#> - length: 3 +#> - offset: 0 +#> - null_count: 0 +#> - buffers: (0, 133980798058496) +#> - dictionary: NULL +#> - children[0]: + +darray = device.c_device_array(cp.array([1, 2, 3])) +cp.from_dlpack(darray.array.view().buffer(1)) +#> array([1, 2, 3]) +``` + +Thank you to [AlenkaF](https://github.com/AlenkaF), [shwina](https://github.com/shwina), +and [danepitkin](https://github.com/danepitkin) for their contributions to and +review of this feature! + +### Build System Support for IPC/Device + +Lastly, the CMake build system was refactored to enable `FetchContent` to +work in an even wider variety of +[develop/build/install scenarios](https://github.com/apache/arrow-nanoarrow/tree/main/examples/cmake-scenarios). In most cases, CMake-based projects should be able +to add the nanoarrow C library with device and/or IPC support as a dependency with: + +``` cmake +include(FetchContent) + +# If required: +# set(NANOARROW_IPC ON) +# set(NANOARROW_DEVICE ON) +fetchcontent_declare(nanoarrow + URL "https://www.apache.org/dyn/closer.lua?action=download&filename=arrow/nanoarrow-0.6.0/apache-arrow-0.6.0.tar.gz") +fetchcontent_makeavailable(nanoarrow) + +add_executable(some_target ...) +target_link_libraries(some_target + PRIVATE + nanoarrow::nanoarrow + # If needed + # nanoarrow::nanoarrow_ipc + # nanoarrow::nanoarrow_device + ) +``` + +Linking against nanoarrow installed via `cmake --install` and located +via `find_package()` is also supported. + +Users of the Meson build system can install the latest nanoarrow with: + +``` shell +mkdir subprojects +meson wrap install nanoarrow +``` + +...and declared as a dependency with: + +``` shell +nanoarrow_dep = dependency('nanoarrow') +example_exec = executable('example_meson_minimal_app', + 'src/app.cc', + dependencies: [nanoarrow_dep]) +``` + +## Contributors + +This release consists of contributions from 10 contributors in addition +to the invaluable advice and support of the Apache Arrow community. + +```console +$ git shortlog -sn apache-arrow-nanoarrow-0.6.0.dev..apache-arrow-nanoarrow-0.6.0 | grep -v "GitHub Actions" + 64 Dewey Dunnington + 19 William Ayd + 16 Benjamin Kietzman + 5 Cocoa + 2 Abhishek Singh + 1 Ashwin Srinath + 1 Dane Pitkin + 1 Jacob Wujciak-Jens + 1 Matt Topol + 1 Tao Zuhong +```