[RELEASE] cudf v24.12 #17406

raydouglass · 2024-11-21T20:50:02Z

❄️ Code freeze for `branch-24.12` and v24.12 release

What does this mean?

Only critical/hotfix level issues should be merged into branch-24.12 until release (merging of this PR).

What is the purpose of this PR?

Update documentation
Allow testing for the new release
Enable a means to merge branch-24.12 into main for the release

Contributes to #15162 Authors: - Matthew Roeschke (https://github.com/mroeschke) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) URL: #16994

Add empty string column condition for write_json bypass make_strings_children for empty column because when grid size is zero, it throws cuda error. Authors: - Karthikeyan (https://github.com/karthikeyann) - Muhammad Haseeb (https://github.com/mhaseeb123) Approvers: - David Wendt (https://github.com/davidwendt) - Muhammad Haseeb (https://github.com/mhaseeb123) URL: #16995

This PR removes an unused unused import in cudf which was causing errors in doc builds. Authors: - Matthew Murray (https://github.com/Matt711) Approvers: - Bradley Dice (https://github.com/bdice) URL: #17005

This PR adds two new jobs to the project automations. One to extract the version number from the branch name, and one to set the project `Release` field to the version found. Authors: - Ben Jarmak (https://github.com/jarmak-nv) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) URL: #17001

With this set of changes I get a clean run of clang-tidy (with one caveat that I'll explain in the follow-up PR to add clang-tidy to pre-commit/CI). Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Nghia Truong (https://github.com/ttnghia) - MithunR (https://github.com/mythrocks) - David Wendt (https://github.com/davidwendt) - Kyle Edwards (https://github.com/KyleFromNVIDIA) URL: #16956

Closes #16735 Authors: - https://github.com/brandon-b-miller - Lawrence Mitchell (https://github.com/wence-) Approvers: - Matthew Murray (https://github.com/Matt711) - Lawrence Mitchell (https://github.com/wence-) - Bradley Dice (https://github.com/bdice) URL: #16776

Apart of #15162 Authors: - Matthew Murray (https://github.com/Matt711) Approvers: - Bradley Dice (https://github.com/bdice) URL: #17006

Everything in the expression evaluation now operates on columns without names. DataFrame construction takes either a mapping from string-valued names to columns, or a sequence of pairs of names and columns. This removes some duplicate code in the NamedColumn class (by removing it) where we had to fight the inheritance hierarchy. - Closes #16272 Authors: - Lawrence Mitchell (https://github.com/wence-) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) - Matthew Murray (https://github.com/Matt711) URL: #16962

We use the pairwise approach of Chan, Golub, and LeVeque (1983). - Closes #16444 Authors: - Lawrence Mitchell (https://github.com/wence-) Approvers: - Bradley Dice (https://github.com/bdice) - Robert (Bobby) Evans (https://github.com/revans2) URL: #16448

The cudf tests already treat tests that are expected to fail but pass as errors, but at the time we introduced that change, we didn't do the same for the other packages. Do that now, it turns out there are only a few xpassing tests. While here, it turns out that having multiple different pytest configuration files does not work. `pytest.ini` takes precedence over other options, and it's "first file wins". Consequently, the merge of #16851 turned off `xfail_strict = true` (and other options) for many of the subpackages. To fix this, migrate all pytest configuration into the appropriate section of the `pyproject.toml` files, so that all tool configuration lives in the same place. We also add a section in the developer guide to document this choice. - Closes #12391 - Closes #16974 Authors: - Lawrence Mitchell (https://github.com/wence-) Approvers: - James Lamb (https://github.com/jameslamb) - Matthew Roeschke (https://github.com/mroeschke) - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #16977

As part of JSON validation, field, value and string tokens are validated. Right now the code has single transform_inclusive_scan. Since this transform functor is a heavy operation, it slows down the entire scan drastically. This PR splits transform and scan in validation. The runtime of validation went from 200ms to 20ms. Also, a few hardcoded string comparisons are moved to trie. Authors: - Karthikeyan (https://github.com/karthikeyann) Approvers: - Nghia Truong (https://github.com/ttnghia) - Vukasin Milovanovic (https://github.com/vuule) - Robert (Bobby) Evans (https://github.com/revans2) URL: #16996

Apart of #15162 Authors: - Matthew Murray (https://github.com/Matt711) Approvers: - Matthew Roeschke (https://github.com/mroeschke) URL: #17007

Contributes to rapidsai/build-planning#106 Proposes specifying the RAPIDS version in `conda install` calls in CI that install CI artifacts, to reduce the risk of CI jobs picking up artifacts from other releases. Authors: - James Lamb (https://github.com/jameslamb) Approvers: - Bradley Dice (https://github.com/bdice) URL: #17013

Contributes to #15162 Also I believe the cpp docstrings were incorrect, but could use a second look. Authors: - Matthew Roeschke (https://github.com/mroeschke) Approvers: - https://github.com/brandon-b-miller - Nghia Truong (https://github.com/ttnghia) - David Wendt (https://github.com/davidwendt) URL: #17003

We have recently observed a number of seg faults in our Python tests. From some investigation, the error comes from the import of pyarrow loading the bundled libarrow.so, and in particular when that library runs a jemalloc function `background_thread_entry`. We have observed similar (but not identical) errors in the past that have to do with as-yet unsolved problems in the way that arrow handles multi-threaded environments. The error is currently only observed on arm runners and with pyarrow 17.0.0. In my tests the error is highly sensitive to everything from import order to unrelated code segments, suggesting a race condition, some form of memory corruption, or perhaps symbol resolution errors at runtime. As a result, I have had limited success in drilling down further into specific causes, especially since attempts to rebuild libarrow.so also squash the error and I therefore cannot use debug symbols. From some offline discussion we decided that avoiding the problematic version is a sufficient fix for now. Due to the sensitivity, I am simply skipping 17.0.0 in this PR. I suspect that future builds of pyarrow will also usually not exhibit this bug (although it may recur occasionally on specific versions of pyarrow). Therefore, rather than lowering the upper bound I would prefer to allow us to float and see if and when this problem reappears. Since our DFG+RBB combination for wheel builds does not yet support any matrix entry other than `cuda`, I'm using environment markers to specify the constraint rather than a matrix entry in dependencies.yaml. Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Bradley Dice (https://github.com/bdice) URL: #17018

…nd` (#16485) Refactors `histogram` reduce and groupby aggregations using `cuco::static_set::insert_and_find`. Speed improvement results [here](#16485 (comment)) and [here](#16485 (comment)). Authors: - Srinivas Yadav (https://github.com/srinivasyadav18) - Muhammad Haseeb (https://github.com/mhaseeb123) Approvers: - Yunsong Wang (https://github.com/PointKernel) - Nghia Truong (https://github.com/ttnghia) URL: #16485

…17026) the same issue as NVIDIA/spark-rapids-jni#2475 due to rapidsai/kvikio#464 Port the fix from NVIDIA/spark-rapids-jni#2476, verified locally Authors: - Peixin (https://github.com/pxLi) Approvers: - Nghia Truong (https://github.com/ttnghia) URL: #17026

Forward-merge branch-24.10 into branch-24.12

cuda::std::optional shouldn't be used for host types such as `std::vector` as it requires the constructors of the `T` types to be host+device. Authors: - Robert Maynard (https://github.com/robertmaynard) Approvers: - Bradley Dice (https://github.com/bdice) - MithunR (https://github.com/mythrocks) - Nghia Truong (https://github.com/ttnghia) URL: #17015

When instantiating a `cudf.pandas` proxy array, a DtoH transfer occurs so that the data buffer is set correctly. We do this because functions which utilize NumPy's C API can utilize the data buffer directly instead of going through `__array__`. This PR documents this limitation. Authors: - Matthew Murray (https://github.com/Matt711) Approvers: - Matthew Roeschke (https://github.com/mroeschke) - Vyas Ramasubramani (https://github.com/vyasr) URL: #16955

…17020) One of the `host_span` constructors was not updated when we added `is_device_accessible`, so the value was not assigned. This PR fixes this simple error and adds tests that checks that this property is correctly set when creating `host_span`s. Authors: - Vukasin Milovanovic (https://github.com/vuule) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) - Muhammad Haseeb (https://github.com/mhaseeb123) URL: #17020

Contributes to #15162 Authors: - Matthew Roeschke (https://github.com/mroeschke) Approvers: - https://github.com/brandon-b-miller URL: #16990

This PR updates all the RMM imports to use pylibrmm/librmm now that `rmm._lib` is deprecated . It should be merged after [rmm/1676](rapidsai/rmm#1676). Authors: - Matthew Murray (https://github.com/Matt711) Approvers: - Lawrence Mitchell (https://github.com/wence-) - Charles Blackmon-Luca (https://github.com/charlesbluca) URL: #16913

Fixes the libcudf regex parsing logic when handling nested fixed quantifiers. The logic handles fixed quantifiers by simple repeating the previous instruction. If the previous item is a group (capture or non-capture) that group may also contain an internal fixed quantifier as well. Found while working on #16730 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Bradley Dice (https://github.com/bdice) - Vyas Ramasubramani (https://github.com/vyasr) URL: #16798

Contributes to #15162 Authors: - Matthew Roeschke (https://github.com/mroeschke) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) URL: #16997

Contributes to #15162 Authors: - Matthew Roeschke (https://github.com/mroeschke) Approvers: - James Lamb (https://github.com/jameslamb) - Vyas Ramasubramani (https://github.com/vyasr) URL: #17025

…oint (#17048) Contributes to #15162 I don't think there are any types in this file that needs to be exposed on the Python side; they're just used internally in pylibcudf. Also moves this to `libcudf/fixed_point` matching the libcudf location more closely Authors: - Matthew Roeschke (https://github.com/mroeschke) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) URL: #17048

…7010) Contributes to #15162 ~I just assumed since the associated libcudf files just publicly expose C types, we just want to match the name spacing when importing from pylibcudf (avoid importing from `pylibcudf.libcudf`) and not necessary expose a Python equivalent?~ ~Let me know if I am misunderstanding how to expose these types.~ #17010 (comment) Authors: - Matthew Roeschke (https://github.com/mroeschke) Approvers: - Matthew Murray (https://github.com/Matt711) URL: #17010

Follow-up to #17013 Changes relative to that PR: * switches to pinning CI conda installs to the output of `rapids-version` (`{major}.{minor}.{patch}`) instead of `rapids-version-major-minor` (`{major}.{minor}`), to get a bit more protection in the presence of hotfix releases * restores some exporting of variables needed for docs builds I made some mistakes in #17013 (comment). Missed that this project's Doxygen setup is expecting to find `RAPIDS_VERSION` and `RAPIDS_VERSION_MAJOR_MINOR` defined in the environment. https://github.com/rapidsai/cudf/blob/7173b52fce25937bb69e22a083a5de4655078fa1/cpp/doxygen/Doxyfile#L41 https://github.com/rapidsai/cudf/blob/7173b52fce25937bb69e22a083a5de4655078fa1/cpp/doxygen/Doxyfile#L2229 Authors: - James Lamb (https://github.com/jameslamb) Approvers: - Kyle Edwards (https://github.com/KyleFromNVIDIA) URL: #17042

@wence-

Adding python bindings to [`cudf::pack()`](https://docs.rapids.ai/api/libcudf/legacy/group__copy__split#ga86716e7ec841541deb6edc7e91fcb9e4), [`cudf::unpack()`](https://docs.rapids.ai/api/libcudf/legacy/group__copy__split#ga1d62a18c2e6f087a92289c63693762cc), and [`cudf::packed_columns`](https://docs.rapids.ai/api/libcudf/legacy/structcudf_1_1packed__columns). This is the first step to support serialization of cudf.polars' IR. cc. @wence- @rjzamora # Authors: - Mads R. B. Kristensen (https://github.com/madsbk) - Matthew Murray (https://github.com/Matt711) - Lawrence Mitchell (https://github.com/wence-) Approvers: - Matthew Roeschke (https://github.com/mroeschke) - Lawrence Mitchell (https://github.com/wence-) URL: #17012

Move `cpp/benchmarks/string/url_decode.cu` implementation from google-bench to nvbench. This benchmark is for the `cudf::strings::url_decode` API. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Muhammad Haseeb (https://github.com/mhaseeb123) - Nghia Truong (https://github.com/ttnghia) URL: #17328

closes #17360 Technically I suppose this was more of an enhancement since the documentation suggested only a single label was supported, but I'll mark as a bug since the error message was not informative. Authors: - Matthew Roeschke (https://github.com/mroeschke) - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Bradley Dice (https://github.com/bdice) URL: #17373

Moves the `cpp/benchmarks/string/repeat_strings.cpp` implementation from google-bench to nvbench. This covers the overloads of the `cudf::strings::repeat_strings` API. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Nghia Truong (https://github.com/ttnghia) - Yunsong Wang (https://github.com/PointKernel) URL: #17304

#17250 started using `pynvml` but did not add the proper dependency, this change fixes the missing dependency. Authors: - Peter Andreas Entschev (https://github.com/pentschev) - https://github.com/jakirkham Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) - https://github.com/jakirkham URL: #17386

This is likely the easiest fix for avoiding CI errors from this part of the code. Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) - Bradley Dice (https://github.com/bdice) URL: #17389

Apart of #15162 Authors: - Matthew Murray (https://github.com/Matt711) Approvers: - David Wendt (https://github.com/davidwendt) - Matthew Roeschke (https://github.com/mroeschke) - Vyas Ramasubramani (https://github.com/vyasr) - Lawrence Mitchell (https://github.com/wence-) URL: #17163

copy-pr-bot · 2024-11-21T20:50:07Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

review-notebook-app · 2024-11-21T20:50:12Z

Check out this pull request on

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

This PR enables Unified memory as the default memory resource for `cudf_polars` --------- Co-authored-by: Vyas Ramasubramani <[email protected]> Co-authored-by: Vyas Ramasubramani <[email protected]> Co-authored-by: Matthew Murray <[email protected]> Co-authored-by: Lawrence Mitchell <[email protected]> Co-authored-by: Matthew Murray <[email protected]>

mroeschke and others added 30 commits October 5, 2024 01:30

Add string.convert.convert_ipv4 APIs to pylibcudf (#16994)

33b8dfa

Contributes to #15162 Authors: - Matthew Roeschke (https://github.com/mroeschke) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) URL: #16994

Remove unused import (#17005)

bfd568b

This PR removes an unused unused import in cudf which was causing errors in doc builds. Authors: - Matthew Murray (https://github.com/Matt711) Approvers: - Bradley Dice (https://github.com/bdice) URL: #17005

Migrate nvtext generate_ngrams APIs to pylibcudf (#17006)

09ed210

Apart of #15162 Authors: - Matthew Murray (https://github.com/Matt711) Approvers: - Bradley Dice (https://github.com/bdice) URL: #17006

Migrate nvtext jaccard API to pylibcudf (#17007)

618a93f

Apart of #15162 Authors: - Matthew Murray (https://github.com/Matt711) Approvers: - Matthew Roeschke (https://github.com/mroeschke) URL: #17007

Merge pull request #17027 from rapidsai/branch-24.10

9c37e1e

Forward-merge branch-24.10 into branch-24.12

Add string.convert_floats APIs to pylibcudf (#16990)

3791c8a

Contributes to #15162 Authors: - Matthew Roeschke (https://github.com/mroeschke) Approvers: - https://github.com/brandon-b-miller URL: #16990

Add string.convert.convert_lists APIs to pylibcudf (#16997)

69b0f66

Contributes to #15162 Authors: - Matthew Roeschke (https://github.com/mroeschke) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) URL: #16997

Add json APIs to pylibcudf (#17025)

7d49df7

Contributes to #15162 Authors: - Matthew Roeschke (https://github.com/mroeschke) Approvers: - James Lamb (https://github.com/jameslamb) - Vyas Ramasubramani (https://github.com/vyasr) URL: #17025

davidwendt and others added 6 commits November 20, 2024 19:02

raydouglass requested review from a team as code owners November 21, 2024 20:50

raydouglass requested review from jameslamb, wence-, galipremsagar, mhaseeb123 and vuule November 21, 2024 20:50

ttnghia approved these changes Nov 21, 2024

View reviewed changes

mhaseeb123 approved these changes Nov 21, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RELEASE] cudf v24.12 #17406

[RELEASE] cudf v24.12 #17406

raydouglass commented Nov 21, 2024

copy-pr-bot bot commented Nov 21, 2024

review-notebook-app bot commented Nov 21, 2024

[RELEASE] cudf v24.12 #17406

Are you sure you want to change the base?

[RELEASE] cudf v24.12 #17406

Conversation

raydouglass commented Nov 21, 2024

❄️ Code freeze for branch-24.12 and v24.12 release

What does this mean?

What is the purpose of this PR?

copy-pr-bot bot commented Nov 21, 2024

review-notebook-app bot commented Nov 21, 2024

❄️ Code freeze for `branch-24.12` and v24.12 release