Forward-merge branch-24.12 into branch-25.02 #17343

Merged 24 commits from branch-24.12 into branch-25.02 on Nov 20, 2024

Commits on Nov 15, 2024

  1. add telemetry setup to test (#16924)

    This is a prototype implementation of rapidsai/build-infra#139
    
    The work that this builds on:
    * rapidsai/gha-tools#118, which adds a shell wrapper that automatically creates spans for the commands that it wraps. It also uses the `opentelemetry-instrument` command to set up monkeypatching for supported Python libraries if the wrapped command is Python-based (see the sketch after this list)
    * https://github.com/rapidsai/shared-workflows/tree/add-telemetry, which installs the gha-tools work from above and sets necessary environment variables. This is only done for the conda-cpp-build.yaml shared workflow at the time of submitting this PR.
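
    For illustration only, here is a minimal sketch of what creating a span around a wrapped command can look like with the OpenTelemetry Python SDK. This is not the gha-tools implementation; the exporter, tracer name, and span name are placeholders:

    ```python
    import subprocess

    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

    # Export spans to stdout; a real setup would send them to a collector instead.
    provider = TracerProvider()
    provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)

    tracer = trace.get_tracer("ci-wrapper-sketch")

    # Wrap an arbitrary command in a span, similar in spirit to what the shell wrapper does.
    with tracer.start_as_current_span("build-step: echo"):
        subprocess.run(["echo", "hello from a traced step"], check=True)
    ```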
    
    The goal of this PR is to observe telemetry data sent from a GitHub Actions build triggered by this PR as a proof of concept. Once it all works, the remaining work is:
    
    * merge rapidsai/gha-tools#118
    * Move the opentelemetry-related install stuff in https://github.com/rapidsai/shared-workflows/compare/add-telemetry?expand=1#diff-ca6188672785b5d214aaac2bf77ce0528a48481b2a16b35aeb78ea877b2567bcR118-R125 into https://github.com/rapidsai/ci-imgs, and rebuild ci-imgs
    * expand coverage to other shared workflows
    * Incorporate the changes from this PR to other jobs and to other repos
    
    Authors:
      - Mike Sarahan (https://github.com/msarahan)
    
    Approvers:
      - Bradley Dice (https://github.com/bdice)
    
    URL: #16924
    msarahan authored Nov 15, 2024
    Commit: 8664fad
  2. Update cmake to 3.28.6 in JNI Dockerfile (#17342)

    Updates cmake to 3.28.6 in the JNI Dockerfile used to build the cudf jar. This helps avoid a bug in older cmake versions where FindCUDAToolkit can fail to find the cufile libraries.
    
    Authors:
      - Jason Lowe (https://github.com/jlowe)
    
    Approvers:
      - Nghia Truong (https://github.com/ttnghia)
      - Gera Shegalov (https://github.com/gerashegalov)
    
    URL: #17342
    jlowe authored Nov 15, 2024
    Commit: e683647

Commits on Nov 16, 2024

  1. Use pylibcudf contiguous split APIs in cudf python (#17246)

    Part of #15162
    
    Authors:
      - Matthew Murray (https://github.com/Matt711)
    
    Approvers:
      - Lawrence Mitchell (https://github.com/wence-)
    
    URL: #17246
    Matt711 authored Nov 16, 2024
    Commit: 9cc9071

Commits on Nov 18, 2024

  1. Move strings translate benchmarks to nvbench (#17325)

    Moves the `cpp/benchmarks/string/translate.cpp` implementation from google-bench to nvbench.
    This is the benchmark for the `cudf::strings::translate` API.
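
    As a point of reference for the API being benchmarked, here is a minimal sketch of its Python-level counterpart, assuming the pandas-style `Series.str.translate` accessor in cudf (the input strings and mapping are illustrative):

    ```python
    import cudf

    s = cudf.Series(["hello", "world"])
    # Map 'l' -> '1' and 'o' -> '0'; str.maketrans builds the ordinal-to-string table.
    print(s.str.translate(str.maketrans({"l": "1", "o": "0"})))
    # 0    he110
    # 1    w0r1d
    # dtype: object
    ```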
    
    Authors:
      - David Wendt (https://github.com/davidwendt)
    
    Approvers:
      - Vukasin Milovanovic (https://github.com/vuule)
      - Nghia Truong (https://github.com/ttnghia)
    
    URL: #17325
    davidwendt authored Nov 18, 2024
    Commit: e4de8e4
  2. Move cudf._lib.unary to cudf.core._internals (#17318)

    Contributes to #17317
    
    Authors:
      - Matthew Roeschke (https://github.com/mroeschke)
    
    Approvers:
      - GALI PREM SAGAR (https://github.com/galipremsagar)
    
    URL: #17318
    mroeschke authored Nov 18, 2024
    Commit: aeb6a30
  3. Reading multi-source compressed JSONL files (#17161)

    Fixes #17068 
    Fixes #12299
    
    This PR introduces a new datasource for compressed inputs which enables batching and byte-range reading of multi-source JSONL files using the reallocate-and-retry policy. Moreover, instead of using a 4:1 compression ratio heuristic, the device buffer size is estimated accurately for GZIP, ZIP, and SNAPPY compression types. For the remaining types, the files are first decompressed and then batched.
    
    ~~TODO: Reuse existing JSON tests but with an additional compression parameter to verify correctness.~~
    ~~Handled by #17219, which implements compressed JSON writer required for the above test.~~
    Multi-source compressed input tests added!
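
    A minimal usage sketch of the user-facing behaviour this enables (the file names are illustrative):

    ```python
    import gzip

    import cudf

    # Write two gzip-compressed JSON Lines sources.
    for name, rows in [("part0.jsonl.gz", b'{"a": 1}\n{"a": 2}\n'),
                       ("part1.jsonl.gz", b'{"a": 3}\n')]:
        with gzip.open(name, "wb") as f:
            f.write(rows)

    # Read both sources together; cudf decompresses and batches them internally.
    df = cudf.read_json(["part0.jsonl.gz", "part1.jsonl.gz"],
                        lines=True, compression="gzip")
    print(df)  # a single DataFrame with rows from both files
    ```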
    
    Authors:
      - Shruti Shivakumar (https://github.com/shrshi)
    
    Approvers:
      - Vukasin Milovanovic (https://github.com/vuule)
      - Kyle Edwards (https://github.com/KyleFromNVIDIA)
      - Karthikeyan (https://github.com/karthikeyann)
    
    URL: #17161
    shrshi authored Nov 18, 2024
    Commit: 03ac845
  4. Test the full matrix for polars and dask wheels on nightlies (#17320)

    This PR ensures that we have nightly coverage of more of the CUDA/Python/arch versions that we claim to support for dask-cudf and cudf-polars wheels.
    
    In addition, this PR ensures that we do not attempt to run the dbgen executable in the Polars repository on systems whose glibc is too old to run it.
    
    Authors:
      - Vyas Ramasubramani (https://github.com/vyasr)
    
    Approvers:
      - Bradley Dice (https://github.com/bdice)
    
    URL: #17320
    vyasr authored Nov 18, 2024
    Commit: d514517
  5. Fix reading Parquet string cols when nrows and input_pass_limit > 0 (#17321)
    
    This PR fixes reading string columns in Parquet using the chunked Parquet reader when `nrows` and `input_pass_limit` are > 0.
    
    Closes #17311
    
    Authors:
      - Muhammad Haseeb (https://github.com/mhaseeb123)
    
    Approvers:
      - Vukasin Milovanovic (https://github.com/vuule)
      - Ed Seidl (https://github.com/etseidl)
      - Lawrence Mitchell (https://github.com/wence-)
      - Bradley Dice (https://github.com/bdice)
      - https://github.com/nvdbaranec
      - GALI PREM SAGAR (https://github.com/galipremsagar)
    
    URL: #17321
    mhaseeb123 authored Nov 18, 2024
    Commit: 43f2f68
  6. Remove cudf._lib.hash in favor of inlining pylibcudf (#17345)

    Contributes to #17317
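
    To illustrate the "inlining pylibcudf" pattern that this and the following commits apply, here is a rough sketch of calling pylibcudf directly instead of going through a `cudf._lib` shim. The module and function names are assumed to mirror libcudf's `cudf::hashing` namespace (e.g. `murmurhash3_x86_32`); treat them as illustrative:

    ```python
    import pyarrow as pa
    import pylibcudf as plc

    # Build a pylibcudf Table from Arrow data.
    tbl = plc.interop.from_arrow(pa.table({"a": [1, 2, 3]}))

    # Call the hashing API directly, rather than via a cudf._lib wrapper.
    hashed = plc.hashing.murmurhash3_x86_32(tbl)
    print(hashed.type(), hashed.size())  # a UINT32 column with one hash per input row
    ```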
    
    Authors:
      - Matthew Roeschke (https://github.com/mroeschke)
    
    Approvers:
      - Lawrence Mitchell (https://github.com/wence-)
      - GALI PREM SAGAR (https://github.com/galipremsagar)
    
    URL: #17345
    mroeschke authored Nov 18, 2024
    Commit: 18b40dc
  7. Remove cudf._lib.concat in favor of inlining pylibcudf (#17344)

    Contributes to #17317
    
    Authors:
      - Matthew Roeschke (https://github.com/mroeschke)
    
    Approvers:
      - GALI PREM SAGAR (https://github.com/galipremsagar)
    
    URL: #17344
    mroeschke authored Nov 18, 2024
    Commit: ba21673
  8. Remove cudf._lib.quantiles in favor of inlining pylibcudf (#17347)

    Contributes to #17317
    
    Authors:
      - Matthew Roeschke (https://github.com/mroeschke)
    
    Approvers:
      - GALI PREM SAGAR (https://github.com/galipremsagar)
    
    URL: #17347
    mroeschke authored Nov 18, 2024
    Commit: 02c35bf
  9. Remove cudf._lib.labeling in favor of inlining pylibcudf (#17346)

    Contributes to #17317
    
    Authors:
      - Matthew Roeschke (https://github.com/mroeschke)
    
    Approvers:
      - GALI PREM SAGAR (https://github.com/galipremsagar)
    
    URL: #17346
    mroeschke authored Nov 18, 2024
    Commit: 302e625

Commits on Nov 19, 2024

  1. Support polars 1.14 (#17355)

    Polars 1.13 was yanked for some reason, and 1.14 doesn't introduce anything new that is difficult to support.
    
    Authors:
      - Lawrence Mitchell (https://github.com/wence-)
      - GALI PREM SAGAR (https://github.com/galipremsagar)
    
    Approvers:
      - Vyas Ramasubramani (https://github.com/vyasr)
      - https://github.com/brandon-b-miller
      - GALI PREM SAGAR (https://github.com/galipremsagar)
    
    URL: #17355
    wence- authored Nov 19, 2024
    Commit: 5f9a97f
  2. Writing compressed output using JSON writer (#17323)

    Depends on #17161 for implementations of compression and decompression functions (`io/comp/comp.cu`, `io/comp/comp.hpp`, `io/comp/io_uncomp.hpp` and `io/comp/uncomp.cpp`)
    
    Adds support for writing GZIP- and SNAPPY-compressed JSON to the JSON writer.
    Verifies correctness using a parameterized test in `tests/io/json/json_writer.cpp`.
    
    Authors:
      - Shruti Shivakumar (https://github.com/shrshi)
      - Vukasin Milovanovic (https://github.com/vuule)
    
    Approvers:
      - Kyle Edwards (https://github.com/KyleFromNVIDIA)
      - Karthikeyan (https://github.com/karthikeyann)
      - Vukasin Milovanovic (https://github.com/vuule)
    
    URL: #17323
    shrshi authored Nov 19, 2024
    Commit: 384abae
  3. fix library-loading issues in editable installs (#17338)

    Contributes to rapidsai/build-planning#118
    
    The pattern introduced in #17316 breaks editable installs in devcontainers. In that type of build, `libcudf.so` is built outside of the wheel but **not installed**, so it can't be found by `ld`. Extension modules in `cudf` and `pylibcudf` are able to find it via RPATHs instead.
    
    This proposes:
    
    * try/catching the entire library-loading attempt, so that it silently does nothing in cases like that (a minimal sketch of the pattern follows this list)
    * ~adding imports of the `cudf` and `pylibcudf` libraries in the `devcontainers` CI job, as a smoke test to catch issues like this in the future~ *(edit: removed those, [`devcontainer` builds run on CPU nodes](https://github.com/rapidsai/shared-workflows/blob/4e84062f333ce5649bc65029d3979569e2d0a045/.github/workflows/build-in-devcontainer.yaml#L19))*
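
    A rough sketch of that pattern, assuming the wheel exposes a `load_library()` entry point; the module and function names here are illustrative, not the exact code from this PR:

    ```python
    try:
        # In a regular wheel build, libcudf.so ships inside the libcudf wheel
        # and this call loads it eagerly.
        import libcudf

        libcudf.load_library()
    except Exception:
        # In editable installs (e.g. devcontainers) the library is built but not
        # installed with the wheel, so loading fails; extension modules locate
        # libcudf.so via their RPATHs instead, so we silently do nothing here.
        pass
    ```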
    
    ## Notes for Reviewers
    
    ### How I tested this
    
    Tested this approach on rapidsai/kvikio#553
    
    
    Authors:
      - James Lamb (https://github.com/jameslamb)
      - Matthew Murray (https://github.com/Matt711)
    
    Approvers:
      - Bradley Dice (https://github.com/bdice)
      - Matthew Murray (https://github.com/Matt711)
    
    URL: #17338
    jameslamb authored Nov 19, 2024
    Commit: 9c5cd81
  4. Fix integer overflow in compiled binaryop (#17354)

    For large columns, the computed stride might end up overflowing `size_type`. To fix this, use the `grid_1d` helper. See also #10368.
    
    - Closes #17353
    
    Authors:
      - Lawrence Mitchell (https://github.com/wence-)
    
    Approvers:
      - Bradley Dice (https://github.com/bdice)
      - David Wendt (https://github.com/davidwendt)
      - Tianyu Liu (https://github.com/kingcrimsontianyu)
      - Muhammad Haseeb (https://github.com/mhaseeb123)
      - Nghia Truong (https://github.com/ttnghia)
    
    URL: #17354
    wence- authored Nov 19, 2024
    Commit: c7bfa77
  5. Move strings replace benchmarks to nvbench (#17301)

    Moves the `cpp/benchmarks/string/replace.cpp` implementation from google-bench to nvbench.
    This covers the strings replace APIs:
    - `cudf::strings::replace` scalar version
    - `cudf::strings::replace_multiple` column version
    - `cudf::strings::replace_slice`
    
    Authors:
      - David Wendt (https://github.com/davidwendt)
    
    Approvers:
      - Yunsong Wang (https://github.com/PointKernel)
      - Shruti Shivakumar (https://github.com/shrshi)
    
    URL: #17301
    davidwendt authored Nov 19, 2024
    Commit: 03c055f
  6. Optimize distinct inner join to use set find instead of retrieve (#17278)
    
    This PR introduces a minor optimization for distinct inner joins by using the `find` results to selectively copy matches to the output. This approach eliminates the need for the costly `retrieve` operation, which relies on expensive atomic operations.
    
    Authors:
      - Yunsong Wang (https://github.com/PointKernel)
    
    Approvers:
      - Vukasin Milovanovic (https://github.com/vuule)
      - Karthikeyan (https://github.com/karthikeyann)
    
    URL: #17278
    PointKernel authored Nov 19, 2024
    Commit: 56061bd

Commits on Nov 20, 2024

  1. Add compute_column_expression to pylibcudf for transform.compute_column (#17279)
    
    Follow up to #16760
    
    `transform.compute_column` (backing `.eval`) requires an `Expression` object created by a private routine in cudf Python. Since this routine will be needed by any user of the public `transform.compute_column`, it is moved to pylibcudf.
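
    For context, a minimal sketch of the user-facing behaviour this ultimately backs (`DataFrame.eval` in cudf; the data is illustrative):

    ```python
    import cudf

    df = cudf.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})
    # DataFrame.eval parses the expression into an Expression object and evaluates
    # it on the GPU via transform.compute_column under the hood.
    print(df.eval("a + b"))
    ```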
    
    Authors:
      - Matthew Roeschke (https://github.com/mroeschke)
      - Lawrence Mitchell (https://github.com/wence-)
    
    Approvers:
      - Lawrence Mitchell (https://github.com/wence-)
      - Vyas Ramasubramani (https://github.com/vyasr)
    
    URL: #17279
    mroeschke authored Nov 20, 2024
    Commit: 7158ee0
  2. Bug fix: restrict lines=True to JSON format in Kafka read_gdf method (#17333)
    
    This pull request modifies the `read_gdf` method in `kafka.py` to pass the `lines` parameter only when `message_format` is `"json"`. This prevents `lines` from being passed to other formats (e.g., CSV, Avro, ORC, Parquet), which do not support this parameter.
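
    A rough sketch of the guard described above (variable and function names are illustrative, not the exact `kafka.py` code):

    ```python
    def build_read_kwargs(message_format: str) -> dict:
        """Only JSON supports the `lines` parameter; other formats must not receive it."""
        read_kwargs = {}
        if message_format == "json":
            read_kwargs["lines"] = True
        return read_kwargs

    # e.g. {"lines": True} for JSON, {} for CSV/Avro/ORC/Parquet
    print(build_read_kwargs("json"), build_read_kwargs("csv"))
    ```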
    
    Authors:
      - Hirota Akio (https://github.com/a-hirota)
      - Vyas Ramasubramani (https://github.com/vyasr)
    
    Approvers:
      - Vyas Ramasubramani (https://github.com/vyasr)
    
    URL: #17333
    a-hirota authored Nov 20, 2024
    Commit: 05365af
  3. Adapt to KvikIO API change in the compatibility mode (#17377)

    This PR adapts cuDF to a breaking API change in KvikIO (rapidsai/kvikio#547) introduced recently, which adds the `AUTO` compatibility mode to file I/O.
    
    This PR causes no behavioral changes in cuDF: if the environment variable `KVIKIO_COMPAT_MODE` is left unset, cuDF by default still enables the compatibility mode in KvikIO, the same as the previous behavior (#17185).
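
    For illustration, overriding that default from Python before cuDF performs any I/O; the `"OFF"` value is shown only as an example of an explicit setting:

    ```python
    import os

    # Must be set before cuDF's first I/O call so KvikIO picks it up.
    os.environ["KVIKIO_COMPAT_MODE"] = "OFF"  # disable compatibility mode (use cuFile when available)

    import cudf  # noqa: E402

    df = cudf.DataFrame({"a": [1, 2, 3]})
    df.to_parquet("example.parquet")
    print(cudf.read_parquet("example.parquet"))
    ```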
    
    Authors:
      - Tianyu Liu (https://github.com/kingcrimsontianyu)
    
    Approvers:
      - Vukasin Milovanovic (https://github.com/vuule)
    
    URL: #17377
    kingcrimsontianyu authored Nov 20, 2024
    Commit: 6f83b58
  4. Benchmarking JSON reader for compressed inputs (#17219)

    Depends on #17161 for implementations of compression and decompression functions (`io/comp/comp.cu`, `io/comp/comp.hpp`, `io/comp/io_uncomp.hpp` and `io/comp/uncomp.cpp`).
    Depends on #17323 for compressed JSON writer implementation.
    
    Adds benchmark to measure performance of the JSON reader for compressed inputs.
    
    Authors:
      - Shruti Shivakumar (https://github.com/shrshi)
      - Muhammad Haseeb (https://github.com/mhaseeb123)
    
    Approvers:
      - MithunR (https://github.com/mythrocks)
      - Vukasin Milovanovic (https://github.com/vuule)
      - Karthikeyan (https://github.com/karthikeyann)
      - Muhammad Haseeb (https://github.com/mhaseeb123)
    
    URL: #17219
    shrshi authored Nov 20, 2024
    Commit: fc08fe8
  5. Deselect failing polars tests (#17362)

    Deselect `test_join_4_columns_with_validity`, which is failing in nightly CI tests and is reproducible on some systems (xref pola-rs/polars#19870), but apparently not all. Also deselect `test_read_web_file`, which fails on rockylinux8 due to SSL CA issues.
    
    Authors:
      - Peter Andreas Entschev (https://github.com/pentschev)
      - Vyas Ramasubramani (https://github.com/vyasr)
    
    Approvers:
      - Kyle Edwards (https://github.com/KyleFromNVIDIA)
    
    URL: #17362
    pentschev authored Nov 20, 2024
    Commit: a2a62a1
  6. Add new dask_cudf.read_parquet API (#17250)

    It's time to clean up the `dask_cudf.read_parquet` API and prioritize GPU-specific optimizations. To this end, it makes sense to expose our own `read_parquet` API within Dask cuDF. 
    
    **Notes**:
    
    - The "new" `dask_cudf.read_parquet` API is only relevant when query-planning is enabled (the default).
    - Using `filesystem="arrow"` now uses `cudf.read_parquet` when reading from local storage (rather than PyArrow).
    - (specific to Dask cuDF): The default `blocksize` argument is now specific to the "smallest" NVIDIA device detected within the active dask cluster (or the first device visible to the client). More specifically, we use `pynvml` to find this representative device size, and we set `blocksize` to be 1/32 of this size.
      - The user may also pass in something like `blocksize=0.125` to use `1/8` the minimum device size (or `blocksize='1GiB'` to bypass the default logic altogether).
    - (specific to Dask cuDF): When `blocksize` is `None`, we disable partition fusion at optimization time.
    - (specific to Dask cuDF): When `blocksize` is **not** `None`, we use the parquet metadata from the first few files to inform partition fusion at optimization time (instead of a rough column-count ratio).
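
    A minimal usage sketch of the new API under the defaults described above (the file path is illustrative):

    ```python
    import dask_cudf

    # Query-planning must be enabled (the default) for the new API to apply.
    ddf = dask_cudf.read_parquet(
        "data/*.parquet",     # illustrative local path
        filesystem="arrow",   # local reads go through cudf.read_parquet
        blocksize="1GiB",     # or e.g. 0.125 for 1/8 of the smallest device's memory
    )
    print(ddf.head())
    ```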
    
    Authors:
      - Richard (Rick) Zamora (https://github.com/rjzamora)
      - Vyas Ramasubramani (https://github.com/vyasr)
      - Mads R. B. Kristensen (https://github.com/madsbk)
    
    Approvers:
      - Mads R. B. Kristensen (https://github.com/madsbk)
      - Lawrence Mitchell (https://github.com/wence-)
      - GALI PREM SAGAR (https://github.com/galipremsagar)
    
    URL: #17250
    rjzamora authored Nov 20, 2024
    Commit: 3111aa4