
Introduce utility function to calculate inputs to rolling_window #16305

Draft · wence- wants to merge 16 commits into branch-25.02 from wence/fea/polars-rolling
Conversation

@wence- (Contributor) commented Jul 18, 2024

Description

Both polars and pandas specify rolling window extents by providing a scalar window length, along with whether the interval is open, closed, left-half-open, or right-half-open, and (for polars) a scalar offset. We currently use a numba kernel to convert this information into the pair of preceding_window and following_window columns that are the arguments to rolling_window. This kernel has a few problems:

  • For large window extents it has poor performance, scaling as $\mathcal{O}(n \cdot \text{window\_size})$;
  • It doesn't support different endpoint handling;
  • Since pandas doesn't strictly require it (although the center parameter effectively does), it doesn't support computing following_window information;
  • If we want to use it in cudf-polars, it introduces another dependency (numba).

To address these concerns, implement a utility function in libcudf that takes window length and offset and computes the matching preceding/following columns. This implementation exploits the fact that the column we are rolling over must be sorted in ascending order and uses thrust::lower_bound to provide an $\mathcal{O}(n \log n)$ run time for all window sizes. We also now support all endpoint handling required for cudf-polars.
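For intuition, here is a minimal host-side sketch of the idea (illustrative only, not the libcudf implementation: it uses std::upper_bound as a stand-in for the device-side thrust searches, hard-codes a left-open/right-closed window, and uses made-up example values):

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

int main()
{
  // The order-by column must be sorted ascending, as the utility requires.
  std::vector<std::int64_t> orderby{1, 2, 2, 5, 9, 10};
  std::int64_t const length = 3;  // row i's window covers values in (x - length, x]

  std::vector<std::int32_t> preceding(orderby.size());
  std::vector<std::int32_t> following(orderby.size());
  for (std::size_t i = 0; i < orderby.size(); ++i) {
    auto const x   = orderby[i];
    auto const row = static_cast<std::int64_t>(i);
    // First row whose value is strictly greater than x - length (left-open endpoint).
    auto const start = std::upper_bound(orderby.begin(), orderby.end(), x - length) - orderby.begin();
    // One past the last row whose value is <= x (right-closed endpoint).
    auto const end = std::upper_bound(orderby.begin(), orderby.end(), x) - orderby.begin();
    // rolling_window's convention: the preceding count includes the current row.
    preceding[i] = static_cast<std::int32_t>(row - start + 1);
    following[i] = static_cast<std::int32_t>(end - row - 1);
  }
  for (std::size_t i = 0; i < orderby.size(); ++i) {
    std::cout << orderby[i] << ": preceding=" << preceding[i] << " following=" << following[i] << "\n";
  }
}

Each row costs two binary searches, so the whole computation is O(n log n) regardless of how many rows fall inside the window.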

This provides the infrastructure to work on addressing (at least partially):

  • #12774
  • #14334
  • #15086
  • #15119
  • #15192

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@wence- wence- requested review from a team as code owners July 18, 2024 10:45
@github-actions bot added labels: libcudf (Affects libcudf (C++/CUDA) code), Python (Affects Python cuDF API), CMake (CMake build issue), pylibcudf (Issues specific to the pylibcudf package) on Jul 18, 2024
@wence- force-pushed the wence/fea/polars-rolling branch 2 times, most recently from 72f221b to 9988fd9, on July 18, 2024 10:52
@davidwendt (Contributor) left a comment

I think windows_utils.cu should be in src/rolling and not src/rolling/detail

cpp/src/rolling/detail/window_utils.cu (outdated, resolved)
cpp/src/rolling/detail/window_utils.cu (outdated, resolved)
cpp/src/rolling/rolling.cu (outdated, resolved)
@wence- (Contributor, Author) commented Jul 18, 2024

I think windows_utils.cu should be in src/rolling and not src/rolling/detail

Should the declarations still live in cudf/rolling.hpp and cudf/detail/rolling.hpp ?

@davidwendt (Contributor)

I think windows_utils.cu should be in src/rolling and not src/rolling/detail

Should the declarations still live in cudf/rolling.hpp and cudf/detail/rolling.hpp ?

Yes. I don't think you need the cudf/detail/rolling.hpp one unless you think an internal libcudf function will be calling it. Even then I would wait until that internal function needs it and add it to cudf/detail/rolling.hpp at that time.

@davidwendt added the non-breaking (Non-breaking change) and improvement (Improvement / enhancement to an existing function) labels on Jul 18, 2024
@wence- added the 5 - DO NOT MERGE (Hold off on merging; see PR for details) label on Jul 18, 2024
@wence- (Contributor, Author) commented Jul 18, 2024

Debugging some fencepost errors...

@wence- removed the 5 - DO NOT MERGE (Hold off on merging; see PR for details) label on Jul 18, 2024
@wence- (Contributor, Author) commented Jul 18, 2024

Debugging some fencepost errors...

Fixed, I was bitten by switch-fallthrough (so I replaced it with cascading if-else).
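For context, a minimal illustration of the fallthrough trap (the enum and adjustment values here are made up for illustration and are not the PR's code):

#include <iostream>

enum class endpoint { LEFT_CLOSED, RIGHT_CLOSED };

int adjustment_switch(endpoint e)
{
  int adj = 0;
  switch (e) {
    // Drop this break and LEFT_CLOSED silently falls through into the next case.
    case endpoint::LEFT_CLOSED: adj += 1; break;
    case endpoint::RIGHT_CLOSED: adj += 2; break;
  }
  return adj;
}

int adjustment_if_else(endpoint e)
{
  // Cascading if-else: each branch is exclusive by construction, so no fallthrough is possible.
  if (e == endpoint::LEFT_CLOSED) { return 1; }
  else if (e == endpoint::RIGHT_CLOSED) { return 2; }
  return 0;
}

int main()
{
  std::cout << adjustment_switch(endpoint::LEFT_CLOSED) << ' '
            << adjustment_if_else(endpoint::LEFT_CLOSED) << '\n';  // prints "1 1"
}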

cpp/include/cudf/rolling.hpp (outdated, resolved)
Comment on lines 183 to 184
CUDF_EXPECTS(length.type().id() == type_to_id<OffsetType>(),
"Length must have same the resolution as the input.");
@wence- (Contributor, Author)

This is a tighter requirement than strictly necessary. We just need it to be the case that we can add OffsetType to T (so I think it should be OK, for example, to have a TIMESTAMP_MICROSECONDS column and add a DURATION_SECONDS offset/length).

But to fully handle that I think one needs a triple type-dispatch over the three-tuple (input_type, length, offset), so I decided not to do that unless it turns out to be absolutely necessary.
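As a side note, a tiny std::chrono sketch of the relaxation described above, namely that adding a coarser-resolution duration to a finer-resolution timestamp is well defined (illustrative values; this is not the check the PR performs):

#include <chrono>
#include <iostream>

int main()
{
  using namespace std::chrono;
  // A microsecond-resolution timestamp plus a second-resolution duration is well
  // defined: the sum is carried out at the finer (microsecond) resolution.
  time_point<system_clock, microseconds> ts{microseconds{1'000'000}};
  auto const shifted = ts + seconds{2};
  std::cout << shifted.time_since_epoch().count() << " us\n";  // prints 3000000 us
}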

Comment on lines 199 to 200
CUDF_EXPECTS(have_same_types(input, length),
"Input column, length, and offset must have the same type.");
@wence- (Contributor, Author)

Similarly here, I think it would be OK if length and offset are addable to input. But this is easier.

cpp/src/rolling/window_utils.cu (outdated, resolved)
@wence- (Contributor, Author) commented Jul 18, 2024

Using the example from #15119, we can see the improvement in computing the window bounds:

In [17]: %timeit cudf.utils.cudautils.window_sizes_from_offset(dt.data_array_view(mode="write"), np.timedelta64(1, "s")); c.default_stream().synchronize()
799 μs ± 4.47 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [18]: %timeit cudf.utils.cudautils.window_sizes_from_offset(dt.data_array_view(mode="write"), np.timedelta64(1, "m")); c.default_stream().synchronize()
4.62 ms ± 9.78 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [19]: %timeit cudf.utils.cudautils.window_sizes_from_offset(dt.data_array_view(mode="write"), np.timedelta64(1, "h")); c.default_stream().synchronize()
254 ms ± 790 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [20]: %timeit cudf.utils.cudautils.window_sizes_from_offset(dt.data_array_view(mode="write"), np.timedelta64(1, "D")); c.default_stream().synchronize()
6.2 s ± 28.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So we see that, as the window gets bigger, the time to compute the window bounds increases linearly in the window size.

In contrast, with this new implementation:

In [21]: %%timeit
    ...: n = 1
    ...: plc.rolling.windows_from_offset(b, plc.interop.from_arrow(pa.scalar(n, type=pa.duration("s"))), plc.interop.from_arrow(pa.scalar(-n, type=pa.duration("s"))), plc.rolling.WindowType.LEFT_CLOSED, True)
    ...: 
    ...: 
2.82 ms ± 6.42 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [22]: %%timeit
    ...: n = 60
    ...: plc.rolling.windows_from_offset(b, plc.interop.from_arrow(pa.scalar(n, type=pa.duration("s"))), plc.interop.from_arrow(pa.scalar(-n, type=pa.duration("s"))), plc.rolling.WindowType.LEFT_CLOSED, True)
    ...: 
    ...: 
2.83 ms ± 7.13 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [23]: %%timeit
    ...: n = 60*60
    ...: plc.rolling.windows_from_offset(b, plc.interop.from_arrow(pa.scalar(n, type=pa.duration("s"))), plc.interop.from_arrow(pa.scalar(-n, type=pa.duration("s"))), plc.rolling.WindowType.LEFT_CLOSED, True)
    ...: 
    ...: 
2.84 ms ± 4.12 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [24]: %%timeit
    ...: n = 60*60*24
    ...: plc.rolling.windows_from_offset(b, plc.interop.from_arrow(pa.scalar(n, type=pa.duration("s"))), plc.interop.from_arrow(pa.scalar(-n, type=pa.duration("s"))), plc.rolling.WindowType.LEFT_CLOSED, True)
    ...: 
    ...: 
2.82 ms ± 6.73 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So the new implementation is slightly slower for the smallest (one-second) window (about 2.8 ms vs 0.8 ms), but its run time stays essentially flat as the window grows, whereas the old kernel degrades to multiple seconds for day-long windows.

@@ -48,5 +48,14 @@ std::unique_ptr<column> rolling_window(column_view const& input,
rmm::cuda_stream_view stream,
rmm::device_async_resource_ref mr);

std::pair<std::unique_ptr<column>, std::unique_ptr<column>> windows_from_offset(
Contributor

Should we consider including an @brief description here?

cpp/include/cudf/rolling.hpp (resolved)
scalar const& length,
scalar const& offset,
window_type const window_type,
bool only_preceding,
Contributor

Can this also be const?

@davidwendt (Contributor)

I don't think we should make copy-by-value parameters like this const. It is not needed by the caller and places an unnecessary restriction on the internal implementation.

Contributor

Point taken, @davidwendt. For my own understanding, should the preceding parameter (window_type) be subject to the same convention?

Maybe it doesn't apply, since it isn't obviously an enum?

Contributor

IMO it should apply to this enum parameter as well.
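For reference, a minimal sketch of the point about top-level const on by-value parameters (hypothetical function name, not from the PR):

#include <type_traits>

void f(int x);       // public declaration without const
void f(int const x)  // adding const in the definition declares the same function,
{                    // because top-level const is ignored in the function's type...
  // x += 1;         // ...but inside the body the local copy cannot be modified.
  (void)x;
}

static_assert(std::is_same_v<void(int), void(int const)>,
              "callers cannot tell the difference");

int main() { f(42); }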

@mythrocks (Contributor) commented Jul 18, 2024

Forgive me if this is a complete tangent: This utility is in service of computing the window bounds for range window queries?

I was wondering if you'd already had a look at the grouped-range functions implemented in grouped_rolling.cu. They handle the following cases:

  1. whether the order-by column is sorted in ASC/DESC.
  2. whether or not there is also a grouping column (so that the window does not cross group boundaries).
  3. whether the order-by column contains nulls (sorted to either the beginning or the end of the column).

Note that the window bounds are calculated separately, depending upon the ordering, null-placement, and the presence of the group-by columns.

Edit: For clarity, this is in service of Spark's window-function SQL:

SELECT SUM(dollars) OVER (
   PARTITION BY dept_id
   ORDER BY sale_date ASC
   RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND 7 DAYS FOLLOWING ) AS two_week_earnings
...

@wence- added the 5 - DO NOT MERGE (Hold off on merging; see PR for details) label on Jul 19, 2024
@wence- (Contributor, Author) commented Jul 19, 2024

Deciding if this should slip to 24.10

@wence- marked this pull request as draft on July 19, 2024 17:37
@wence- (Contributor, Author) commented Jul 19, 2024

Will retarget to 24.10 and look for opportunities to reuse or share the existing implementations.

@wence- removed the 5 - DO NOT MERGE (Hold off on merging; see PR for details) label on Jul 19, 2024
/**
* @brief Indicates which endpoints a rolling window contains.
*/
enum class window_type : int32_t {
@mythrocks (Contributor) commented Jul 19, 2024

Trivial nit: One wonders if we might use a more descriptive name for this type. Something like window_margin(_type)?
Edit: I mention this because window_bounds was initially named window_type, then window_bounds_type, and then window_bounds.

@mythrocks (Contributor)

Will retarget to 24.10 and look for opportunities to reuse or share the existing implementations.

Your openness to this is deeply appreciated. I'd be happy to assist. My current implementation certainly doesn't tackle open/closed margins, or negative range values.

wence- added 16 commits December 3, 2024 10:54
Modernise use of op_impl to omit unnecessary T template parameter.
e.g. thrust::less, not std::less.
@wence- force-pushed the wence/fea/polars-rolling branch from 7d55d4e to d9010db on December 9, 2024 18:13
copy-pr-bot (bot) commented Dec 9, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions bot added labels: Java (Affects Java cuDF API), cudf.pandas (Issues specific to cudf.pandas), cudf.polars (Issues specific to cudf.polars) on Dec 9, 2024
@wence- changed the base branch from branch-24.08 to branch-25.02 on December 12, 2024 17:13