
Introduce utility function to calculate inputs to rolling_window #16305

Draft · wence- wants to merge 16 commits into branch-25.02 from wence/fea/polars-rolling
Conversation

@wence- (Contributor) commented Jul 18, 2024

Description

Both polars and pandas specify rolling window extents by providing a scalar window length, along with whether the interval is open, closed, left-half-open, or right-half-open, and (for polars) a scalar offset. We currently use a numba kernel to convert this information into the pair of preceding_window and following_window columns that are the arguments to rolling_window. This kernel has a few problems:

  • For large window extents it has poor performance, scaling as $\mathcal{O}(n \cdot \text{window\_size})$;
  • It doesn't support different endpoint handling;
  • Since pandas doesn't strictly require it (although the center parameter effectively does), it doesn't support computing following_window information;
  • If we want to use it in cudf-polars, it introduces another dependency (numba).

To address these concerns, implement a utility function in libcudf that takes window length and offset and computes the matching preceding/following columns. This implementation exploits the fact that the column we are rolling over must be sorted in ascending order and uses thrust::lower_bound to provide an $\mathcal{O}(n \log n)$ run time for all window sizes. We also now support all endpoint handling required for cudf-polars.
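For intuition, here is a minimal host-side sketch of the idea (illustrative only, not the libcudf implementation: it uses std::upper_bound as a stand-in for the device-side thrust searches, hard-codes a left-open/right-closed window, and uses made-up example values):

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

int main()
{
  // The order-by column must be sorted ascending, as the utility requires.
  std::vector<std::int64_t> orderby{1, 2, 2, 5, 9, 10};
  std::int64_t const length = 3;  // row i's window covers values in (x - length, x]

  std::vector<std::int32_t> preceding(orderby.size());
  std::vector<std::int32_t> following(orderby.size());
  for (std::size_t i = 0; i < orderby.size(); ++i) {
    auto const x   = orderby[i];
    auto const row = static_cast<std::int64_t>(i);
    // First row whose value is strictly greater than x - length (left-open endpoint).
    auto const start = std::upper_bound(orderby.begin(), orderby.end(), x - length) - orderby.begin();
    // One past the last row whose value is <= x (right-closed endpoint).
    auto const end = std::upper_bound(orderby.begin(), orderby.end(), x) - orderby.begin();
    // rolling_window's convention: the preceding count includes the current row.
    preceding[i] = static_cast<std::int32_t>(row - start + 1);
    following[i] = static_cast<std::int32_t>(end - row - 1);
  }
  for (std::size_t i = 0; i < orderby.size(); ++i) {
    std::cout << orderby[i] << ": preceding=" << preceding[i] << " following=" << following[i] << "\n";
  }
}

Each row costs two binary searches, so the whole computation is O(n log n) regardless of how many rows fall inside the window.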

This provides the infrastructure to work on addressing (at least partially):

  • #12774
  • #14334
  • #15086
  • #15119
  • #15192

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@wence- wence- requested review from a team as code owners July 18, 2024 10:45
@github-actions bot added labels: libcudf (Affects libcudf (C++/CUDA) code), Python (Affects Python cuDF API), CMake (CMake build issue), pylibcudf (Issues specific to the pylibcudf package) on Jul 18, 2024
@wence- force-pushed the wence/fea/polars-rolling branch 2 times, most recently from 72f221b to 9988fd9, on July 18, 2024 10:52
@davidwendt (Contributor) left a comment

I think windows_utils.cu should be in src/rolling and not src/rolling/detail

cpp/src/rolling/detail/window_utils.cu (outdated, resolved)
cpp/src/rolling/detail/window_utils.cu (outdated, resolved)
cpp/src/rolling/rolling.cu (outdated, resolved)
@wence- (Contributor, Author) commented Jul 18, 2024

I think windows_utils.cu should be in src/rolling and not src/rolling/detail

Should the declarations still live in cudf/rolling.hpp and cudf/detail/rolling.hpp ?

@davidwendt (Contributor)

I think windows_utils.cu should be in src/rolling and not src/rolling/detail

Should the declarations still live in cudf/rolling.hpp and cudf/detail/rolling.hpp ?

Yes. I don't think you need the cudf/detail/rolling.hpp one unless you think an internal libcudf function will be calling it. Even then I would wait until that internal function needs it and add it to cudf/detail/rolling.hpp at that time.

@davidwendt added the non-breaking (Non-breaking change) and improvement (Improvement / enhancement to an existing function) labels on Jul 18, 2024
@wence- added the 5 - DO NOT MERGE (Hold off on merging; see PR for details) label on Jul 18, 2024
@wence- (Contributor, Author) commented Jul 18, 2024

Debugging some fencepost errors...

@wence- removed the 5 - DO NOT MERGE (Hold off on merging; see PR for details) label on Jul 18, 2024
@wence- (Contributor, Author) commented Jul 18, 2024

Debugging some fencepost errors...

Fixed, I was bitten by switch-fallthrough (so I replaced it with cascading if-else).
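For context, a minimal illustration of the fallthrough trap (the enum and adjustment values here are made up for illustration and are not the PR's code):

#include <iostream>

enum class endpoint { LEFT_CLOSED, RIGHT_CLOSED };

int adjustment_switch(endpoint e)
{
  int adj = 0;
  switch (e) {
    // Drop this break and LEFT_CLOSED silently falls through into the next case.
    case endpoint::LEFT_CLOSED: adj += 1; break;
    case endpoint::RIGHT_CLOSED: adj += 2; break;
  }
  return adj;
}

int adjustment_if_else(endpoint e)
{
  // Cascading if-else: each branch is exclusive by construction, so no fallthrough is possible.
  if (e == endpoint::LEFT_CLOSED) { return 1; }
  else if (e == endpoint::RIGHT_CLOSED) { return 2; }
  return 0;
}

int main()
{
  std::cout << adjustment_switch(endpoint::LEFT_CLOSED) << ' '
            << adjustment_if_else(endpoint::LEFT_CLOSED) << '\n';  // prints "1 1"
}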

cpp/include/cudf/rolling.hpp (outdated, resolved)
Comment on lines 183 to 184
CUDF_EXPECTS(length.type().id() == type_to_id<OffsetType>(),
"Length must have same the resolution as the input.");
@wence- (Contributor, Author)

This is a tighter requirement than strictly necessary. We just need it to be the case that we can add OffsetType to T (so I think it should be OK, for example, to have a TIMESTAMP_MICROSECONDS column and add a DURATION_SECONDS offset/length).

But to fully handle that I think one needs a triple type-dispatch over the three-tuple (input_type, length, offset), so I decided not to do that unless it turns out to be absolutely necessary.
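As a side note, a tiny std::chrono sketch of the relaxation described above, namely that adding a coarser-resolution duration to a finer-resolution timestamp is well defined (illustrative values; this is not the check the PR performs):

#include <chrono>
#include <iostream>

int main()
{
  using namespace std::chrono;
  // A microsecond-resolution timestamp plus a second-resolution duration is well
  // defined: the sum is carried out at the finer (microsecond) resolution.
  time_point<system_clock, microseconds> ts{microseconds{1'000'000}};
  auto const shifted = ts + seconds{2};
  std::cout << shifted.time_since_epoch().count() << " us\n";  // prints 3000000 us
}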

Comment on lines 199 to 200
CUDF_EXPECTS(have_same_types(input, length),
"Input column, length, and offset must have the same type.");
@wence- (Contributor, Author)

Similarly here, I think it would be OK if length and offset are addable to input. But this is easier.

cpp/src/rolling/window_utils.cu (outdated, resolved)
@wence- (Contributor, Author) commented Jul 18, 2024

Using the example from #15119, we can see the improvement in computing the window bounds:

In [17]: %timeit cudf.utils.cudautils.window_sizes_from_offset(dt.data_array_view(mode="write"), np.timedelta64(1, "s")); c.default_stream().synchronize()
799 μs ± 4.47 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [18]: %timeit cudf.utils.cudautils.window_sizes_from_offset(dt.data_array_view(mode="write"), np.timedelta64(1, "m")); c.default_stream().synchronize()
4.62 ms ± 9.78 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [19]: %timeit cudf.utils.cudautils.window_sizes_from_offset(dt.data_array_view(mode="write"), np.timedelta64(1, "h")); c.default_stream().synchronize()
254 ms ± 790 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [20]: %timeit cudf.utils.cudautils.window_sizes_from_offset(dt.data_array_view(mode="write"), np.timedelta64(1, "D")); c.default_stream().synchronize()
6.2 s ± 28.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So we see that, as the window gets bigger, the time to compute the window bounds increases linearly in the window size.

In contrast, with this new implementation:

In [21]: %%timeit
    ...: n = 1
    ...: plc.rolling.windows_from_offset(b, plc.interop.from_arrow(pa.scalar(n, type=pa.duration("s"))), plc.interop.from_arrow(pa.scalar(-n, type=pa.duration("s"))), plc.rolling.WindowType.LEFT_CLOSED, True)
    ...: 
    ...: 
2.82 ms ± 6.42 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [22]: %%timeit
    ...: n = 60
    ...: plc.rolling.windows_from_offset(b, plc.interop.from_arrow(pa.scalar(n, type=pa.duration("s"))), plc.interop.from_arrow(pa.scalar(-n, type=pa.duration("s"))), plc.rolling.WindowType.LEFT_CLOSED, True)
    ...: 
    ...: 
2.83 ms ± 7.13 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [23]: %%timeit
    ...: n = 60*60
    ...: plc.rolling.windows_from_offset(b, plc.interop.from_arrow(pa.scalar(n, type=pa.duration("s"))), plc.interop.from_arrow(pa.scalar(-n, type=pa.duration("s"))), plc.rolling.WindowType.LEFT_CLOSED, True)
    ...: 
    ...: 
2.84 ms ± 4.12 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [24]: %%timeit
    ...: n = 60*60*24
    ...: plc.rolling.windows_from_offset(b, plc.interop.from_arrow(pa.scalar(n, type=pa.duration("s"))), plc.interop.from_arrow(pa.scalar(-n, type=pa.duration("s"))), plc.rolling.WindowType.LEFT_CLOSED, True)
    ...: 
    ...: 
2.82 ms ± 6.73 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So the new implementation is slightly slower for the smallest (one-second) window (about 2.8 ms vs 0.8 ms), but its run time stays essentially flat as the window grows, whereas the old kernel degrades to multiple seconds for day-long windows.

@@ -48,5 +48,14 @@ std::unique_ptr<column> rolling_window(column_view const& input,
rmm::cuda_stream_view stream,
rmm::device_async_resource_ref mr);

std::pair<std::unique_ptr<column>, std::unique_ptr<column>> windows_from_offset(
Contributor

Should we consider including an @brief description here?

cpp/include/cudf/rolling.hpp (resolved)
scalar const& length,
scalar const& offset,
window_type const window_type,
bool only_preceding,
Contributor

Can this also be const?

@davidwendt (Contributor)

I don't think we should make copy-by-value parameters like this const. It is not needed by the caller and places an unnecessary restriction on the internal implementation.

Contributor

Point taken, @davidwendt. For my own understanding, should the preceding parameter (window_type) be subject to the same convention?

Maybe it doesn't apply, since it isn't obviously an enum?

Contributor

IMO it should apply to this enum parameter as well.
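For reference, a minimal sketch of the point about top-level const on by-value parameters (hypothetical function name, not from the PR):

#include <type_traits>

void f(int x);       // public declaration without const
void f(int const x)  // adding const in the definition declares the same function,
{                    // because top-level const is ignored in the function's type...
  // x += 1;         // ...but inside the body the local copy cannot be modified.
  (void)x;
}

static_assert(std::is_same_v<void(int), void(int const)>,
              "callers cannot tell the difference");

int main() { f(42); }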

@mythrocks (Contributor) commented Jul 18, 2024

Forgive me if this is a complete tangent: This utility is in service of computing the window bounds for range window queries?

I was wondering if you'd already had a look at the grouped-range functions implemented in grouped_rolling.cu. They handle the following cases:

  1. whether the order-by column is sorted in ASC/DESC.
  2. whether or not there is also a grouping column (so that the window does not cross group boundaries).
  3. whether the order-by column contains nulls (sorted to either the beginning or the end of the column).

Note that the window bounds are calculated separately, depending upon the ordering, null-placement, and the presence of the group-by columns.

Edit: For clarity, this is in service of Spark's window-function SQL:

SELECT SUM(dollars) OVER (
   PARTITION BY dept_id
   ORDER BY sale_date ASC
   RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND 7 DAYS FOLLOWING ) AS two_week_earnings
...

@wence- added the 5 - DO NOT MERGE (Hold off on merging; see PR for details) label on Jul 19, 2024
@wence- (Contributor, Author) commented Jul 19, 2024

Deciding if this should slip to 24.10

@wence- marked this pull request as draft on July 19, 2024 17:37
@wence- (Contributor, Author) commented Jul 19, 2024

Will retarget to 24.10 and look for opportunities to reuse or share the existing implementations.

@wence- removed the 5 - DO NOT MERGE (Hold off on merging; see PR for details) label on Jul 19, 2024
/**
* @brief Indicates which endpoints a rolling window contains.
*/
enum class window_type : int32_t {
@mythrocks (Contributor) commented Jul 19, 2024

Trivial nit: One wonders if we might use a more descriptive name for this type. Something like window_margin(_type)?
Edit: I mention this because window_bounds was initially named window_type, then window_bounds_type, and then window_bounds.

@mythrocks (Contributor)

Will retarget to 24.10 and look for opportunities to reuse or share the existing implementations.

Your openness to this is deeply appreciated. I'd be happy to assist. My current implementation certainly doesn't tackle open/closed margins, or negative range values.

wence- added 16 commits December 3, 2024 10:54
Modernise use of op_impl to omit unnecessary T template parameter.
e.g. thrust::less, not std::less.
@wence- force-pushed the wence/fea/polars-rolling branch from 7d55d4e to d9010db on December 9, 2024 18:13
copy-pr-bot (bot) commented Dec 9, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions bot added labels: Java (Affects Java cuDF API), cudf.pandas (Issues specific to cudf.pandas), cudf.polars (Issues specific to cudf.polars) on Dec 9, 2024
@wence- changed the base branch from branch-24.08 to branch-25.02 on December 12, 2024 17:13