feat: add `Series|Expr.rolling_sum` method #1395

FBruzzesi · 2024-11-17T15:29:16Z

What type of PR is this? (check all applicable)

Related issues

Related issue feat: support rolling / ewm #1254
Closes api: "unstable" features #1367

Checklist

Code follows style guide (ruff)
Tests added
Documented the changes

If you have comments or can explain your changes, please do so below

I wanted to start with sum, assuming it would be the simplest one for arrow to implement in a non-naive way.

In some benchmarks with 1M rows, the performance of this approach is in the same ballpark as pandas, while the naive approach in #1290 is orders of magnitude slower. I would consider this the way forward.

narwhals/exceptions.py

FBruzzesi · 2024-11-17T15:31:32Z

narwhals/stable/v1/__init__.py

+        msg = (
+            "`Series.rolling_sum` is being called from the stable API although considered "
+            "an unstable feature."
+        )
+        warn(message=msg, category=NarwhalsUnstableWarning, stacklevel=find_stacklevel())


Marco, I think you wanted to expand this message and mention how to silence the warning. Did I understand that correctly?

Is the following suggestion what you had in mind?

import warnings
from narwhals.exceptions import NarwhalsUnstableWarning

warnings.simplefilter("ignore", NarwhalsUnstableWarning)
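For reference, here is a minimal sketch of how such a filter behaves when it is scoped with `warnings.catch_warnings`, so that it does not leak into the rest of the process. `DemoUnstableWarning`, `unstable_feature` and `call_silenced` are hypothetical stand-ins made up for this example; they are not part of narwhals.

```python
import warnings


class DemoUnstableWarning(UserWarning):
    """Hypothetical stand-in for NarwhalsUnstableWarning."""


def unstable_feature() -> int:
    # Emits the warning the same way a library would, then returns a value.
    warnings.warn("unstable feature called", DemoUnstableWarning, stacklevel=2)
    return 42


def call_silenced() -> int:
    # The filter only applies inside this block; the previous filters are
    # restored when the context manager exits.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", DemoUnstableWarning)
        return unstable_feature()
```

A global `warnings.simplefilter("ignore", ...)` call, as in the suggestion above, has the same effect but stays active for the rest of the session.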

FBruzzesi · 2024-11-17T15:36:13Z

narwhals/_arrow/series.py

@@ -869,6 +869,52 @@ def cum_prod(self: Self, *, reverse: bool) -> Self:
        )
        return self._from_native_series(result)

+    def rolling_sum(


Please take a second look at this implementation, especially the center=True case.

The overall idea is to:

compute the cumulative sum

take the difference with it shifted by window_size

then only consider those windows that have at least min_periods, otherwise set it to null

For the center case, this is a bit more tricky. I am adding an offset to the start and end of the array, then performing the same computation, and finally slicing the array.

~~Now that I am thinking about it, a test with an even-sized window might be useful, as the padding would not be symmetric~~ Adjusted for even-sized windows and added a test.
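The steps above can be sketched in plain Python as follows. This is only an illustration, not the pyarrow code in this PR: `rolling_sum_sketch` is a hypothetical name, `None` stands in for null, and the padding sizes for the centered case (`window_size // 2` on the left, one less on the right for even windows) are an assumption chosen so that the result lines up with pandas.

```python
from __future__ import annotations

from itertools import accumulate


def rolling_sum_sketch(
    values: list[float | None],
    window_size: int,
    min_periods: int | None = None,
    *,
    center: bool = False,
) -> list[float | None]:
    """Rolling sum via differences of a cumulative sum (illustrative only)."""
    min_periods = window_size if min_periods is None else min_periods

    if center:
        # Pad with nulls on both sides; an even window gets one less on the right.
        offset_left = window_size // 2
        offset_right = offset_left - (window_size % 2 == 0)
    else:
        offset_left = offset_right = 0
    padded = [None] * offset_left + list(values) + [None] * offset_right

    # Cumulative sum (nulls count as 0) and cumulative count of valid values,
    # each with a leading 0 so that a window is a plain difference.
    cum_sum = [0.0, *accumulate(0.0 if v is None else float(v) for v in padded)]
    cum_count = [0, *accumulate(0 if v is None else 1 for v in padded)]

    out: list[float | None] = []
    for i in range(len(padded)):
        start = max(i + 1 - window_size, 0)
        total = cum_sum[i + 1] - cum_sum[start]
        count = cum_count[i + 1] - cum_count[start]
        # Keep only windows with at least min_periods valid observations.
        out.append(total if count >= min_periods else None)

    # Drop the leading entries introduced by the padding.
    return out[offset_left + offset_right :]
```

For example, `rolling_sum_sketch([1, 2, 3, 4], 2)` gives `[None, 3.0, 5.0, 7.0]`. One trade-off of the cumulative-sum approach is that differences of large running totals can lose floating-point precision compared with summing each window directly.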

MarcoGorelli

wow, this is clever! do you have an idea for how to do the mean / min / max cases?

narwhals/exceptions.py

MarcoGorelli · 2024-11-17T17:54:12Z

narwhals/stable/v1/__init__.py

+        msg = (
+            "`Series.rolling_sum` is being called from the stable API although considered "
+            "an unstable feature."
+        )
+        warn(message=msg, category=NarwhalsUnstableWarning, stacklevel=find_stacklevel())


FBruzzesi · 2024-11-17T18:05:31Z

wow, this is clever! do you have an idea for how to do the mean / min / max cases?

mean should be possible by combining sum and count
For min/max, I will benchmark the performance of falling back to numpy.lib.stride_tricks.sliding_window_view
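A rough sketch of the "sum and count" idea for the mean, in plain Python with `None` as null. `rolling_mean_sketch` is a hypothetical name used only here, and this is not code from the PR.

```python
from __future__ import annotations

from itertools import accumulate


def rolling_mean_sketch(
    values: list[float | None],
    window_size: int,
    min_periods: int | None = None,
) -> list[float | None]:
    """Rolling mean as rolling sum divided by rolling count (illustrative only)."""
    min_periods = window_size if min_periods is None else min_periods

    # Leading 0 so each window is the difference of two cumulative values.
    cum_sum = [0.0, *accumulate(0.0 if v is None else float(v) for v in values)]
    cum_count = [0, *accumulate(0 if v is None else 1 for v in values)]

    out: list[float | None] = []
    for i in range(len(values)):
        start = max(i + 1 - window_size, 0)
        total = cum_sum[i + 1] - cum_sum[start]
        count = cum_count[i + 1] - cum_count[start]
        # A window with no valid values has no mean, whatever min_periods is.
        if count >= min_periods and count > 0:
            out.append(total / count)
        else:
            out.append(None)
    return out
```

The count here is the number of non-null values in the window, so nulls are skipped rather than treated as zeros in the average.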

MarcoGorelli · 2024-11-17T19:46:42Z

Love the creativity here

I tried running this hypothesis test

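import random

import hypothesis.strategies as st
import pandas as pd
import pyarrow as pa
import pytest
from hypothesis import given

import narwhals.stable.v1 as nw  # assumed: the stable namespace, which emits the warning filtered below
from tests.utils import assert_equal_data  # assumed: helper from the narwhals test suite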

@given(
    center = st.booleans(),
    values = st.lists(st.floats(-10, 10), min_size=3, max_size=10),
)
@pytest.mark.filterwarnings('ignore:.*:narwhals.exceptions.NarwhalsUnstableWarning')
def test_rolling_sum_hypothesis(center: bool, values: list[float]) -> None:
    s = pd.Series(values)
    n_missing = random.randint(0, len(s)-1)
    window_size = random.randint(1, len(s))
    min_periods = random.randint(0, window_size)
    mask = random.sample(range(len(s)), n_missing)
    s[mask] = None
    df = pd.DataFrame({'a': s})
    expected = s.rolling(window=window_size, center=center, min_periods=min_periods).sum().to_frame('a')
    result = nw.from_native(pa.Table.from_pandas(df)).select(nw.col('a').rolling_sum(window_size, center=center, min_periods=min_periods))
    expected_dict = nw.from_native(expected, eager_only=True).to_dict(as_series=False)
    assert_equal_data(result, expected_dict)

and it's picking up some small inconsistencies:

In [19]: s
Out[19]: 
0    0.0
1    NaN
2    0.0
dtype: float64

In [20]: s.rolling(min_periods=0, center=False, window=2).sum()
Out[20]: 
0    0.0
1    0.0
2    0.0
dtype: float64

In [21]: nw.from_native(pa.chunked_array([s]), series_only=True).rolling_sum(min_periods=0, center=False, window_size=2).to_native()
Out[21]: 
<pyarrow.lib.ChunkedArray object at 0x7f7cd5793ee0>
[
  [
    null,
    null,
    null
  ]
]

In [22]: nw.from_native(pl.from_pandas(s), series_only=True).rolling_sum(min_periods=0, center=False, window_size=2).to_native()
Out[22]: 
shape: (3,)
Series: '' [f64]
[
        0.0
        0.0
        0.0
]

FBruzzesi · 2024-11-17T20:36:45Z

I tried running this hypothesis test

and it's picking up some small inconsistencies

OK, `min_periods = min_periods or window_size` evaluates to 2 here, because `0 or 2` is 2. Let me adjust.
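To spell out the pitfall: `or` falls back on any falsy value, so an explicit 0 is treated like None. A small sketch with two hypothetical helper names (not taken from the PR):

```python
from __future__ import annotations


def resolve_min_periods_buggy(min_periods: int | None, window_size: int) -> int:
    # `or` treats 0 as "not provided", so an explicit 0 is silently replaced.
    return min_periods or window_size


def resolve_min_periods_fixed(min_periods: int | None, window_size: int) -> int:
    # Only fall back to window_size when the argument is really missing.
    return window_size if min_periods is None else min_periods
```

With `min_periods=0` and `window_size=2`, the first helper returns 2 and the second returns 0.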

FBruzzesi · 2024-11-17T21:44:26Z

tests/expr_and_series/rolling_sum_test.py

+        "kwargs": {"window_size": 0},
+        "expected": [float("nan"), 1.0, 3.0, 3.0, 7.0, 13.0, 24.0],
+    },
+    # There are still some edge cases to take care of with nulls and min_periods=0:


Example:

In [1]: import pandas as pd

In [2]: import polars as pl

In [3]: data = [float("nan"), 1, 2]

In [4]: pl.Series(data).rolling_sum(2, min_periods=0)
Out[4]: 
shape: (3,)
Series: '' [f64]
[
        NaN
        NaN
        3.0
]

In [5]: pd.Series(data).rolling(2, min_periods=0).sum()
Out[5]: 
0    0.0
1    1.0
2    3.0
dtype: float64

Honestly, this seems buggy in both?

polars: why is the second value NaN if min_periods=0?

pandas: why is the first value replaced with 0?

in Polars min_periods refers to missing data, not to 'nan'

the mean of 'nan' and 1 is 'nan'
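The two outputs in the example above can be reproduced with a small plain-Python sketch of the two conventions. This is not how either library is implemented; `window_sum` and `rolling_sum_demo` are hypothetical helpers, and the `nan_is_missing` flag is just a way to switch between "NaN is an ordinary float value" and "NaN counts as missing".

```python
from __future__ import annotations

import math


def window_sum(
    window: list[float | None], min_periods: int, *, nan_is_missing: bool
) -> float | None:
    """Sum one window; None is always missing, NaN only if nan_is_missing."""

    def is_missing(value: float | None) -> bool:
        return value is None or (nan_is_missing and math.isnan(value))

    valid = [v for v in window if not is_missing(v)]
    if len(valid) < min_periods:
        return None
    # The sum of an empty window is 0.0; a NaN kept as valid propagates.
    return sum(valid, 0.0)


def rolling_sum_demo(
    data: list[float | None],
    window_size: int,
    min_periods: int,
    *,
    nan_is_missing: bool,
) -> list[float | None]:
    return [
        window_sum(
            data[max(i + 1 - window_size, 0) : i + 1],
            min_periods,
            nan_is_missing=nan_is_missing,
        )
        for i in range(len(data))
    ]
```

With `data = [float("nan"), 1, 2]`, `window_size=2` and `min_periods=0`, `nan_is_missing=False` yields NaN, NaN, 3.0, while `nan_is_missing=True` yields 0.0, 1.0, 3.0. So both results are consistent once you fix what "missing" means.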

FBruzzesi · 2024-11-17T21:47:25Z

I would tend towards restricting the API to:

window_size strictly positive
min_periods either None or strictly positive

and raise in other cases.

Edit: old polars seems to break if that's not the case.
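As a sketch of what that restriction could look like (the function name, return value and error messages are hypothetical, not necessarily what ends up in the PR):

```python
from __future__ import annotations


def validate_rolling_arguments(
    window_size: int, min_periods: int | None
) -> tuple[int, int]:
    """Return (window_size, min_periods) after validation, or raise ValueError."""
    if window_size < 1:
        msg = "window_size must be greater than or equal to 1"
        raise ValueError(msg)
    if min_periods is None:
        # Default: require a full window.
        return window_size, window_size
    if min_periods < 1:
        msg = "min_periods must be greater than or equal to 1"
        raise ValueError(msg)
    return window_size, min_periods
```

Raising early keeps the behaviour identical across backends instead of relying on how each one handles 0 or negative values.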

MarcoGorelli · 2024-11-18T10:26:06Z

thanks - shall we also include the hypothesis test?

tests/expr_and_series/rolling_sum_test.py

MarcoGorelli

very impressive @FBruzzesi , well done

feat: add Series|Expr.rolling_sum method

02eb2c9

github-actions bot added the enhancement New feature or request label Nov 17, 2024

adjust for even window and center=True

8afd368

MarcoGorelli reviewed Nov 17, 2024

improvements

669b6bd

strictly positive window_size and min_periods

83cf14c

DeaMariaLeon mentioned this pull request Nov 18, 2024

api: "unstable" features #1367

Closed

FBruzzesi added 3 commits November 18, 2024 14:23

add hypothesis test

2f04ab5

better docstrings

efe4ff1

forgot stable

af86577

MarcoGorelli reviewed Nov 18, 2024

skip hyp for pandas < (1, 0)

a2d1df4

MarcoGorelli approved these changes Nov 18, 2024

MarcoGorelli merged commit bbf2aa3 into main Nov 18, 2024
22 checks passed

FBruzzesi deleted the feat/rolling-sum branch November 18, 2024 13:48
