feat: add Series|Expr.rolling_sum method #1395
msg = (
    "`Series.rolling_sum` is being called from the stable API although considered "
    "an unstable feature."
)
warn(message=msg, category=NarwhalsUnstableWarning, stacklevel=find_stacklevel())
Marco, I think you wanted to expand and mention how to silence this warning, did I understand that correctly?
Is the following suggestion what you had in mind?
import warnings
from narwhals.exceptions import NarwhalsUnstableWarning
warnings.simplefilter("ignore", NarwhalsUnstableWarning)
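If a process-wide filter is too broad, the same filter can be scoped to a single call with `warnings.catch_warnings()`. The sketch below is only an illustration: `UnstableWarningStandIn`, `unstable_call`, and `call_quietly` are hypothetical names used so the snippet runs without narwhals installed; in real code the category would be `NarwhalsUnstableWarning`.

```python
import warnings


class UnstableWarningStandIn(UserWarning):
    """Hypothetical stand-in for narwhals.exceptions.NarwhalsUnstableWarning."""


def unstable_call() -> int:
    # Hypothetical function that warns the way an unstable feature might.
    warnings.warn("unstable feature", UnstableWarningStandIn, stacklevel=2)
    return 42


def call_quietly() -> int:
    # The "ignore" filter only applies inside the `with` block;
    # the previous filter state is restored on exit.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", UnstableWarningStandIn)
        return unstable_call()
```

Outside `call_quietly`, calls to `unstable_call` still warn as usual, so other unstable usages are not hidden by accident.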
yup, nice!
@@ -869,6 +869,52 @@ def cum_prod(self: Self, *, reverse: bool) -> Self:
        )
        return self._from_native_series(result)

    def rolling_sum(
Please take a close look at this implementation, especially for the case center=True.
The overall idea is to:
- compute the cumulative sum
- take the difference with it shifted by window_size
- then only consider those windows that have at least min_periods observations, otherwise set the result to null
For the center case, this is a bit more tricky. I am adding an offset to the start and end of the array, then performing the same computation, and finally slicing the array.
Now that I am thinking about it, a test with an even-sized window might be useful, as padding would not be symmetric. Adjusted for even-sized windows and added a test.
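The steps described in this comment can be sketched in plain Python. This is not the PR's pyarrow code, only a minimal illustration of the cumulative-sum idea on a list where `None` stands for null. The function name `rolling_sum_sketch` and the centring convention (an even-sized window takes its extra element on the left) are assumptions made for the example.

```python
from __future__ import annotations

from itertools import accumulate


def rolling_sum_sketch(values, window_size, *, min_periods=None, center=False):
    """Rolling sum via cumulative sums; None marks a null entry."""
    if window_size < 1:
        raise ValueError("window_size must be >= 1")
    if min_periods is None:
        min_periods = window_size

    data = list(values)
    if center:
        # Pad both ends so a trailing window over the padded data
        # corresponds to a centred window over the original data.
        left = window_size // 2
        right = window_size - 1 - left  # one less than `left` for even sizes
        data = [None] * left + data + [None] * right

    # Cumulative sum of the values (nulls count as 0) and of the valid flags.
    cum_sum = [0, *accumulate(0 if v is None else v for v in data)]
    cum_count = [0, *accumulate(0 if v is None else 1 for v in data)]

    out = []
    for end in range(1, len(data) + 1):
        start = max(end - window_size, 0)
        # Difference with the cumulative sum shifted by window_size.
        total = cum_sum[end] - cum_sum[start]
        count = cum_count[end] - cum_count[start]
        # Keep only windows with at least min_periods valid observations.
        out.append(total if count >= min_periods else None)

    # For the centred case, drop the entries produced by the left padding.
    return out[window_size - 1:] if center else out
```

For example, `rolling_sum_sketch([1, 2, 3, 4, 5], 4, center=True)` returns `[None, None, 10, 14, None]`. With floats, subtracting cumulative sums can introduce small rounding differences compared with summing each window directly.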
wow, this is clever! do you have an idea for how to do the mean / min / max cases?
Love the creativity here! I tried running this hypothesis test:
import random

import hypothesis.strategies as st
import narwhals.stable.v1 as nw  # assumed: the stable namespace, which emits the unstable warning
import pandas as pd
import pyarrow as pa
import pytest
from hypothesis import given

from tests.utils import assert_equal_data  # test helper; import path assumed

@given(
    center = st.booleans(),
    values = st.lists(st.floats(-10, 10), min_size=3, max_size=10),
)
@pytest.mark.filterwarnings('ignore:.*:narwhals.exceptions.NarwhalsUnstableWarning')
def test_rolling_sum_hypothesis(center: bool, values: list[float]) -> None:
    s = pd.Series(values)
    # Randomly pick the number of nulls, the window size and min_periods.
    n_missing = random.randint(0, len(s)-1)
    window_size = random.randint(1, len(s))
    min_periods = random.randint(0, window_size)
    mask = random.sample(range(len(s)), n_missing)
    s[mask] = None
    df = pd.DataFrame({'a': s})
    # pandas is the reference; the pyarrow-backed result is compared against it.
    expected = s.rolling(window=window_size, center=center, min_periods=min_periods).sum().to_frame('a')
    result = nw.from_native(pa.Table.from_pandas(df)).select(nw.col('a').rolling_sum(window_size, center=center, min_periods=min_periods))
    expected_dict = nw.from_native(expected, eager_only=True).to_dict(as_series=False)
    assert_equal_data(result, expected_dict)
and it's picking up some small inconsistencies:
Well ok
        "kwargs": {"window_size": 0},
        "expected": [float("nan"), 1.0, 3.0, 3.0, 7.0, 13.0, 24.0],
    },
    # There are still some edge cases to take care of with nulls and min_periods=0:
Example:
In [1]: import pandas as pd
In [2]: import polars as pl
In [3]: data = [float("nan"), 1, 2]
In [4]: pl.Series(data).rolling_sum(2, min_periods=0)
Out[4]:
shape: (3,)
Series: '' [f64]
[
NaN
NaN
3.0
]
In [5]: pd.Series(data).rolling(2, min_periods=0).sum()
Out[5]:
0 0.0
1 1.0
2 3.0
dtype: float64
Honestly, this seems buggy in both?
- polars: why is the second value NaN if min_periods=0?
- pandas: why replace the first value with a 0?
In Polars, min_periods refers to missing data, not to 'nan': the mean of 'nan' and 1 is 'nan'.
I would tend towards restricting the API to:
and raise in other cases. Edit: old polars seems to break if that's not the case.
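The two outputs in the example above can be reproduced with a naive pure-Python rolling sum that differs only in whether NaN counts as missing. This is a sketch for illustration, not how either library is implemented; `rolling_sum_naive` and its `nan_is_missing` flag are hypothetical names.

```python
import math


def rolling_sum_naive(values, window_size, *, min_periods, nan_is_missing):
    """Naive trailing rolling sum; None marks a null entry."""
    out = []
    for end in range(1, len(values) + 1):
        window = values[max(end - window_size, 0):end]
        if nan_is_missing:
            # pandas-like: NaN is dropped like a null before summing.
            present = [v for v in window if v is not None and not math.isnan(v)]
        else:
            # Polars-like: only nulls are dropped, NaN is a regular float
            # and propagates through the sum.
            present = [v for v in window if v is not None]
        out.append(float(sum(present)) if len(present) >= min_periods else None)
    return out


data = [float("nan"), 1, 2]
pandas_like = rolling_sum_naive(data, 2, min_periods=0, nan_is_missing=True)
polars_like = rolling_sum_naive(data, 2, min_periods=0, nan_is_missing=False)
```

Here `pandas_like` is `[0.0, 1.0, 3.0]`, while `polars_like` has NaN in the first two positions and `3.0` in the last, matching the session above.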
thanks - shall we also include the hypothesis test?
very impressive @FBruzzesi, well done
What type of PR is this? (check all applicable)
Related issues
Checklist
If you have comments or can explain your changes, please do so below
So I wanted to start with sum, assuming it would be simpler to implement for arrow in a way that is not naive. Running some benchmarks, the performance of this with 1M rows is in the same ballpark as pandas, while the naive way in #1290 is orders of magnitude slower. I would consider this the way to go forward.