feat: Adding ewm_mean #1298
Conversation
nice! thanks for doing this. Initial comment: from what I remember, pandas and Polars might have handled nulls differently here.
Added one test. I haven't added anything for Arrow because I'm not sure if we want to add it or not.
thanks for updating! i think this is close
```python
from tests.utils import ConstructorEager
from tests.utils import assert_equal_data


data = {"a": [1, 1, 2], "b": [1, 2, 3]}
```
can we include a test with nulls too please?
I added one at the bottom.
tests/expr_and_series/ewm_test.py (outdated)
```python
    adjust: bool,  # noqa: FBT001
) -> None:
    if "pyarrow_" in str(constructor) or "dask" in str(constructor):  # remove
        pytest.skip()
```
can we use request.applymarker(pytest.mark.xfail) please? Then the test actually runs and we check that it fails, as opposed to being skipped (also, if I remember correctly, pytest.skip had some undesirable behaviour)
Thanks a lot for the explanation.
narwhals/series.py (outdated)
```
pandas and Polars handle nulls differently. So, calculating ewm over
a sequence with null values leads to distinct results:
```
I think it's that Polars preserves null values, whereas pandas forward-fills. Can we preserve null values for pandas too?
The parameters for both Polars and pandas are the same; I don't see how to do what you are asking, sorry. Or do you mean that Narwhals should make both behave the same way? In that case, for example, Expr.fill_null returns different values for each library.
I don't think you need to fill the nulls, just preserve them - so, if a value was null to start with, it should be null in the result too. We do something like that in timestamp.
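For pandas, one way to preserve nulls (a sketch, not necessarily the PR's actual implementation) is to compute the ewm and then re-apply the input's null mask:

```python
import pandas as pd

s = pd.Series([2.0, 4.0, None, 3.0])
# pandas carries the last ewm value forward at null positions...
ewm = s.ewm(com=1, ignore_na=True).mean()
# ...so re-apply the original null mask to match Polars' null-preserving behaviour
result = ewm.mask(s.isna())
```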
Doing it like you do in timestamp solves it for None, but with a series like [1.0, float("nan"), 4.0] we still have this:

Pandas:
```
0    1.0
1    NaN
2    3.4
dtype: float64
```

Polars:
```
shape: (3,)
Series: '' [f64]
[
    1.0
    NaN
    NaN
]
```
sure, that's because Polars treats 'nan' differently from null - but if we use the null value for both, does the result match? e.g. [1., None, 4.]?
for Polars, 'nan' is only the result of illegal mathematical operations (like 0/0), so it's far rarer to encounter it there

regarding the older-versions CI - I'd suggest making a separate virtual environment and installing the versions which show up in the "show deps" step of the CI job
Older versions of Polars give similar results to pandas when there is a null (None). So with an input of {"a": [2.0, 4.0, None, 3.0]}:

```
Expected: {'a': [2.0, 3.3333333333333335, nan, 3.142857142857143]}
Got:      {'a': [2.0, 3.3333333333333335, 3.3333333333333335, 3.142857142857143]}
```

For the moment I'm "xfailing" that test with older versions. I'm not sure if that's correct.
thanks - is it possible to use pl.when to preserve the null values for old versions of Polars? Either that, or raise NotImplementedError for old versions of Polars for now, and let's create an issue to track preserving null values in old Polars versions
Thanks for the review. I added the "raise" for now. I'll do the follow-up PR.
thanks!
tests/expr_and_series/ewm_test.py (outdated)
```python
    df = nw.from_native(constructor({"a": [2.0, 4.0, None, 3.0]}))
    result = df.select(nw.col("a").ewm_mean(com=1, ignore_nulls=ignore_nulls))

    if ignore_nulls:
```
i'd suggest including the list as something you parametrise over, rather than putting logic (if/then) in the test. In general, we should use if/then in tests only when necessary; it's something I try to avoid if possible (and sometimes it's not possible, unfortunately)
Changed that.
```python
        com: float | None = None,
        span: float | None = None,
        half_life: float | None = None,
        alpha: float | None = None,
        adjust: bool = True,
        min_periods: int = 1,
        ignore_nulls: bool = False,
```
there are a lot of parameters here - do we have a test which hits each of them?
I'll add tests for the parameters then... (I can't hit all the parameters in only one test - at least not the first 4, I think). Is that what you meant?
tests/expr_and_series/ewm_test.py (outdated)
```python
    if adjust:
        expected = {
            "a": [1.0, 1.0, 1.5714285714285714],
            "b": [1.0, 1.6666666666666667, 2.4285714285714284],
        }
    else:
        expected = {
            "a": [1.0, 1.0, 1.5],
            "b": [1.0, 1.5, 2.25],
        }
```
same
I added tests for the parameters.
Added the
awesome, thanks @DeaMariaLeon! I just made some minor edits based on #1401.

I think "calculating ewm over a sequence with null values leads to distinct results" isn't quite exact, because the result is the same if we consider that pandas' null value is 'nan' and Polars' null value is None.

The difference is just that Polars (and PyArrow, and I think all other libraries) treat 'nan' as just another floating-point number (https://en.wikipedia.org/wiki/IEEE_754), and it's generally rare to encounter 'nan' in those libraries.

If we initialise a Series with [None, 3.5, float('nan')], then pandas treats it as [null, 3.5, null], whereas for other libraries it's [null, 3.5, nan] - but it's quite rare to initialise a Series from a list like this with both None and 'nan'; you'd make a Series from some data source (e.g. a file), and then each library would encode missing values according to its own definition of missing values.

Sorry if this explanation is too long or pedantic 😄
What type of PR is this? (check all applicable)
Related issues
Checklist
If you have comments or can explain your changes, please do so below.
I didn't add anything for Arrow because I'm waiting to see the feedback for #1290