feat: Adding ewm_mean #1298

Merged
merged 35 commits into from
Nov 19, 2024

Conversation

DeaMariaLeon
Member

@DeaMariaLeon DeaMariaLeon commented Nov 1, 2024

What type of PR is this? (check all applicable)

  • 💾 Refactor
  • ✨ Feature
  • 🐛 Bug Fix
  • 🔧 Optimization
  • 📝 Documentation
  • ✅ Test
  • 🐳 Other

Related issues

  • Related issue #
  • Closes #

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below.

I didn't add anything for Arrow because I'm waiting to see the feedback for #1290

@DeaMariaLeon DeaMariaLeon changed the title from "wip: wwm" to "wip: Adding ewm_mean" Nov 1, 2024
@DeaMariaLeon
Member Author

The JavaScript is there to render LaTeX:

[Screenshot 2024-11-01 at 15 41 59]

@DeaMariaLeon DeaMariaLeon changed the title from "wip: Adding ewm_mean" to "feat: Adding ewm_mean" Nov 1, 2024
@github-actions github-actions bot added the enhancement New feature or request label Nov 1, 2024
@MarcoGorelli
Member

nice! thanks for doing this

initial comment: from what I remember, pandas and Polars might have handled adjust differently - could you add a test which parametrises over adjust being True and False? (or maybe i don't remember right and they're already aligned, either way, would be good to verify)

@DeaMariaLeon
Member Author

Added one test for adjust.

I haven't added anything for Arrow because I'm not sure if we want to add it or not.

Member

@MarcoGorelli MarcoGorelli left a comment

thanks for updating! i think this is close

from tests.utils import ConstructorEager
from tests.utils import assert_equal_data

data = {"a": [1, 1, 2], "b": [1, 2, 3]}
Member

can we include a test with nulls too please?

Member Author

I added one at the bottom.

    adjust: bool,  # noqa: FBT001
) -> None:
    if "pyarrow_" in str(constructor) or "dask" in str(constructor):  # remove
        pytest.skip()
Member

can we use request.applymarker(pytest.mark.xfail) instead, please? then the test actually runs and we check that it fails, as opposed to being skipped (also, if i remember correctly, pytest.skip had some undesirable behaviour)
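
A minimal sketch of the suggestion above, assuming a pytest test that receives the built-in `request` fixture. The helper name `xfail_if_unsupported` and the `UNSUPPORTED` tuple are hypothetical and only serve the illustration; `request.applymarker` and `pytest.mark.xfail` are the pytest pieces named in the comment.

```python
from __future__ import annotations

import pytest

# Backend name fragments taken from the diff snippet under review.
UNSUPPORTED = ("pyarrow_", "dask")


def xfail_if_unsupported(request: pytest.FixtureRequest, constructor: object) -> None:
    """Mark the running test as an expected failure for unsupported backends.

    Unlike pytest.skip(), the test body still executes afterwards, so the
    failure is actually observed rather than assumed.
    """
    if any(fragment in str(constructor) for fragment in UNSUPPORTED):
        request.applymarker(pytest.mark.xfail)
```

In a test, this would be called on the first line of the body, before any backend-specific code runs.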

Member Author

Thanks a lot for the explanation.

Comment on lines 467 to 468
pandas and Polars handle nulls differently. So, calculating ewm over
a sequence with null values leads to distinct results:
Member

I think it's that Polars preserves null values, whereas pandas forward-fills

Can we preserve null values for pandas too?

Member Author

The parameters for both Polars and pandas are the same, so I don't see how to do what you are asking, sorry.
Or do you mean that Narwhals should make both behave the same way?
In that case, for example, Expr.fill_null returns different values for each library.

Member

I don't think you need to fill the nulls, but just preserve them - so, if a value was null to start with, it should be null in the result too

we do something like that in timestamp

Member Author

Doing what you do in timestamp solves it for None, but with a Series like [1.0, float("nan"), 4.0] we still have this:

Pandas:

0    1.0
1    NaN
2    3.4
dtype: float64

Polars:

shape: (3,)
Series: '' [f64]
[
        1.0
        NaN
        NaN
]

Member

sure, because Polars treats 'nan' differently from null - but if we use the null value for both, does the result match? e.g. [1., None, 4.]?

Member

for Polars 'nan' is only the result of illegal mathematical operations (like 0/0) so it's far rarer to encounter it there

regarding CI for older versions - i'd suggest making a separate virtual environment and installing the versions which show up in the "show deps" step of the CI job

Member Author

Older versions of polars give similar results to pandas when there is a null (None).

So with an input of: {"a": [2.0, 4.0, None, 3.0]} Then:

Expected: {'a': [2.0, 3.3333333333333335, nan, 3.142857142857143]}
Got: {'a': [2.0, 3.3333333333333335, 3.3333333333333335, 3.142857142857143]}

For the moment I'm "xfailing" that test with older versions. I'm not sure if that's correct.
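
To make the "Expected" and "Got" lines above concrete, here is a pure-Python sketch rather than the Narwhals, pandas, or Polars implementation. It assumes the adjusted formula with weights that skip nulls, which is what reproduces the quoted numbers with com=1. The function name `ewm_mean_ignoring_nulls` and the `preserve_nulls` flag are hypothetical.

```python
from __future__ import annotations


def ewm_mean_ignoring_nulls(
    values: list[float | None], com: float, *, preserve_nulls: bool
) -> list[float | None]:
    """Adjusted EWM mean whose weights skip null (None) entries."""
    alpha = 1.0 / (1.0 + com)
    numerator = 0.0  # running weighted sum of the observed values
    denominator = 0.0  # running sum of the weights
    out: list[float | None] = []
    for value in values:
        if value is None:
            if preserve_nulls or denominator == 0.0:
                out.append(None)  # null in, null out
            else:
                out.append(numerator / denominator)  # carry the last result forward
            continue
        numerator = value + (1.0 - alpha) * numerator
        denominator = 1.0 + (1.0 - alpha) * denominator
        out.append(numerator / denominator)
    return out


data = [2.0, 4.0, None, 3.0]
print(ewm_mean_ignoring_nulls(data, com=1, preserve_nulls=True))
print(ewm_mean_ignoring_nulls(data, com=1, preserve_nulls=False))
```

With `preserve_nulls=True` the third entry stays null, as in the "Expected" line (where it shows up as nan); with `preserve_nulls=False` it repeats the previous result, as in the "Got" line.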

Member

thanks - is it possible to use pl.when to preserve the null values for old versions of Polars?

either that, or raise NotImplementedError for old versions of Polars for now, and let's create an issue to track preserving null values in old Polars versions

Member Author

Thanks for the review. I added the "raise" for now. I'll do the follow-up PR.

Member

thanks!

df = nw.from_native(constructor({"a": [2.0, 4.0, None, 3.0]}))
result = df.select(nw.col("a").ewm_mean(com=1, ignore_nulls=ignore_nulls))

if ignore_nulls:
Member

i'd suggest including the list as something you parametrise over, rather than including logic (if/then) in the test. in general, we should use if/then in tests only when necessary, it's something i try to avoid if possible (and sometimes it's not possible unfortunately)

Member Author

Changed that.

Comment on lines +180 to +186
com: float | None = None,
span: float | None = None,
half_life: float | None = None,
alpha: float | None = None,
adjust: bool = True,
min_periods: int = 1,
ignore_nulls: bool = False,
Member

there's a lot of parameters here, do we have a test which hits each of them?

Member Author

@DeaMariaLeon DeaMariaLeon Nov 12, 2024

I'll add tests for the parameters then... (I can't hit all the parameters in only one test, at least not the first 4 I think).. Is that what you meant?
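
The first four parameters cannot all be exercised in a single call because each one is just a different way of specifying the same smoothing factor. The sketch below uses the standard conversion formulas. The helper name `resolve_alpha` and the choice of `ValueError` are assumptions for illustration, not the Narwhals code.

```python
from __future__ import annotations

import math


def resolve_alpha(
    com: float | None = None,
    span: float | None = None,
    half_life: float | None = None,
    alpha: float | None = None,
) -> float:
    """Convert whichever decay parameter was given into the smoothing factor."""
    given = [p for p in (com, span, half_life, alpha) if p is not None]
    if len(given) != 1:
        msg = "exactly one of com, span, half_life, alpha must be set"
        raise ValueError(msg)
    if com is not None:
        return 1.0 / (1.0 + com)  # centre of mass
    if span is not None:
        return 2.0 / (span + 1.0)
    if half_life is not None:
        return 1.0 - math.exp(-math.log(2.0) / half_life)
    assert alpha is not None
    return alpha
```

For instance, `com=1`, `span=3`, `half_life=1` and `alpha=0.5` all describe the same decay, so one test per parameter can reuse the same expected values.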

Comment on lines 58 to 67
    if adjust:
        expected = {
            "a": [1.0, 1.0, 1.5714285714285714],
            "b": [1.0, 1.6666666666666667, 2.4285714285714284],
        }
    else:
        expected = {
            "a": [1.0, 1.0, 1.5],
            "b": [1.0, 1.5, 2.25],
        }
Member

same
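
A sketch of what parametrising over the expected values could look like for the adjust case, so that the test body has no if/else. The `ewm_mean` function here is a hypothetical pure-Python stand-in for the expression under test, and com=1 is inferred from the expected numbers in the snippet rather than stated in it.

```python
from __future__ import annotations

import math

import pytest


def ewm_mean(values: list[float], com: float, *, adjust: bool) -> list[float]:
    """Pure-Python stand-in for the expression under test (no nulls)."""
    alpha = 1.0 / (1.0 + com)
    out: list[float] = []
    if adjust:
        # Weighted average with weights (1 - alpha) ** k over all past values.
        num = den = 0.0
        for v in values:
            num = v + (1.0 - alpha) * num
            den = 1.0 + (1.0 - alpha) * den
            out.append(num / den)
    else:
        # Recursive form: y_t = (1 - alpha) * y_{t-1} + alpha * x_t.
        prev: float | None = None
        for v in values:
            prev = v if prev is None else (1.0 - alpha) * prev + alpha * v
            out.append(prev)
    return out


@pytest.mark.parametrize(
    ("adjust", "expected"),
    [
        (
            True,
            {
                "a": [1.0, 1.0, 1.5714285714285714],
                "b": [1.0, 1.6666666666666667, 2.4285714285714284],
            },
        ),
        (False, {"a": [1.0, 1.0, 1.5], "b": [1.0, 1.5, 2.25]}),
    ],
)
def test_ewm_mean_adjust(adjust: bool, expected: dict[str, list[float]]) -> None:
    data = {"a": [1.0, 1.0, 2.0], "b": [1.0, 2.0, 3.0]}
    for name, column in data.items():
        result = ewm_mean(column, com=1, adjust=adjust)
        assert len(result) == len(expected[name])
        assert all(math.isclose(r, e) for r, e in zip(result, expected[name]))
```

Each (adjust, expected) pair becomes its own test case, so a failure report points directly at the variant that broke.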

@DeaMariaLeon
Member Author

I added tests for the parameters.

@DeaMariaLeon
Member Author

Added the NarwhalsUnstableWarning 😇

Member

@MarcoGorelli MarcoGorelli left a comment

awesome, thanks @DeaMariaLeon !

I just made some minor edits based on #1401

I think

calculating ewm over a sequence with null values leads to distinct results

isn't quite exact, because the result is the same if we consider that pandas' null value is 'nan' and Polars' null value is None

The difference is just that Polars (and PyArrow, and I think all other libraries) treat 'nan' as just another floating point number (https://en.wikipedia.org/wiki/IEEE_754), and it's generally rare to encounter 'nan' in those libraries

If we initialise a Series with [None, 3.5, float('nan')], then pandas treats it as [null, 3.5, null], whereas for other libraries it's [null, 3.5, nan] - but it's quite rare to initialise a Series from a list like this with both None and 'nan', you'd make a Series from some data source (e.g. a file) and then each library would encode missing values according to its own definition of missing values
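
The two notions of "missing" described above can be sketched in plain Python, without either library. The function names `is_missing_pandas_style` and `is_missing_null_only` are hypothetical labels for the two conventions, not library APIs.

```python
from __future__ import annotations

import math


def is_missing_pandas_style(value: float | None) -> bool:
    """Treat both None and NaN as missing, as described for pandas above."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def is_missing_null_only(value: float | None) -> bool:
    """Treat only None as missing; NaN stays an ordinary IEEE 754 float."""
    return value is None


values = [None, 3.5, float("nan")]
print([is_missing_pandas_style(v) for v in values])  # [True, False, True]
print([is_missing_null_only(v) for v in values])  # [True, False, False]
```

The first line corresponds to [null, 3.5, null] and the second to [null, 3.5, nan].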

Sorry if this explanation is too long or pedantic 😄

@MarcoGorelli MarcoGorelli merged commit d7c1d4f into narwhals-dev:main Nov 19, 2024
22 checks passed