feat(python): Improve `df.corr`, add "spearman" method and row labels, align with `pl.corr` #20967

Julian-J-S · 2025-01-29T10:00:09Z

fix #20957
fix #16864
fix #14457
fix #16832

👍

remove numpy dependency -> use pl.corr instead
align parameters with pl.corr
add "spearman" method
add parameter to optionally show column with row names

👎

Muuuuch slower: 10-1000x depending on number of columns
- pl.corr is already about 10x slower than numpy
- no parallelization
- calculation could be more efficient
However, IMO correlation analysis is mostly an exploration tool and for that and many normal use cases it is fast enough.
If a user wants more speed they can still use numpy instead but must to that explicitly

Julian-J-S · 2025-01-29T10:44:22Z

pretty sure the failed checks have nothing to do with my changes 😄

ritchie46 · 2025-02-01T15:03:45Z

py-polars/polars/dataframe/frame.py

+        │ abc   ┆ -0.5 ┆ 0.5  ┆ -0.5 ┆ 1.0  │
+        └───────┴──────┴──────┴──────┴──────┘
+        """
+        df = pl.DataFrame(


There is a missed optimization here. corr(a, b) == cor(b, a). We should exploit that symmetry. This should be done on the rust side and should be parallel.

yes, I agree.

However, I looked into this and I am stuck with cyclic import problems (rust skill issue probably :D).

impl DataFrame -> polars-core

corr, covariance, pearson_corr, spearman_rank_corr -> polars-plan

cov, pearson_corr (actually implementation) -> polars-ops

There is some "bigger" refactoring required I guess. Ideas?

The PR was a "quick win" which enables corr on a DataFrame which is comparatively slow but still fast enough for everyday use and which can be improved later on.

For the rust implementation and refactoring I am probably not the best to implement or would need some help 😇

alexander-beedie · 2025-02-10T08:52:56Z

I'm slightly alarmed at the magnitude of the potential slowdown - could you clarify when/why it can get that much slower? Is it always going to slow down by 1-3 orders of magnitude? Would it be a good idea to add some docs that point people back to numpy where performance matters? (There might be an outright revolt over here if corr slows down by that much, as research are constantly looking at correlation values 🤣)

ritchie46 · 2025-02-10T09:07:52Z

It does double the work (doesn't exploit) symmetry. Numpy also probably goes into lablas/lapack for this, which I believe is cache optimal and parallel. We should first exploit the symmetry and run in parallel. I will close this, as that will be a totally different PR.

feat(python): Improve df.corr, add "spearman" method and row labels

099b18a

Julian-J-S requested review from ritchie46, c-peters, alexander-beedie, MarcoGorelli and reswqa as code owners January 29, 2025 10:00

github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars labels Jan 29, 2025

ritchie46 requested changes Feb 1, 2025

View reviewed changes

ritchie46 closed this Feb 10, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(python): Improve `df.corr`, add "spearman" method and row labels, align with `pl.corr` #20967

feat(python): Improve `df.corr`, add "spearman" method and row labels, align with `pl.corr` #20967

Julian-J-S commented Jan 29, 2025

Julian-J-S commented Jan 29, 2025

ritchie46 Feb 1, 2025

Julian-J-S Feb 2, 2025

alexander-beedie commented Feb 10, 2025 •

edited

Loading

ritchie46 commented Feb 10, 2025

feat(python): Improve df.corr, add "spearman" method and row labels, align with pl.corr #20967

feat(python): Improve df.corr, add "spearman" method and row labels, align with pl.corr #20967

Conversation

Julian-J-S commented Jan 29, 2025

👍

👎

Julian-J-S commented Jan 29, 2025

ritchie46 Feb 1, 2025

Choose a reason for hiding this comment

Julian-J-S Feb 2, 2025

Choose a reason for hiding this comment

alexander-beedie commented Feb 10, 2025 • edited Loading

ritchie46 commented Feb 10, 2025

feat(python): Improve `df.corr`, add "spearman" method and row labels, align with `pl.corr` #20967

feat(python): Improve `df.corr`, add "spearman" method and row labels, align with `pl.corr` #20967

alexander-beedie commented Feb 10, 2025 •

edited

Loading