Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(python): Improve df.corr, add "spearman" method and row labels, align with pl.corr #20967

Closed
wants to merge 1 commit into from

Conversation

Julian-J-S
Copy link
Contributor

fix #20957
fix #16864
fix #14457
fix #16832

👍

  • remove numpy dependency -> use pl.corr instead
  • align parameters with pl.corr
  • add "spearman" method
  • add parameter to optionally show column with row names

👎

  • Muuuuch slower: 10-1000x depending on number of columns
    • pl.corr is already about 10x slower than numpy
    • no parallelization
    • calculation could be more efficient
  • However, IMO correlation analysis is mostly an exploration tool and for that and many normal use cases it is fast enough.
  • If a user wants more speed they can still use numpy instead but must to that explicitly

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars labels Jan 29, 2025
@Julian-J-S
Copy link
Contributor Author

pretty sure the failed checks have nothing to do with my changes 😄

│ abc ┆ -0.5 ┆ 0.5 ┆ -0.5 ┆ 1.0 │
└───────┴──────┴──────┴──────┴──────┘
"""
df = pl.DataFrame(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a missed optimization here. corr(a, b) == cor(b, a). We should exploit that symmetry. This should be done on the rust side and should be parallel.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I agree.

However, I looked into this and I am stuck with cyclic import problems (rust skill issue probably :D).

  • impl DataFrame -> polars-core
  • corr, covariance, pearson_corr, spearman_rank_corr -> polars-plan
  • cov, pearson_corr (actually implementation) -> polars-ops

There is some "bigger" refactoring required I guess. Ideas?

The PR was a "quick win" which enables corr on a DataFrame which is comparatively slow but still fast enough for everyday use and which can be improved later on.

For the rust implementation and refactoring I am probably not the best to implement or would need some help 😇

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Feb 10, 2025

I'm slightly alarmed at the magnitude of the potential slowdown - could you clarify when/why it can get that much slower? Is it always going to slow down by 1-3 orders of magnitude? Would it be a good idea to add some docs that point people back to numpy where performance matters? (There might be an outright revolt over here if corr slows down by that much, as research are constantly looking at correlation values 🤣)

@ritchie46
Copy link
Member

It does double the work (doesn't exploit) symmetry. Numpy also probably goes into lablas/lapack for this, which I believe is cache optimal and parallel. We should first exploit the symmetry and run in parallel. I will close this, as that will be a totally different PR.

@ritchie46 ritchie46 closed this Feb 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature python Related to Python Polars
Projects
None yet
3 participants