[Enh]: Support polars.Expr.rank #1323

Open
adamblake opened this issue Nov 5, 2024 · 4 comments
Labels
accepted enhancement New feature or request

Comments

adamblake commented Nov 5, 2024

We would like to learn about your use case. For example, if this feature is needed to adopt Narwhals in an open source project, could you please enter the link to it below?

I am abstracting a library for computing teaching metrics so that researchers can use their data processing library of choice. Narwhals seems like a good bet (also shout-out to @mikeckennedy for having you on the podcast!). I can't share the specific repository because it contains internal scripts, but this would be supporting CourseKata, a low-cost textbook platform dedicated to continuous improvement based on learning science principles.

Please describe the purpose of the new feature or describe the problem to solve.

I would like support for the polars.Expr.rank method. One example of how it could be used is to count how often an instructor teaches, given some grouping variable (window). In Polars it might look like this:

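import polars as pl

# df: a pl.DataFrame with "instructor_id" and "academic_year" columns
df.sort("academic_year").with_columns(
  # dense rank of academic_year within each instructor
  years_taught=pl.col("academic_year")
    .rank(method="dense")
    .over("instructor_id")
)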

This would window over instructor_id and rank each row by academic_year. Essentially, each row gets a running count of the distinct academic years the instructor has taught in, up to and including that row's year. Because we are using the "dense" ranking, teaching multiple classes in a year counts as a single year taught.

Suggest a solution if possible.

No response

If you have tried alternatives, please describe them below.

I could probably achieve this by making an intermediate data frame where I filter down academic_year using unique(), and then make some kind of counter variable based on instructor_id, and then join() that back to the initial table.
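
A rough sketch of that workaround, written in plain pandas as one possible reading of "unique, counter, join". The sample rows and the years_taught column name are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample data: one row per class taught.
df = pd.DataFrame(
    {
        "instructor_id": ["a", "a", "a", "b", "b"],
        "academic_year": [2021, 2021, 2022, 2020, 2022],
    }
)

# Step 1: keep one row per (instructor, year), ordered by year.
# Step 2: number the years within each instructor, starting at 1.
distinct = (
    df[["instructor_id", "academic_year"]]
    .drop_duplicates()
    .sort_values(["instructor_id", "academic_year"])
    .assign(years_taught=lambda d: d.groupby("instructor_id").cumcount() + 1)
)

# Step 3: join the counter back onto the original rows.
result = df.merge(distinct, on=["instructor_id", "academic_year"], how="left")
print(result)
```

This gives the same values as a dense rank per instructor, at the cost of an intermediate table and a join.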

Instead I would rather just go back to using Polars until this feature is supported (if it is on your roadmap!).

Additional information that may help us understand your needs.

No response

@FBruzzesi FBruzzesi added the enhancement New feature or request label Nov 5, 2024
Member

FBruzzesi commented Nov 5, 2024

Hey @adamblake, thanks for the feature request. This is definitely in scope 👌 We are currently finalizing an integration, but we will soon get back to expanding the API 😁

@FBruzzesi FBruzzesi self-assigned this Nov 6, 2024
@mikeckennedy

also shout-out to @mikeckennedy for having you on the podcast!

Thanks @adamblake.

@FBruzzesi
Member

Hey @adamblake, I started to take a look. For context, I would like to mention that we will be able to fully support rank for pandas and polars, while for pyarrow there could be some shortcomings. Namely:

  • the polars default, method="average", is the only ranking method not supported in pyarrow
  • pyarrow TableGroupBy.aggregate does not support ranking in any form. I see in your example that you would like to use rank in an over context, which for pandas and pyarrow is equivalent to performing a group by and join. Therefore rank inside over won't be supported for pyarrow.
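
As a rough illustration of the group-by point on the pandas side, here is a sketch in plain pandas (not the Narwhals implementation, which is not shown in this thread). The sample rows are hypothetical. A grouped rank in pandas already returns one value per input row, aligned to the original index, so it can be assigned straight back as a column.

```python
import pandas as pd

# Hypothetical sample data: one row per class taught.
df = pd.DataFrame(
    {
        "instructor_id": ["a", "a", "a", "b", "b"],
        "academic_year": [2021, 2021, 2022, 2020, 2022],
    }
)

# Dense rank of academic_year within each instructor_id group.
# pandas returns floats from rank, so cast to int for a count-like column.
ranks = (
    df.groupby("instructor_id")["academic_year"]
    .rank(method="dense")
    .astype(int)
)

result = df.assign(years_taught=ranks)
print(result)
```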

@adamblake
Author

@FBruzzesi thanks for the context. We use polars / pandas so this would be great for us
