Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[data] Default value of key in sort(key=None) raises IndexError #48926

Open
wingkitlee0 opened this issue Nov 26, 2024 · 6 comments · May be fixed by #48969
Open

[data] Default value of key in sort(key=None) raises IndexError #48926

wingkitlee0 opened this issue Nov 26, 2024 · 6 comments · May be fixed by #48969
Labels
bug Something that is supposed to be working; but isn't data Ray Data-related issues triage Needs triage (eg: priority, bug/not-bug, and owning component)

Comments

@wingkitlee0
Copy link
Contributor

wingkitlee0 commented Nov 26, 2024

What happened + What you expected to happen

Dataset.sort() has a default key value of None, which should mean it sorts all columns. But it raises IndexError

IndexError: list index out of range

Also, the doc does not say what None does.

Versions / Dependencies

In [29]: ray.__version__
Out[29]: '2.39.0'

In [31]: np.__version__
Out[31]: '1.26.2'

Reproduction script

In [1]: import ray

In [2]: import pandas as pd

In [3]: df = pd.DataFrame([[1, 2], [4, 5], [6,7]], columns=["x", "y"])

In [4]: ds = ray.data.from_pandas(df)
2024-11-27 15:42:44,110 INFO worker.py:1807 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265

In [5]: ds.sort().take_all()

It also does not work for sort(None)

Issue Severity

Medium: It is a significant difficulty but I can work around it.

@wingkitlee0 wingkitlee0 added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Nov 26, 2024
@jcotant1 jcotant1 added the data Ray Data-related issues label Nov 26, 2024
@Superskyyy
Copy link
Contributor

Superskyyy commented Nov 27, 2024

Can reproduce this behavior and I think current API doc didn't specify the exact behavior under None. I think its more reasonable to follow Pandas way of sort_values(by=None) which will simply raise an error, its more explicit to the pythonic way.
@wingkitlee0
sorted_df = df.sort_values(by=None)

@Superskyyy Superskyyy linked a pull request Nov 27, 2024 that will close this issue
7 tasks
@wingkitlee0
Copy link
Contributor Author

Thanks for the quick PR, but I haven't thought through raising error for None yet.

I think the option was meant to sort all columns.

At least one other place uses this "conceptually" is groupby(None), which allow grouping all columns.

I am also curious whether it was working before or not.

@Superskyyy
Copy link
Contributor

Superskyyy commented Nov 27, 2024

Thanks for the quick PR, but I haven't thought through raising error for None yet.

I think the option was meant to sort all columns.

At least one other place uses this "conceptually" is groupby(None), which allow grouping all columns.

I am also curious whether it was working before or not.

The Ray Data groupby docs seems to be broken in formatting https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.groupby.html, it doesn't clarify how it works.
image

But I assume that grouping all columns meaning grouping everything together right, essentially === doing nothing ? If we apply the same idea to sort, then it also means we do nothing when it is set to None, wdyt? Both are explicit instead of implicitly grouping/sorting based on some hidden factors. It might be counterintuitive for Pandas users if we do the opposite behavior against Pandas.

@wingkitlee0
Copy link
Contributor Author

After reading a little bit in python/ray/data/grouped_data.py (and dataset.py), I think key=None in groupby was meant to share the aggregate functions internally. Also, no sort(key=None) is called. Maybe None is unreasonable in sort() as you said.

Let's ping the ray team after the thanksgiving week..

@richardliaw
Copy link
Contributor

richardliaw commented Nov 27, 2024

Hey @wingkitlee0, I think at this point we don't support sort(None), and probably don't need to support it (for example, Pyarrow forces an explicit parameter, Pandas as well)

@wingkitlee0
Copy link
Contributor Author

Okay, it sounds like it was never working.

we should remove the default value of None then? (and raise error, update doc, etc...)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something that is supposed to be working; but isn't data Ray Data-related issues triage Needs triage (eg: priority, bug/not-bug, and owning component)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants