Raise on out of bounds range access #1061

Open
wants to merge 8 commits into
base: main
Changes from 6 commits
19 changes: 18 additions & 1 deletion lib/explorer/shared.ex
@@ -207,7 +207,24 @@ defmodule Explorer.Shared do
     names
   end

-  def to_existing_columns(%{names: names}, %Range{} = columns, _raise?) do
+  def to_existing_columns(%{names: names} = df, first..last//step = columns, raise?) do
+    if raise? do
+      n_cols = Explorer.DataFrame.n_columns(df)
+
+      # With `Enum.slice/2`, negative indices are counted from the end.
+      [slice_min, slice_pseudo_max] =
+        [first, last]
+        |> Enum.map(&if(&1 < 0, do: n_cols + &1, else: &1))
+        |> Enum.sort()
Member

Because ranges can be reverse ordered, I am not sure how much we want to rely on sort. My suggestion would be to validate first and last independently:

for pos <- [first, last] do
  cond do
    pos >= 0 and pos < n_cols -> :ok
    pos < 0 and pos >= -n_cols -> :ok
    true -> raise "..."
  end
end

WDYT?
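
As a rough illustration of the suggestion above, the per-endpoint check can be wrapped in a standalone function. The module name `RangeBoundsSketch`, the function `validate_endpoints!/2`, and the error message are hypothetical and not part of Explorer; this is a sketch under those assumptions, not the PR's implementation.

```elixir
defmodule RangeBoundsSketch do
  # Hypothetical helper (not Explorer's API): validates each endpoint on its
  # own, so the direction and step of the range never enter the check.
  def validate_endpoints!(first..last//_step = range, n_cols) do
    for pos <- [first, last] do
      cond do
        pos >= 0 and pos < n_cols -> :ok
        pos < 0 and pos >= -n_cols -> :ok
        true -> raise ArgumentError, "range #{inspect(range)} is out of bounds for #{n_cols} column(s)"
      end
    end

    :ok
  end
end

# Both endpoints are valid positions in a 3-column frame.
RangeBoundsSketch.validate_endpoints!(0..2//1, 3)
RangeBoundsSketch.validate_endpoints!(-3..-1//1, 3)
```

Because only `first` and `last` are inspected, a range such as `0..3//1` raises for three columns even if a larger step would skip index 3.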

Member Author

Enum.slice/2 says it doesn't support reverse ordered ranges yet. Do you want to support them here in anticipation of their future support in Elixir 2.0?

Member

Ah, good point. If the code above works (big if), then it should be simpler and more future proof. Worth giving it a try?

Member Author

Sure! Unfortunately though, I think your algorithm doesn't work for the following situation:

df = DF.new(a: [1, 2, 3], b: ["a", "b", "c"], c: [4.0, 5.1, 6.2])

n_cols = DF.n_columns(df) #=> 3
slice = 1..10//10

assert slice.last > n_cols
assert DF.to_columns(df[slice]) == %{"b" => ["a", "b", "c"]}

Here slice selects a valid subset of the columns: even though slice.last > n_cols, Enum.to_list(slice) == [1]. Of course, this only happens when step > 1.

We can still try and work out a future proof approach though.
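
The stepped-range behaviour described above can be checked with plain Elixir, without Explorer. The snippet is a minimal sketch that uses only `Enum.to_list/1` and `Range.size/1`; the variable names are illustrative.

```elixir
n_cols = 3
slice = 1..10//10

# The declared end of the range is past the last column index...
slice.last > n_cols
# ...but the range only yields its first element, because 1 + 10 overshoots 10.
Enum.to_list(slice)
Range.size(slice)
```

So a check based on `slice.last` alone would reject a range whose elements all lie within bounds.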

Member

I think you can adjust the last. For a range start..finish//step, the last element actually included can be calculated using this formula:
start + (div(finish - start, step) * step)
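
A small sketch of that formula as a function follows. `RangeLastSketch.effective_last/1` is a hypothetical name, and the result is only meaningful for non-empty ranges: for an empty range such as `10..1//1` the arithmetic still produces a number, so a caller would presumably check `Range.size/1` first.

```elixir
defmodule RangeLastSketch do
  # Hypothetical helper: the last element that a non-empty stepped range
  # actually yields. `div/2` truncates toward zero.
  def effective_last(start..finish//step) do
    start + div(finish - start, step) * step
  end
end

RangeLastSketch.effective_last(1..10//10)
RangeLastSketch.effective_last(1..10//3)
RangeLastSketch.effective_last(0..9//4)
```

For each of these non-empty ranges the result matches `List.last(Enum.to_list(range))`.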

Member Author

> it may be best to leave this until we add a slice! to Elixir? 🤔

I think I'd prefer to merge this if possible. The semantics of out-of-bounds are currently different between Nx and Explorer. This PR attempts to bring them in line.

> The other one is an empty range altogether. And then there are ranges that raise when out of index.

Sorry if I'm not understanding. To my eyes, 10..0//1 is both empty on its own and out of bounds for a 2-column dataframe and so should raise. Can you clarify what you think should happen with these two cases?

df = DF.new(a: [1], b: [2]) # 2 columns

df[1..0//1] # empty, but within bounds
df[9..0//1] # empty, but out of bounds

I think df[1..0//1] should return an empty dataframe and df[9..0//1] should raise. The other option is that all empty ranges (post normalizing) should return an empty dataframe regardless of bounds.

I admit this is a weird case and I don't feel strongly.
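
For reference, plain Elixir treats both of these ranges the same way. The snippet is a sketch using a list of names in place of a dataframe's columns; `names` is an illustrative variable.

```elixir
names = ["a", "b"]

# Both ranges are empty on their own: first > last with a positive step.
Range.size(1..0//1)
Range.size(9..0//1)

# Enum.slice/2 returns an empty list for both, whether or not the first
# endpoint is a valid index.
Enum.slice(names, 1..0//1)
Enum.slice(names, 9..0//1)
```

Any distinction between "empty but within bounds" and "empty but out of bounds" therefore has to come from an explicit endpoint check rather than from `Enum.slice/2` itself.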

Member

I think that if 1..10//10 should not raise for being out of bounds because 10 is not actually included, then neither should 10..1//1, for the same reason?

I agree that it doesn’t really matter which, but I think for those cases it should be consistent. I think always checking first and last, regardless of step, is the simplest way to go about it?

Member Author

> I think that if 1..10//10 should not raise for being out of bounds because 10 is not actually included, then neither should 10..1//1, for the same reason?

Ah great point. It's like whack-a-mole with these cases!

> I think always checking first and last, regardless of step, is the simplest way to go about it?

I think you're right. I'll go ahead and do that.

The results will differ slightly from Enum.slice/2. But we need to do what makes sense for us.
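
To make the difference concrete, here is how `Enum.slice/2` behaves on its own with ranges that run past the end of a list. This is a minimal sketch with an illustrative `names` list standing in for the column names.

```elixir
names = ["a", "b", "c"]

# Enum.slice/2 silently clamps an out-of-range end...
Enum.slice(names, 0..100)
# ...and returns an empty list when the start is past the end.
Enum.slice(names, 100..200)
```

A strict check on `first` and `last` would raise for both of these ranges instead.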

Member

Probably best to think of slice as an implementation detail. I think my proposed implementation should work for checking indexes too!

Member Author

> Probably best to think of slice as an implementation detail.

That makes sense!

> I think my proposed implementation should work for checking indexes too!

It did! Hope it's ok that I played a little code golf with it. I realized yours was equivalent to:

if max(abs(first), abs(last)) >= n_cols do
  raise ...
end
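
One way to compare the `cond`-based check with the `abs`-based one is to enumerate positions around the bounds. The module and function names below are hypothetical; the sketch only restates the two predicates from this thread side by side.

```elixir
defmodule BoundsCompareSketch do
  # The cond-based check from earlier in the thread, as a boolean.
  def cond_ok?(pos, n_cols) do
    (pos >= 0 and pos < n_cols) or (pos < 0 and pos >= -n_cols)
  end

  # The abs-based check: a position is accepted when abs(pos) < n_cols.
  def abs_ok?(pos, n_cols), do: abs(pos) < n_cols

  # Positions near the bounds where the two predicates disagree.
  def disagreements(n_cols) do
    for pos <- -(n_cols + 2)..(n_cols + 2)//1,
        cond_ok?(pos, n_cols) != abs_ok?(pos, n_cols),
        do: pos
  end
end

BoundsCompareSketch.disagreements(3)
```

The two predicates agree everywhere except at `pos == -n_cols`, which the `cond` version accepts and the `abs` version rejects. That position matters for a range like `-3..-1` on a three-column dataframe, so it may be worth a dedicated test.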


+
+      slice_max = slice_min + step * (Range.size(slice_min..slice_pseudo_max//step) - 1)
+
+      if slice_min < 0 or slice_max >= n_cols do
+        raise ArgumentError,
+              "range #{inspect(columns)} is out of bounds for a dataframe with #{n_cols} column(s)"
+      end
+    end
+
     Enum.slice(names, columns)
   end

4 changes: 3 additions & 1 deletion test/explorer/data_frame/lazy_test.exs
@@ -901,7 +901,9 @@ defmodule Explorer.DataFrame.LazyTest do
       df3 = DF.distinct(df, ..)
       assert DF.names(df3) == DF.names(df)

-      assert df == DF.distinct(df, 100..200)
+      assert_raise ArgumentError,
+                   "range 100..200 is out of bounds for a dataframe with 10 column(s)",
+                   fn -> DF.distinct(df, 100..200) end
     end
   end

18 changes: 14 additions & 4 deletions test/explorer/data_frame_test.exs
@@ -2690,6 +2690,7 @@ defmodule Explorer.DataFrameTest do
     assert DF.to_columns(df[[:a, :c]]) == %{"a" => [1, 2, 3], "c" => [4.0, 5.1, 6.2]}
     assert DF.to_columns(df[0..-2//1]) == %{"a" => [1, 2, 3], "b" => ["a", "b", "c"]}
     assert DF.to_columns(df[-3..-1]) == DF.to_columns(df)
+    assert DF.to_columns(df[1..3//3]) == %{"b" => ["a", "b", "c"]}
     assert DF.to_columns(df[..]) == DF.to_columns(df)

     assert %Series{} = s1 = df[0]
@@ -2715,7 +2716,13 @@ defmodule Explorer.DataFrameTest do
                  ~r"could not find column name \"class\"",
                  fn -> df[:class] end

-    assert DF.to_columns(df[0..100]) == DF.to_columns(df)
+    assert_raise ArgumentError,
+                 "range 0..3 is out of bounds for a dataframe with 3 column(s)",
+                 fn -> DF.to_columns(df[0..3]) end
+
+    assert_raise ArgumentError,
+                 "range 0..-4//1 is out of bounds for a dataframe with 3 column(s)",
+                 fn -> DF.to_columns(df[0..-4//1]) end
   end

   test "pop/2" do
@@ -2961,7 +2968,9 @@ defmodule Explorer.DataFrameTest do
       df3 = DF.distinct(df, ..)
       assert DF.names(df3) == DF.names(df)

-      assert df == DF.distinct(df, 100..200)
+      assert_raise ArgumentError,
+                   "range 100..200 is out of bounds for a dataframe with 10 column(s)",
+                   fn -> DF.drop_nil(df, 100..200) end
     end
   end

@@ -2983,8 +2992,9 @@ defmodule Explorer.DataFrameTest do
                  fn -> DF.drop_nil(df, [3, 4, 5]) end

     # It takes the slice of columns in the range
-    df4 = DF.drop_nil(df, 0..200)
-    assert DF.to_columns(df4) == %{"a" => [1], "b" => [1]}
+    assert_raise ArgumentError,
+                 "range 0..200 is out of bounds for a dataframe with 2 column(s)",
+                 fn -> DF.drop_nil(df, 0..200) end
   end

   describe "relocate/3" do