
TST (string dtype): clean-up assorted xfails #60345

Merged
9 changes: 2 additions & 7 deletions pandas/tests/base/test_conversion.py
@@ -1,8 +1,6 @@
import numpy as np
import pytest

from pandas._config import using_string_dtype

from pandas.compat import HAS_PYARROW
from pandas.compat.numpy import np_version_gt2

@@ -392,9 +390,6 @@ def test_to_numpy(arr, expected, zero_copy, index_or_series_or_array):
assert np.may_share_memory(result_nocopy1, result_nocopy2)


@pytest.mark.xfail(
using_string_dtype() and not HAS_PYARROW, reason="TODO(infer_string)", strict=False
)
@pytest.mark.parametrize("as_series", [True, False])
@pytest.mark.parametrize(
"arr", [np.array([1, 2, 3], dtype="int64"), np.array(["a", "b", "c"], dtype=object)]
@@ -406,13 +401,13 @@ def test_to_numpy_copy(arr, as_series, using_infer_string):

# no copy by default
result = obj.to_numpy()
if using_infer_string and arr.dtype == object:
if using_infer_string and arr.dtype == object and obj.dtype.storage == "pyarrow":
assert np.shares_memory(arr, result) is False
else:
assert np.shares_memory(arr, result) is True

result = obj.to_numpy(copy=False)
if using_infer_string and arr.dtype == object:
if using_infer_string and arr.dtype == object and obj.dtype.storage == "pyarrow":
Member:

Why, if a user specifically requests not to make a copy, are we converting the numpy array to a pyarrow-backed string array for what is an immutable index? Are there other performance benefits?

Member Author:

It's the other way around: the pyarrow array (stored under the hood in obj) is being converted to a numpy array, and that can never be done without a copy (different memory layout).
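A minimal sketch of the behaviour being tested, assuming the future.infer_string option is enabled and pyarrow is installed (so the Index is stored pyarrow-backed); variable names are illustrative:

```python
import numpy as np
import pandas as pd

pd.options.future.infer_string = True

arr = np.array(["a", "b", "c"], dtype=object)
obj = pd.Index(arr, copy=False)      # stored under the hood as a pyarrow-backed string array

# Converting back to numpy cannot reuse the pyarrow buffers (different memory
# layout), so even copy=False produces a new array.
result = obj.to_numpy(copy=False)
print(np.shares_memory(arr, result))  # expected: False with pyarrow storage
```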

Member:

The first line of the test is obj = pd.Index(arr, copy=False).

So if we have a numpy arr and specify copy=False for the immutable index, we get a pyarrow-backed index and a copy is made? And then the .to_numpy() method makes another copy?

>>> pd.options.future.infer_string = True
>>> arr = np.array(["a", "b", "c"], dtype=object)
>>> arr
array(['a', 'b', 'c'], dtype=object)
>>> 
>>> idx = pd.Index(arr, copy=False)
>>> idx
Index(['a', 'b', 'c'], dtype='str')
>>> 

So the question is: should idx = pd.Index(arr, copy=False) return an Index with a string dtype, or perhaps raise like numpy now does for __array__ when a copy can't be made? Or is this a moot point once CoW is extended to Indexes, as the copy keyword would then be irrelevant?

Member Author:

Ah, sorry, I assumed you were commenting on what is being tested here, i.e. the obj.to_numpy(copy=False) copying or not.

For pd.Index(arr, copy=False): in general our copy keywords in constructors are not strict, but only avoid a copy on a "best effort" basis (e.g. if you pass a Python list, it will make a copy regardless of that keyword). If we want to change that more generally, that's a bigger topic to discuss.
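A small illustration of that "best effort" behaviour (a sketch, not the exact internal logic; the numeric case mirrors the existing assertions in this test file):

```python
import numpy as np
import pandas as pd

# A Python list always has to be converted, so copy=False cannot be honored.
idx_from_list = pd.Index([1, 2, 3], copy=False)

# A numpy array of a directly supported dtype can be reused without copying.
arr = np.array([1, 2, 3], dtype="int64")
idx_from_arr = pd.Index(arr, copy=False)
print(np.shares_memory(arr, idx_from_arr.to_numpy(copy=False)))  # expected: True
```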

> is this a moot point when CoW is extended to the Indexes as the copy keyword would be irrelevant?

It's certainly not a moot point, because with CoW we actually copy more often on input of non-pandas objects. Although it seems we didn't make that change for Index(..), where the default is still copy=False.

assert np.shares_memory(arr, result) is False
else:
assert np.shares_memory(arr, result) is True
5 changes: 1 addition & 4 deletions pandas/tests/indexes/multi/test_setops.py
@@ -1,8 +1,6 @@
import numpy as np
import pytest

from pandas._config import using_string_dtype

import pandas as pd
from pandas import (
CategoricalIndex,
@@ -754,13 +752,12 @@ def test_intersection_keep_ea_dtypes(val, any_numeric_ea_dtype):
tm.assert_index_equal(result, expected)


@pytest.mark.xfail(using_string_dtype(), reason="TODO(infer_string)")
def test_union_with_na_when_constructing_dataframe():
# GH43222
series1 = Series(
(1,),
index=MultiIndex.from_arrays(
[Series([None], dtype="string"), Series([None], dtype="string")]
[Series([None], dtype="str"), Series([None], dtype="str")]
Member:

So this does fix the test. But is the behavior correct if this change to the test is not made?

i.e. the series1.index.dtypes are object and the series2.index.dtypes are str, and the resulting dtype for the columns index from the DataFrame constructor is object. Would we not expect the DataFrame constructor to return a str index for the columns in this case?

Member:

perhaps related to #60338?

Member Author:

> i.e. the series1.index.dtypes are object and the series2.index.dtypes are str and the resulting dtype for the columns index using the DataFrame constructor is object.

With the above fix (and when infer_string is enabled), the test uses str dtype for the index levels of both series1 and series2, and the expected result is created with that dtype as well.

So it's only testing that the NaNs are properly matched when creating the rows from Series objects with a MultiIndex; it does not test having different dtypes in series1 vs series2.

That's also something we could test, though. If you have an object dtype index and a str dtype index, one would expect the result to be an object dtype index (since that is the "common" dtype), but that was not how the test was set up.
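A hypothetical sketch of what such a mixed-dtype test could look like (not part of this PR; the names and the expected object-dtype outcome follow the comment above rather than an asserted behaviour, and it assumes a pandas version where "str" resolves to the new string dtype):

```python
import pandas as pd
from pandas import MultiIndex, Series

# Index levels with different dtypes: object for series1, str for series2.
series1 = Series(
    (1,),
    index=MultiIndex.from_arrays(
        [Series([None], dtype=object), Series([None], dtype=object)]
    ),
)
series2 = Series(
    (10, 20),
    index=MultiIndex.from_arrays(
        [Series([None, "a"], dtype="str"), Series([None, "b"], dtype="str")]
    ),
)

result = pd.DataFrame([series1, series2])
# Per the comment, one would expect the "common" dtype, i.e. object-dtype
# levels for the resulting columns MultiIndex.
print(result.columns.dtypes)
```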

> perhaps related to #60338?

I suppose not, because here we don't actually have empty objects. Both series1 and series2 have data and a non-empty index.

),
)
series2 = Series((10, 20), index=MultiIndex.from_tuples(((None, None), ("a", "b"))))
12 changes: 1 addition & 11 deletions pandas/tests/indexes/test_base.py
@@ -8,12 +8,7 @@
import numpy as np
import pytest

from pandas._config import using_string_dtype

from pandas.compat import (
HAS_PYARROW,
IS64,
)
from pandas.compat import IS64
from pandas.errors import InvalidIndexError
import pandas.util._test_decorators as td

@@ -823,11 +818,6 @@ def test_isin(self, values, index, expected):
expected = np.array(expected, dtype=bool)
tm.assert_numpy_array_equal(result, expected)

@pytest.mark.xfail(
using_string_dtype() and not HAS_PYARROW,
reason="TODO(infer_string)",
strict=False,
)
def test_isin_nan_common_object(
self, nulls_fixture, nulls_fixture2, using_infer_string
):
3 changes: 0 additions & 3 deletions pandas/tests/io/excel/test_readers.py
@@ -17,8 +17,6 @@
import numpy as np
import pytest

from pandas._config import using_string_dtype

import pandas.util._test_decorators as td

import pandas as pd
@@ -625,7 +623,6 @@ def test_reader_dtype_str(self, read_ext, dtype, expected):
expected = DataFrame(expected)
tm.assert_frame_equal(actual, expected)

@pytest.mark.xfail(using_string_dtype(), reason="TODO(infer_string)", strict=False)
def test_dtype_backend(self, read_ext, dtype_backend, engine, tmp_excel):
# GH#36712
if read_ext in (".xlsb", ".xls"):
5 changes: 1 addition & 4 deletions pandas/tests/io/excel/test_writers.py
@@ -13,8 +13,6 @@
import numpy as np
import pytest

from pandas._config import using_string_dtype

from pandas.compat._optional import import_optional_dependency
import pandas.util._test_decorators as td

@@ -1387,12 +1385,11 @@ def test_freeze_panes(self, tmp_excel):
result = pd.read_excel(tmp_excel, index_col=0)
tm.assert_frame_equal(result, expected)

@pytest.mark.xfail(using_string_dtype(), reason="TODO(infer_string)")
def test_path_path_lib(self, engine, ext):
df = DataFrame(
1.1 * np.arange(120).reshape((30, 4)),
columns=Index(list("ABCD")),
index=Index([f"i-{i}" for i in range(30)], dtype=object),
index=Index([f"i-{i}" for i in range(30)]),
)
writer = partial(df.to_excel, engine=engine)

1 change: 0 additions & 1 deletion pandas/tests/io/test_stata.py
@@ -1719,7 +1719,6 @@ def test_date_parsing_ignores_format_details(self, column, datapath):
formatted = df.loc[0, column + "_fmt"]
assert unformatted == formatted

# @pytest.mark.xfail(using_string_dtype(), reason="TODO(infer_string)")
@pytest.mark.parametrize("byteorder", ["little", "big"])
def test_writer_117(self, byteorder, temp_file, using_infer_string):
original = DataFrame(
9 changes: 5 additions & 4 deletions pandas/tests/reshape/test_union_categoricals.py
@@ -1,8 +1,6 @@
import numpy as np
import pytest

from pandas._config import using_string_dtype

from pandas.core.dtypes.concat import union_categoricals

import pandas as pd
@@ -124,12 +122,15 @@ def test_union_categoricals_nan(self):
exp = Categorical([np.nan, np.nan, np.nan, np.nan])
tm.assert_categorical_equal(res, exp)

@pytest.mark.xfail(using_string_dtype(), reason="TODO(infer_string)", strict=False)
@pytest.mark.parametrize("val", [[], ["1"]])
def test_union_categoricals_empty(self, val, request, using_infer_string):
# GH 13759
if using_infer_string and val == ["1"]:
request.applymarker(pytest.mark.xfail("object and strings dont match"))
request.applymarker(
pytest.mark.xfail(
reason="TDOD(infer_string) object and strings dont match"
Member:

There's a typo here that may mean this gets missed when grepping the TODOs.

Member Author:

Ah, good catch! Yes, I do grep for that, so good to fix that typo.

)
)
res = union_categoricals([Categorical([]), Categorical(val)])
exp = Categorical(val)
tm.assert_categorical_equal(res, exp)