TST (string dtype): clean-up assorted xfails #60345

jorisvandenbossche · 2024-11-17T09:40:50Z

A bunch of assorted xfails that were no longer needed or had a trivial fix.

lumberbot-app · 2024-11-17T12:41:19Z

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict, please backport manually. Here are approximate instructions:

Checkout backport branch and update it.

git checkout 2.3.x
git pull

Cherry pick the first parent branch of this PR on top of the older branch:

git cherry-pick -x -m1 e7d1964ab7405d54d919bb289318d01e9eb72cd1

You will likely have some merge/cherry-pick conflict here, fix them and commit:

git commit -am 'Backport PR #60345: TST (string dtype): clean-up assorted xfails'

Push to a named branch:

git push YOURFORK 2.3.x:auto-backport-of-pr-60345-on-2.3.x

Create a PR against branch 2.3.x, I would have named this PR:

"Backport PR #60345 on branch 2.3.x (TST (string dtype): clean-up assorted xfails)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

(cherry picked from commit e7d1964)

jorisvandenbossche · 2024-11-17T12:47:13Z

Manual backport -> #60349

simonjayhawkins · 2024-11-18T09:06:50Z

pandas/tests/base/test_conversion.py

        assert np.shares_memory(arr, result) is False
    else:
        assert np.shares_memory(arr, result) is True

    result = obj.to_numpy(copy=False)
-    if using_infer_string and arr.dtype == object:
+    if using_infer_string and arr.dtype == object and obj.dtype.storage == "pyarrow":


Why, if a user specifically requests not to make a copy, are we converting the numpy array to a pyarrow-backed string array for what is an immutable index? Are there other performance benefits?

It's the other way around: the pyarrow array (stored under the hood in obj) is being converted to a numpy array, and that can never be done without a copy (different memory layout).
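To illustrate the "different memory layout" point, here is a conceptual sketch in plain Python. It does not use pyarrow or pandas; the helper names `encode_arrow_style` and `decode_to_objects` are hypothetical, and null handling (Arrow's validity bitmap) is left out. It mimics a variable-length string layout (one contiguous UTF-8 buffer plus offsets) and shows that producing per-element Python objects, which an object-dtype NumPy array needs, means allocating new objects.

```python
from array import array


def encode_arrow_style(values):
    """Pack strings into one contiguous UTF-8 buffer plus an offsets buffer.

    The offsets buffer has len(values) + 1 entries; element i occupies
    data[offsets[i]:offsets[i + 1]].
    """
    offsets = array("i", [0])
    data = bytearray()
    for value in values:
        data += value.encode("utf-8")
        offsets.append(len(data))
    return offsets, bytes(data)


def decode_to_objects(offsets, data):
    """Materialise one Python str per element.

    Each str is a newly allocated object, so the result cannot alias
    the packed buffer it was decoded from.
    """
    return [
        data[offsets[i]:offsets[i + 1]].decode("utf-8")
        for i in range(len(offsets) - 1)
    ]


offsets, data = encode_arrow_style(["a", "bb", "ccc"])
decoded = decode_to_objects(offsets, data)
print(list(offsets), data, decoded)
```

An object-dtype array stores references to separate Python objects, while the packed layout stores raw bytes, which is why the conversion in this sketch has to build new values.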

The first line of the test is obj = pd.Index(arr, copy=False).

So if we have a numpy arr and specify copy=False for the immutable index, we get a pyarrow-backed index and a copy is made? And then the .to_numpy() method makes another copy?

>>> pd.options.future.infer_string = True
>>> arr = np.array(["a", "b", "c"], dtype=object)
>>> arr
array(['a', 'b', 'c'], dtype=object)
>>>
>>> idx = pd.Index(arr, copy=False)
>>> idx
Index(['a', 'b', 'c'], dtype='str')
>>>

So the question is perhaps: should idx = pd.Index(arr, copy=False) return an Index with a string dtype, or perhaps raise, like numpy now does for __array__ when a copy can't be avoided? Or is this a moot point when CoW is extended to the Indexes, as the copy keyword would be irrelevant?

Ah, sorry, I assumed you were commenting on what is being tested here, i.e. the obj.to_numpy(copy=False) copying or not.

For pd.Index(arr, copy=False): in general, the copy keywords in our constructors are not strict; they only mean avoiding a copy on a "best effort" basis (e.g. if you pass a Python list, a copy is made regardless of that keyword). If we wanted to change that more generally, that would be a bigger topic to discuss.
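A minimal sketch of that "best effort" behavior, using an int64 array rather than strings so it does not depend on the infer_string option or on pyarrow being installed. The variable names are illustrative, and the first result reflects how recent pandas versions behave as far as I know; it is not a guarantee of the constructor.

```python
import numpy as np
import pandas as pd

arr = np.array([1, 2, 3], dtype="int64")

# Same dtype: the Index can wrap the existing buffer, so copy=False can be honoured.
same_dtype = pd.Index(arr, copy=False)
shares_same = np.shares_memory(arr, same_dtype.to_numpy(copy=False))

# A dtype conversion needs new memory, so a copy happens despite copy=False.
converted = pd.Index(arr, dtype="float64", copy=False)
shares_converted = np.shares_memory(arr, converted.to_numpy(copy=False))

# A Python list has no buffer to share; the values always land in a new array.
from_list = pd.Index([1, 2, 3], copy=False)

print(shares_same, shares_converted, from_list.dtype)
```

None of these calls raises: when the copy cannot be avoided, the keyword is silently ignored.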

is this a moot point when CoW is extended to the Indexes as the copy keyword would be irrelevant?

It's certainly not a moot point, because with CoW we actually copy more often on input with non-pandas objects. It seems, though, that we didn't make that change for Index(..), where the default is still copy=False.

simonjayhawkins · 2024-11-18T10:01:21Z

pandas/tests/indexes/multi/test_setops.py

 def test_union_with_na_when_constructing_dataframe():
    # GH43222
    series1 = Series(
        (1,),
        index=MultiIndex.from_arrays(
-            [Series([None], dtype="string"), Series([None], dtype="string")]
+            [Series([None], dtype="str"), Series([None], dtype="str")]


So this does fix the test. But is the behavior correct if this change to the test is not made?

i.e. the series1.index.dtypes are object and the series2.index.dtypes are str and the resulting dtype for the columns index using the DataFrame constructor is object. Would we not expect the DataFrame constructor to return a str index for the columns in this case?

perhaps related to #60338?

i.e. the series1.index.dtypes are object and the series2.index.dtypes are str and the resulting dtype for the columns index using the DataFrame constructor is object.

With the above fix (and when infer_string is enabled), the test uses str dtype for the index levels of both series1 and series2, and then also the expected result gets created with that.

So it's only testing that the NaNs are properly matched when creating the rows from Series objects with a MultiIndex, it does not test having different dtypes in series1 vs series2.

That's also something we could test, though (and then, if you have an object dtype index and a str dtype index, one would expect the result to be an object dtype index, since that is the "common" dtype), but that was not how the test was set up.

perhaps related to #60338?

I suppose not because here we don't actually have empty objects. Both series1 and series2 have data and have a non-empty index.

simonjayhawkins · 2024-11-18T11:25:52Z

pandas/tests/reshape/test_union_categoricals.py

-            request.applymarker(pytest.mark.xfail("object and strings dont match"))
+            request.applymarker(
+                pytest.mark.xfail(
+                    reason="TDOD(infer_string) object and strings dont match"


There's a typo here that may mean this gets missed when grepping the TODOs.

Ah, good catch! Yes, I do grep for that, so good to fix that typo.
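A small sketch of why the typo matters for such a search. The marker text comes from the diff above; the helper `find_todos` and the sample lines are hypothetical stand-ins for grepping the test suite.

```python
import re

# Marker convention taken from the diff above.
TODO_PATTERN = re.compile(r"TODO\(infer_string\)")


def find_todos(lines):
    """Return (line_number, text) pairs for lines that carry the marker."""
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if TODO_PATTERN.search(line)
    ]


sample = [
    'reason="TODO(infer_string) object and strings dont match"',
    'reason="TDOD(infer_string) object and strings dont match"',
]
hits = find_todos(sample)
print(hits)
```

Only the correctly spelled marker on the first line is reported; the misspelled TDOD line is skipped.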

TST (string dtype): clean-up assorted xfails

2470d30

jorisvandenbossche added the Strings String extension data type and string data label Nov 17, 2024

jorisvandenbossche added this to the 2.3 milestone Nov 17, 2024

jorisvandenbossche merged commit e7d1964 into pandas-dev:main Nov 17, 2024
54 of 55 checks passed

jorisvandenbossche deleted the string-dtype-tests-assorted branch November 17, 2024 12:41

lumberbot-app bot added the Still Needs Manual Backport label Nov 17, 2024

jorisvandenbossche added a commit to jorisvandenbossche/pandas that referenced this pull request Nov 17, 2024

TST (string dtype): clean-up assorted xfails (pandas-dev#60345)

7e191dd

(cherry picked from commit e7d1964)

jorisvandenbossche mentioned this pull request Nov 17, 2024

[backport 2.3.x] TST (string dtype): clean-up assorted xfails (#60345) #60349

Open

jorisvandenbossche removed the Still Needs Manual Backport label Nov 17, 2024

simonjayhawkins reviewed Nov 18, 2024
