
(fix): extension array indexers #9671

Open · wants to merge 201 commits into base: main
Conversation

ilan-gold
Contributor

Identical to kmuehlbauer#1 - probably not very helpful in terms of changes since https://github.com/kmuehlbauer/xarray/tree/any-time-resolution-2 contains most of it....

kmuehlbauer and others added 30 commits October 18, 2024 07:31
…ore/variable.py to use any-precision datetime/timedelta with automatic inferring of resolution
…t resolution, fix code and tests to allow this
… more carefully, for now using pd.Series to convert `OMm` type datetimes/timedeltas (will result in ns precision)
…rray` series creating an extension array when `.array` is accessed
@ilan-gold
Contributor Author

Great @kmuehlbauer - I want the maintainers to look at the MyPy failures. I could fix them in theory, but I would basically be guessing at what they want the classes' return types to be.

) -> np.ndarray:
if dtype is None:
dtype = self.dtype
if pd.api.types.is_extension_array_dtype(dtype):
Contributor

Is this needed? Why would someone call np.array with an extension dtype and then expect it to be translated to a numpy dtype?

@ilan-gold (Contributor Author) · Jan 30, 2025

This is for internal usage; otherwise I wouldn't have added it. I can delete the line, see what happens, and report back.

@ilan-gold (Contributor Author) · Jan 30, 2025

@dcherian This class is basically an internal adapter, so anything that asks for its data in numpy form will call this. repr, subtraction, and calling .values on an xarray object are a few examples.
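For readers outside the thread: the conversion path being discussed is numpy's `__array__` protocol. A minimal pandas-only illustration (plain public API, not xarray's adapter class):

```python
import numpy as np
import pandas as pd

# A pandas extension array has no numpy-native dtype.
cat = pd.Categorical(["a", "b", "a"])

# np.asarray falls back to the object's __array__ implementation,
# which materializes a plain ndarray (object dtype for categoricals).
arr = np.asarray(cat)
print(arr.dtype)  # object
print(list(arr))  # ['a', 'b', 'a']
```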

Contributor

I understand that. Why doesn't the last line (return super().__array__(dtype, copy=copy)) "just" handle this?

Contributor Author

Ah, good point, yes, it's unnecessary. Fixed.

) -> np.ndarray:
if dtype is None:
dtype = self.dtype
if pd.api.types.is_extension_array_dtype(dtype):
Contributor

Same here. Why is this needed?

@@ -6875,7 +6875,7 @@ def groupby(
[[nan, nan, nan],
[ 3., 4., 5.]]])
Coordinates:
-  * x_bins (x_bins) object 16B (5, 15] (15, 25]
+  * x_bins (x_bins) interval[int64, right] 16B (5, 15] (15, 25]
@dcherian (Contributor) · Jan 30, 2025

This is amazing: it enables IntervalIndex indexing now.

cc @benbovy
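The doctest change above swaps an object dtype for a real interval dtype; the indexing this enables is pandas' IntervalIndex lookup. A small pandas-only sketch, using the same (5, 15] and (15, 25] bins as the doctest:

```python
import pandas as pd

# Same bins as the doctest: (5, 15] and (15, 25], right-closed by default.
idx = pd.IntervalIndex.from_breaks([5, 15, 25])

# A scalar label is located by interval membership, not exact match.
print(idx.get_loc(10))  # 0 -> 10 falls in (5, 15]
print(idx.get_loc(20))  # 1 -> 20 falls in (15, 25]
```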

@dcherian
Contributor

@Illviljan or @headtr1ck, can you take a look at the typing failure please?

xarray/namedarray/core.py (outdated; resolved)
Comment on lines 838 to 840
if pd.api.types.is_extension_array_dtype(data_old.dtype):
# One of PandasExtensionArray or PandasIndexingAdapter?
ndata = data_old.array.to_numpy()
@Illviljan (Contributor) · Jan 30, 2025

pd.api.types.is_extension_array_dtype(data_old.dtype) does not imply that data_old is an extension array. You probably need some kind of isinstance check before you can use .array.

I haven't used extension arrays much myself; why can't a simple np.asarray(data_old) be used?

Contributor Author

Seems I was able to resolve both comments with no impact. I think pd.api.types.is_extension_array_dtype is correct here; from the docs (https://pandas.pydata.org/docs/reference/api/pandas.api.types.is_extension_array_dtype.html):

> This checks whether an object implements the pandas extension array interface. In pandas, this includes:

As for the to_numpy, I think I just forgot that the incoming object has an __array__ implementation.
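@Illviljan's caution can be made concrete with plain pandas: the check inspects the dtype, so a True result says nothing about whether the object itself exposes `.array`:

```python
import pandas as pd

ea = pd.array([1, 2, None], dtype="Int64")  # an extension array itself
ser = pd.Series(ea)                         # a Series backed by one

# True for both: the check looks at the dtype, not the object's class.
print(pd.api.types.is_extension_array_dtype(ea.dtype))  # True
print(pd.api.types.is_extension_array_dtype(ser))       # True

# But only the Series has an `.array` attribute.
print(hasattr(ser, "array"), hasattr(ea, "array"))      # True False
```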

if pd.api.types.is_extension_array_dtype(self._data) and isinstance(
self._data, PandasIndexingAdapter
):
return self._data.array
return self._data.get_duck_array()
Contributor

So get_duck_array should handle this by possibly converting to a numpy array. Is that not desired? It is ambiguous, to me at least, whether we consider ExtensionArrays a "duck array".

@ilan-gold (Contributor Author) · Feb 4, 2025

Right, at the moment get_duck_array on the PandasIndexingAdapter returns a numpy array, while the docstring states:

The Variable's data as an array. The underlying array type
        (e.g. dask, sparse, pint) is preserved.

which means we probably want to preserve the pandas array typing. But maybe the implementation of get_duck_array should be updated.
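The "underlying array type is preserved" distinction is visible in plain pandas as well: `.array` keeps the extension type, while a numpy conversion does not:

```python
import pandas as pd

s = pd.Series(pd.array([1, None, 3], dtype="Int64"))

# `.array` preserves the extension type (here a masked IntegerArray).
print(type(s.array).__name__)   # IntegerArray

# `to_numpy()` has no Int64-with-NA equivalent, so it falls back to object.
print(s.to_numpy().dtype)       # object
```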

@ilan-gold (Contributor Author) · Feb 4, 2025

@dcherian I see how deep the issue goes re: typing. I am not sure if you want the extension arrays to satisfy this constraint, but it would make the typing on NamedArray a bit more coherent (see 459f56b and the subsequent volley of failures as a result).

Contributor Author

For now I've just added a few ignores. I think this issue is a bit separate, since it doesn't actually affect the runtime behavior.

@Illviljan
Contributor

A surprise to me is that extension arrays don't implement __array_function__, so there's no promise that np.mean(extension_array) will work. NumPy has been clever and tries extension_array.mean before attempting anything else, so maybe it's a non-issue in practice?

@ilan-gold
Contributor Author

> So there's no promise that np.mean(extension_array) will work

Right, nor should it, given the number of extension array types there are (what's the mean of a categorical? of an interval?).

> Now numpy has been clever and tries extension_array.mean before attempting it, so maybe a non-issue in practice?

We have a wrapper around extension arrays that does implement __array_function__, so that might be why they fulfill the NamedArray protocol. I can look into loosening the dtype restrictions on NamedArray, and that might help.
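The kind of wrapper mentioned here can be sketched in a few lines. This is a hypothetical minimal stand-in, not xarray's actual PandasExtensionArray; it only shows how __array_function__ lets np.mean dispatch through a wrapper:

```python
import numpy as np
import pandas as pd

class WrappedExtensionArray:
    """Hypothetical minimal wrapper: delegate numpy functions by
    converting the wrapped extension array to a plain ndarray."""

    def __init__(self, array):
        self.array = array

    def __array__(self, dtype=None, copy=None):
        return np.asarray(self.array, dtype=dtype)

    def __array_function__(self, func, types, args, kwargs):
        # Unwrap any wrapper arguments to ndarrays, then call func.
        unwrap = lambda a: np.asarray(a) if isinstance(a, WrappedExtensionArray) else a
        return func(*(unwrap(a) for a in args), **kwargs)

w = WrappedExtensionArray(pd.array([1.0, 2.0, 3.0], dtype="Float64"))
print(np.mean(w))  # 2.0 -- dispatched via __array_function__
```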

@ilan-gold
Contributor Author

ilan-gold commented Feb 7, 2025

I think part of the reason this is leaking is that as_compatible_data returns a generic T_DuckArray, which has only an Any bound (and Variable allows its data initialization argument to be T_DuckArray). So that's why nothing has complained before:

def __init__(
    self,
    dims,
    data: T_DuckArray | ArrayLike,
    attrs=None,
    encoding=None,
    fastpath=False,
):
    """
    Parameters
    ----------
    dims : str or sequence of str
        Name(s) of the data dimension(s). Must be either a string (only
        for 1D data) or a sequence of strings with length equal to the
        number of dimensions.
    data : array_like
        Data array which supports numpy-like data access.
    attrs : dict_like or None, optional
        Attributes to assign to the new variable. If None (default), an
        empty attribute dictionary is initialized.
        (see FAQ, :ref:`approach to metadata`)
    encoding : dict_like or None, optional
        Dictionary specifying how to encode this array's data into a
        serialized format like netCDF4. Currently used keys (for netCDF)
        include '_FillValue', 'scale_factor', 'add_offset' and 'dtype'.
        Well-behaved code to serialize a Variable should ignore
        unrecognized encoding items.
    """
    super().__init__(
        dims=dims, data=as_compatible_data(data, fastpath=fastpath), attrs=attrs
    )

I do think this is basically a non-issue, since PandasIndexingAdapter can also be returned from as_compatible_data, and that also doesn't fulfill the NamedArray typing scheme: it doesn't implement imag or real on _arrayfunction (or even __array_function__, which at least the PandasExtensionArray class does!):

@property
def imag(self) -> _arrayfunction[_ShapeType_co, Any]: ...
@property
def real(self) -> _arrayfunction[_ShapeType_co, Any]: ...

IMO we should punt on the typing issue because it existed before this PR, as far as I can tell.
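The Any-bound point above generalizes; a simplified reproduction (hypothetical names mirroring xarray's, not its actual definitions) shows why mypy never objects:

```python
from typing import Any, TypeVar

# A bound of Any places no real constraint: every object satisfies it.
T_DuckArray = TypeVar("T_DuckArray", bound=Any)

def as_duck(data: T_DuckArray) -> T_DuckArray:
    return data

# Both calls type-check, duck array or not.
print(as_duck([1, 2, 3]))
print(as_duck("not an array"))
```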

@headtr1ck
Collaborator

I'm OK with leaving the ignore, but please open a new issue about it so we can keep track.

8 participants