
(fix): extension array indexers #9671

Open · wants to merge 201 commits into base: main
Conversation

ilan-gold
Contributor

Identical to kmuehlbauer#1 - probably not very helpful in terms of changes since https://github.com/kmuehlbauer/xarray/tree/any-time-resolution-2 contains most of it....

kmuehlbauer and others added 30 commits October 18, 2024 07:31
…ore/variable.py to use any-precision datetime/timedelta with automatic inferring of resolution
…t resolution, fix code and tests to allow this
… more carefully, for now using pd.Series to convert `OMm` type datetimes/timedeltas (will result in ns precision)
…rray` series creating an extension array when `.array` is accessed
@ilan-gold
Contributor Author

Great @kmuehlbauer - I want the maintainers to look at the MyPy failures. I could fix them in theory, but I would basically be guessing at what they want the classes' return types to be.

) -> np.ndarray:
if dtype is None:
dtype = self.dtype
if pd.api.types.is_extension_array_dtype(dtype):
Contributor

Is this needed? Why would someone call np.array with an extension dtype and then expect it to be translated to a numpy dtype?

@ilan-gold (Contributor Author) · Jan 30, 2025

This is for internal usage; otherwise I wouldn't have added it. I can delete the line, see what happens, and report back.

@ilan-gold (Contributor Author) · Jan 30, 2025

@dcherian This class is basically an internal adapter, so anything that asks for its data in numpy form will call this. repr, subtraction, and calling .values on an xarray object are a few examples.
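For readers outside the thread: the conversion path being discussed is numpy's `__array__` protocol. A minimal pandas-only illustration (plain public API, not xarray's adapter class):

```python
import numpy as np
import pandas as pd

# A pandas extension array has no numpy-native dtype.
cat = pd.Categorical(["a", "b", "a"])

# np.asarray falls back to the object's __array__ implementation,
# which materializes a plain ndarray (object dtype for categoricals).
arr = np.asarray(cat)
print(arr.dtype)  # object
print(list(arr))  # ['a', 'b', 'a']
```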

Contributor

I understand that. Why doesn't the last line (return super().__array__(dtype, copy=copy)) "just" handle this?

Contributor Author

Ah, good point, yes, it's unnecessary. Fixed.

) -> np.ndarray:
if dtype is None:
dtype = self.dtype
if pd.api.types.is_extension_array_dtype(dtype):
Contributor

Same here. Why is this needed?

@@ -6875,7 +6875,7 @@ def groupby(
[[nan, nan, nan],
[ 3., 4., 5.]]])
Coordinates:
-  * x_bins (x_bins) object 16B (5, 15] (15, 25]
+  * x_bins (x_bins) interval[int64, right] 16B (5, 15] (15, 25]
@dcherian (Contributor) · Jan 30, 2025

This is amazing: it enables IntervalIndex indexing now.

cc @benbovy
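The doctest change above swaps an object dtype for a real interval dtype; the indexing this enables is pandas' IntervalIndex lookup. A small pandas-only sketch, using the same (5, 15] and (15, 25] bins as the doctest:

```python
import pandas as pd

# Same bins as the doctest: (5, 15] and (15, 25], right-closed by default.
idx = pd.IntervalIndex.from_breaks([5, 15, 25])

# A scalar label is located by interval membership, not exact match.
print(idx.get_loc(10))  # 0 -> 10 falls in (5, 15]
print(idx.get_loc(20))  # 1 -> 20 falls in (15, 25]
```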

@dcherian
Contributor

@Illviljan or @headtr1ck, can you take a look at the typing failure please?

xarray/namedarray/core.py (outdated; resolved)
Comment on lines 838 to 840
if pd.api.types.is_extension_array_dtype(data_old.dtype):
# One of PandasExtensionArray or PandasIndexingAdapter?
ndata = data_old.array.to_numpy()
@Illviljan (Contributor) · Jan 30, 2025

pd.api.types.is_extension_array_dtype(data_old.dtype) does not imply that data_old is an extension array. You probably need some kind of isinstance check before you can use .array.

I haven't used extension arrays much myself; why can't a simple np.asarray(data_old) be used?

Contributor Author

Seems I was able to resolve both comments with no impact. I think pd.api.types.is_extension_array_dtype is correct here; from the docs (https://pandas.pydata.org/docs/reference/api/pandas.api.types.is_extension_array_dtype.html):

> This checks whether an object implements the pandas extension array interface. In pandas, this includes:

As for the to_numpy, I think I just forgot that the incoming object has an __array__ implementation.
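@Illviljan's caution can be made concrete with plain pandas: the check inspects the dtype, so a True result says nothing about whether the object itself exposes `.array`:

```python
import pandas as pd

ea = pd.array([1, 2, None], dtype="Int64")  # an extension array itself
ser = pd.Series(ea)                         # a Series backed by one

# True for both: the check looks at the dtype, not the object's class.
print(pd.api.types.is_extension_array_dtype(ea.dtype))  # True
print(pd.api.types.is_extension_array_dtype(ser))       # True

# But only the Series has an `.array` attribute.
print(hasattr(ser, "array"), hasattr(ea, "array"))      # True False
```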

if pd.api.types.is_extension_array_dtype(self._data) and isinstance(
self._data, PandasIndexingAdapter
):
return self._data.array
return self._data.get_duck_array()
Contributor

So get_duck_array should handle this by possibly converting to a numpy array. Is that not desired? It is ambiguous, to me at least, whether we consider ExtensionArrays a "duck array".

@ilan-gold (Contributor Author) · Feb 4, 2025

Right, at the moment get_duck_array on the PandasIndexingAdapter returns a numpy array, while the docstring states:

The Variable's data as an array. The underlying array type
        (e.g. dask, sparse, pint) is preserved.

which means we probably want to preserve the pandas array typing. But maybe the implementation of get_duck_array should be updated.
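The "underlying array type is preserved" distinction is visible in plain pandas as well: `.array` keeps the extension type, while a numpy conversion does not:

```python
import pandas as pd

s = pd.Series(pd.array([1, None, 3], dtype="Int64"))

# `.array` preserves the extension type (here a masked IntegerArray).
print(type(s.array).__name__)   # IntegerArray

# `to_numpy()` has no Int64-with-NA equivalent, so it falls back to object.
print(s.to_numpy().dtype)       # object
```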

@ilan-gold (Contributor Author) · Feb 4, 2025

@dcherian I see how deep the issue goes re: typing. I am not sure if you want the extension arrays to satisfy this constraint, but it would make the typing on NamedArray a bit more coherent (see 459f56b and the subsequent volley of failures as a result).

Contributor Author

For now I've just added a few ignores. I think this issue is a bit separate, since it doesn't actually affect the runtime behavior.

@Illviljan
Contributor

A surprise to me is that extension arrays don't implement __array_function__, so there's no promise that np.mean(extension_array) will work. NumPy has been clever and tries extension_array.mean before attempting anything else, so maybe it's a non-issue in practice?

@ilan-gold
Contributor Author

> So there's no promise that np.mean(extension_array) will work

Right, nor should it, given the number of extension array types there are (what's the mean of a categorical? of an interval?).

> Now numpy has been clever and tries extension_array.mean before attempting it, so maybe a non-issue in practice?

We have a wrapper around extension arrays that does implement __array_function__, so that might be why they fulfill the NamedArray protocol. I can look into loosening the dtype restrictions on NamedArray, and that might help.
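The kind of wrapper mentioned here can be sketched in a few lines. This is a hypothetical minimal stand-in, not xarray's actual PandasExtensionArray; it only shows how __array_function__ lets np.mean dispatch through a wrapper:

```python
import numpy as np
import pandas as pd

class WrappedExtensionArray:
    """Hypothetical minimal wrapper: delegate numpy functions by
    converting the wrapped extension array to a plain ndarray."""

    def __init__(self, array):
        self.array = array

    def __array__(self, dtype=None, copy=None):
        return np.asarray(self.array, dtype=dtype)

    def __array_function__(self, func, types, args, kwargs):
        # Unwrap any wrapper arguments to ndarrays, then call func.
        unwrap = lambda a: np.asarray(a) if isinstance(a, WrappedExtensionArray) else a
        return func(*(unwrap(a) for a in args), **kwargs)

w = WrappedExtensionArray(pd.array([1.0, 2.0, 3.0], dtype="Float64"))
print(np.mean(w))  # 2.0 -- dispatched via __array_function__
```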

@ilan-gold
Contributor Author

ilan-gold commented Feb 7, 2025

I think part of the reason this is leaking is that as_compatible_data returns a generic T_DuckArray, which has only an Any bound (and Variable allows its data initialization argument to be T_DuckArray). So that's why nothing has complained before:

def __init__(
    self,
    dims,
    data: T_DuckArray | ArrayLike,
    attrs=None,
    encoding=None,
    fastpath=False,
):
    """
    Parameters
    ----------
    dims : str or sequence of str
        Name(s) of the data dimension(s). Must be either a string (only
        for 1D data) or a sequence of strings with length equal to the
        number of dimensions.
    data : array_like
        Data array which supports numpy-like data access.
    attrs : dict_like or None, optional
        Attributes to assign to the new variable. If None (default), an
        empty attribute dictionary is initialized.
        (see FAQ, :ref:`approach to metadata`)
    encoding : dict_like or None, optional
        Dictionary specifying how to encode this array's data into a
        serialized format like netCDF4. Currently used keys (for netCDF)
        include '_FillValue', 'scale_factor', 'add_offset' and 'dtype'.
        Well-behaved code to serialize a Variable should ignore
        unrecognized encoding items.
    """
    super().__init__(
        dims=dims, data=as_compatible_data(data, fastpath=fastpath), attrs=attrs
    )

I do think this is basically a non-issue, since PandasIndexingAdapter can also be returned from as_compatible_data, and that also doesn't fulfill the NamedArray typing scheme: it doesn't implement imag or real on _arrayfunction (or even __array_function__, which at least the PandasExtensionArray class does!):

@property
def imag(self) -> _arrayfunction[_ShapeType_co, Any]: ...
@property
def real(self) -> _arrayfunction[_ShapeType_co, Any]: ...

IMO we should punt on the typing issue because it existed before this PR, as far as I can tell.
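The Any-bound point above generalizes; a simplified reproduction (hypothetical names mirroring xarray's, not its actual definitions) shows why mypy never objects:

```python
from typing import Any, TypeVar

# A bound of Any places no real constraint: every object satisfies it.
T_DuckArray = TypeVar("T_DuckArray", bound=Any)

def as_duck(data: T_DuckArray) -> T_DuckArray:
    return data

# Both calls type-check, duck array or not.
print(as_duck([1, 2, 3]))
print(as_duck("not an array"))
```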

@headtr1ck
Collaborator

I'm OK with leaving the ignore, but please open a new issue about it so we can keep track.

8 participants