TRACKER: new default String dtype (pyarrow-backed, numpy NaN semantics) #54792

jorisvandenbossche · 2023-08-28T09:47:08Z

Overview of work for the future string dtype (PDEP-14).

Main implementation:

Testing related:

Open design questions / behaviour changes to implement:

API: string dtype propagation of NaNs in predicate methods (eg .str.startswith) #54805
API: behaviour for the "str.center()" string method for the pyarrow-backed string dtype #54807
Default string dtype should not raise fallback performance warnings #58581

Known bugs that need to be fixed:

Documentation:

List and document all breaking / behaviour changes in upgrade guide
- String dtype: overview of breaking behaviour changes #59328

[original issue body]

With PDEP-10 (#52711, https://pandas.pydata.org/pdeps/0010-required-pyarrow-dependency.html), we decided to start using pyarrow for the default string data type in pandas 3.0.

For pandas 2.1, an option was added to already enable this future default data type, and then the various ways to construct a DataFrame (type inference in the constructor, IO methods) will use the new string dtype as default:

>>> pd.options.future.infer_string = True

>>> pd.Series(["a", "b", None])
0      a
1      b
2    NaN
dtype: string

>>> pd.Series(["a", "b", None]).dtype
string[pyarrow_numpy]

This is documented at https://pandas.pydata.org/docs/dev/whatsnew/v2.1.0.html#whatsnew-210-enhancements-infer-strings

One aspect that was discussed after the PDEP (mostly at the sprint, I think; creating this issue for a better public record of it), is that for a data type that would become the default in pandas 3.0 (which for the rest still uses all numpy dtypes with numpy NaN missing value semantics), should probably also still use the same default semantics and result in numpy data types when doing operations on the string column that result in a boolean or numeric data type (eg .str.startswith(..), .str.len(..), .str.count(), etc, or comparison operators like ==).
(this way, a user only gets an ArrowDtype column when explicitly asking for those, and not by default through using a the default string dtype)

To achieve this, @phofl has done several PRs to refactor the current pyarrow-based string dtypes, to add another variant which uses StringDtype(storage="pyarrow_numpy") instead of ArrowDtype("string"). From the updated whatsnew: "This is a new string dtype implementation that follows NumPy semantics in comparison operations and will return np.nan as the missing value indicator". Main PR:

Implement Arrow String Array that is compatible with NumPy semantics #54533

plus some follow-ups (#54720, #54585, #54591).

cc @pandas-dev/pandas-core

The text was updated successfully, but these errors were encountered:

jorisvandenbossche · 2023-08-28T10:01:51Z

One general question is about naming: currently, this data type is implemented as StringDtype(storage="pyarrow_numpy") (a parametrized instance of the existing StringDtype), thus using the "pyarrow_numpy" name (which can also be seen in the repr of the dtype, as "string[pyarrow_numpy]").
And the actual array class used is ArrowStringArrayNumpySemantics, expanding the existing ArrowStringArray.

I don't think "pyarrow_numpy" is a great name, but we also couldn't directly think of something better. The StringDtype(storage="pyarrow") is already taken by the existing version of that (using "nullable" semantics), and changing that to mean this new string dtype would break existing users of this (although we officially still label this as experimental, and could still change it).

In general, ideally not too many users should actually directly use the term "pyarrow_numpy". When the future option is enabled, I think we should ensure one can simply use eg astype("string") and get this new default (without having to do astype("string[pyarrow_numpy]") (note: this is currently not yet the case, opening a separate issue to discuss this -> #54793)

nickeubank · 2023-09-11T14:28:21Z

I recognize I'm late to this, but out of curiosity, why use pyarrow string arrays instead of using numpy structured types for unicode strings (e.g., dtype='<U12' or whatever length)?

I understand the performance issues with pandas object dtype (it's just an array of references to Python strings), but I thought the structured numpy unicode dtypes avoided all these issues, and wouldn't require a mixed-syntax/implementation.

bashtage · 2023-09-11T14:35:57Z

I recognize I'm late to this, but out of curiosity, why use pyarrow string arrays instead of using numpy structured types for unicode strings (e.g., dtype='<U12' or whatever length)?

NumPy only supports rectangular arrays of strings. So '<U12" requires 12*4 bytes (used UTF-32 encoding) for every entry irrespective of size or the characters used. More efficient storage methods use ragged arrays where usually 2 things are stores in the array, a memory address of the actual UTF8 string and the length of the string. Consider the sime example

a
abcdefghijkl

In NumPy this array requires 96 buytes for storage (+ overheads). In an efficient encoding this requires something on the order of 1 (a) + 8 (memory address) + 8 (length, assuming int64) + 12 (other string) + 8 + 8 = 45, which is about 50%. If an array is very sparse (say has one very long string, and the rest short, then the ratio os space required can get really bad).

NumPy is working on 1st party support for ragged UTF strings in NumPy 2.0(ish) which requires a new way to define dtypes..

nickeubank · 2023-09-11T14:53:24Z

Ah, ok, thanks! I knew about that issue, but hadn't realized Arrow strings did something different (which I infer is the case from context). Appreciate the clarification!

WillAyd · 2024-02-16T13:42:07Z

After discussing this on slack I don't think that the new string dtype should use pyarrow storage with numpy NaN semantics. That may help internal development transition to 3.0, but makes for a confusing long term strategy for our end users. I feel like we are going to pay a heavy price over time to constantly clarify what the return type should be for various algorithms

@phofl pointed out to me in slack that we would have the following string types:

string[pyarrow_numpy]
string[pyarrow]
pd.ArrowDtype(pa.string())

And with NumPy 2.0 there is the potential for a native NumPy string dtype.

So if we take these and apply an operation like .str.len you may get the following return types:

string[pyarrow_numpy] - yields np.int64 or np.float64 (depending on presence of NaNs)
string[pyarrow] - yields Int64
pd.ArrowDtype(pa.string()) - yields int64[pyarrow]
NumPy 2.0 string type - TBD

The only type I was expecting from the above to be yielded is int64[pyarrow], which is also tucked away in arguably the most obscure string dtype. Stated explicitly, I think pandas 3.0 should use string[pyarrow] as the default and have algorithms applied to that return pyarrow types

The sheer number of iterations that can be produced from these different variants gets very confusing; I think the simpler story in the long term is that we just have NumPy / Arrow arrays and any algorithms applied to those yield whatever the underlying library provides

jorisvandenbossche · 2024-02-16T14:48:30Z

That may help internal development transition to 3.0

To be clear, the motivation of this was not for internal development ease (no, we actually needed to add yet another ExtensionArray for it), but to help user transition (less breaking change for users)

The sheer number of iterations that can be produced from these different variants gets very confusing

I won't deny that this listing is confusing ... but I would personally argue that the current plan reduces the number of variants for users. Currently, with the planned "pyarrow_numpy" variant as the default string dtype (and without enabling a custom experimental setting), a user will only see 1 string dtype, and only 1 kind of integer type, only 1 bool type, etc.

If we would let the default string dtype return pyarrow types, and you have a workflow with some operations that involve string columns, you can end up with a dataframe with both numpy and pyarrow dtypes, while the user never asked for pyarrow columns in the first place. And then you have one numeric column that uses NaN as the missing value, and one numeric column that use NA as the missing value (and treats NaN as not missing). Or you have one datetime column that has a datetime64[ns] dtype, and another datetime column that uses timestamp[us][pyarrow].
I think such a situation is much more confusing. Especially because this would happen for all users by default, while you only get the variations in string dtypes you list above when specifically enabling some option.

I think the simpler story in the long term is that we just have NumPy / Arrow arrays and any algorithms applied to those yield whatever the underlying library provides

We are certainly not there (and need to discuss this more), but IMO the simpler story in the long term is that we just have pandas arrays and data types (and the average user doesn't have to care about the whether it's numpy or pyarrow under the hood)

WillAyd · 2024-02-16T15:21:54Z

Thanks for those clarifications @jorisvandenbossche - very insightful.

If we would let the default string dtype return pyarrow types, and you have a workflow with some operations that involve string columns, you can end up with a dataframe with both numpy and pyarrow dtypes, while the user never asked for pyarrow columns in the first place

I recognize this is not ideal but I'm also not sure it is that big of a problem given pandas type system history. Is it that different from:

ser = pd.Series(["abc", None])
ser.str.len()
0    3.0
1    NaN
dtype: float64

Giving a different return type than:

ser = pd.Series(["abc", "def"])
ser.str.len()
0    3
1    3
dtype: int64

? Especially for primitive types I don't see the distinction between pyarrow / numpy array types being all that important, particularly since those can be zero-copy.

but IMO the simpler story in the long term is that we just have pandas arrays and data types (and the average user doesn't have to care about the whether it's numpy or pyarrow under the hood)

The problem I foresee with this is it liimits what users to do to the common denominator of the underlying libraries. If coming from Arrow, you lose streaming support, bitmasking and nullability handling when trying to make a compatability layer with NumPy. For the inverse, your arrays become limited to 1-D. For types that exist in one library or the other, we would arguably be adding another layer that just isn't necessary. I think doing this prevents us from really utilizing the strengths of either library

If users wanted to stick with NumPy semantics exclusively I think the new NumPy string dtype should be the right choice in the long term. I don't believe that existed at the time of this original conversation, but it may now negate the need for a pyarrow_numpy string dtype. Over the long run having IO methods that use a dtype_backend="numpy_nullable" but that don't return NumPy strings I think is also going to be confusing

rhshadrach · 2024-02-16T21:12:26Z

Especially for primitive types I don't see the distinction between pyarrow / numpy array types being all that important,

I think the way they treat NA values differently in comparisons to be quite important.

Dr-Irv · 2024-02-16T22:10:31Z

I think the way they treat NA values differently in comparisons to be quite important.

I agree. TBH, it seems like the decision to make the semantics of using pyarrow strings like numpy in terms of missing values is related to the issue of how we do numpy to pandas conversions when we have pd.NA and np.nan present in a float array.

As I understand it, the pyarrow semantics for missing values is to use a missing value sentinel similar to pd.NA (I think that is pyarrow.NA). That's in contrast with the numpy semantics of using np.nan to represent a missing value. So the string[pyarrow_numpy] makes those 2 sentinels equivalent. But then we have to decide when we do operations (e.g., len()) that are not returning a string type on a pyarrow string array that has missing values whether we return np.nan or pd.NA or pyarrow.NA to correspond to the string missing value entries, or, equivalently returning a numpy array using np.nan, or a pyarrow array using pyarrow.NA, or a pandas extension array using pd.NA.

WillAyd · 2024-02-16T22:54:07Z

If you don't want pyarrow nullability what is the advantage of using a pyarrow array with numpy semantics versus just a numpy array?

As I understand it, the pyarrow semantics for missing values is to use a missing value sentinel similar to pd.NA (I think that is pyarrow.NA). That's in contrast with the numpy semantics of using np.nan to represent a missing value.

Arrow uses a validity bitmask whereas numpy doesn't offer anything outside of IEEE 754 floating point arithmetic, which is still applicable within Arrow computations

>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> import numpy as np
>>> pc.equal(pa.array([1., None, np.nan]), pa.array([1., None, np.nan]))
<pyarrow.lib.BooleanArray object at 0x79189d8ad960>
[
  true,
  null,
  false
]

Though I'm not clear on why this matters for algorithms against a string type?

phofl · 2024-02-16T22:58:22Z

Though I'm not clear on why this matters for algorithms against a string type?

I don't think we are talking about the same thing. Even if we agree that it doesn't matter for string columns, it matters for all columns that you create from the string columns, e.g.

ser.str.len returns int64[pyarrow] and thus you have a columns that behaves different than your neighbouring columns with int64, this is a very very bad ux

WillAyd · 2024-02-16T23:23:33Z

How does a pyarrow[int64] behave differently than a np.int64? The underlying data buffers for these is going to be identical

rhshadrach · 2024-02-17T11:19:44Z

@WillAyd - this code successfully detects NA values with NumPy dtypes, but not pyarrow:

df1 = pd.DataFrame({"a": [1, 1, 2]}, dtype="int64[pyarrow]")
df2 = pd.DataFrame({"a": [1, 3], "b": [6, 7]}, dtype="int64[pyarrow]")
result = df1.merge(df2, on="a", how="left")
# Check for NA values
if (result["b"] != result["b"]).any():
    raise ValueError("Whoopsie!")

WillAyd · 2024-02-17T15:46:01Z

Might be some confusion over null versus nan. pyarrow works like what is discussed in #32265 . null uses Kleene logic for comparisons whereas NaN != NaN by definition in IEEE 754. You can use pc.is_null and/or pc.is_nan to distinguish as needed. Our own isna implementation must already wrap the latter

>>> result["b"].isna()
0    False
1    False
2     True
Name: b, dtype: bool

rhshadrach · 2024-02-17T17:48:31Z

@WillAyd - sure, but the difference doesn't just come up when you are looking for NA values - it can impact the result of any comparison. I just gave one example.

jorisvandenbossche added API Design Strings String extension data type and string data Arrow pyarrow functionality labels Aug 28, 2023

This was referenced Aug 28, 2023

Use NaN as na_value for new pyarrow_numpy StringDtype #54585

Merged

Implement Arrow String Array that is compatible with NumPy semantics #54533

Merged

Infer strings as pyarrow_numpy backed strings #54720

Merged

jorisvandenbossche mentioned this issue Oct 26, 2023

Adjust Series specific tests for string option #55538

Merged

5 tasks

jorisvandenbossche added this to the 3.0 milestone Oct 26, 2023

jorisvandenbossche mentioned this issue Nov 30, 2023

BUG: new string dtype fails with >2 GB of data in a single column #56259

Closed

jorisvandenbossche mentioned this issue Apr 27, 2024

DISC: Consider not requiring PyArrow in 3.0 #57073

Open

This was referenced Oct 9, 2024

RLS: 2.3 #59664

Open

BUG (string dtype): fix inplace mutation with copy=False in ensure_string_array #59756

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TRACKER: new default String dtype (pyarrow-backed, numpy NaN semantics) #54792

TRACKER: new default String dtype (pyarrow-backed, numpy NaN semantics) #54792

jorisvandenbossche commented Aug 28, 2023 •

edited

Loading

jorisvandenbossche commented Aug 28, 2023 •

edited

Loading

nickeubank commented Sep 11, 2023

bashtage commented Sep 11, 2023

nickeubank commented Sep 11, 2023

WillAyd commented Feb 16, 2024

jorisvandenbossche commented Feb 16, 2024 •

edited

Loading

WillAyd commented Feb 16, 2024

rhshadrach commented Feb 16, 2024

Dr-Irv commented Feb 16, 2024

WillAyd commented Feb 16, 2024

phofl commented Feb 16, 2024

WillAyd commented Feb 16, 2024

rhshadrach commented Feb 17, 2024 •

edited

Loading

WillAyd commented Feb 17, 2024

rhshadrach commented Feb 17, 2024

TRACKER: new default String dtype (pyarrow-backed, numpy NaN semantics) #54792

TRACKER: new default String dtype (pyarrow-backed, numpy NaN semantics) #54792

Comments

jorisvandenbossche commented Aug 28, 2023 • edited Loading

jorisvandenbossche commented Aug 28, 2023 • edited Loading

nickeubank commented Sep 11, 2023

bashtage commented Sep 11, 2023

nickeubank commented Sep 11, 2023

WillAyd commented Feb 16, 2024

jorisvandenbossche commented Feb 16, 2024 • edited Loading

WillAyd commented Feb 16, 2024

rhshadrach commented Feb 16, 2024

Dr-Irv commented Feb 16, 2024

WillAyd commented Feb 16, 2024

phofl commented Feb 16, 2024

WillAyd commented Feb 16, 2024

rhshadrach commented Feb 17, 2024 • edited Loading

WillAyd commented Feb 17, 2024

rhshadrach commented Feb 17, 2024

jorisvandenbossche commented Aug 28, 2023 •

edited

Loading

jorisvandenbossche commented Aug 28, 2023 •

edited

Loading

jorisvandenbossche commented Feb 16, 2024 •

edited

Loading

rhshadrach commented Feb 17, 2024 •

edited

Loading