TRACKER: new default String dtype (pyarrow-backed, numpy NaN semantics) #54792
Comments
One general question is about naming: currently, this data type is implemented as StringDtype(storage="pyarrow_numpy"). I don't think "pyarrow_numpy" is a great name, but we also couldn't directly think of something better. In general, ideally not too many users should actually directly use the term "pyarrow_numpy". When the future option is enabled, I think we should ensure one can simply use e.g. "string" to refer to this dtype.
I recognize I'm late to this, but out of curiosity, why use pyarrow string arrays instead of using numpy structured types for unicode strings (e.g., '<U12')? I understand the performance issues with pandas object-dtype strings.
NumPy only supports rectangular arrays of strings. So '<U12' requires 12*4 bytes (using UTF-32 encoding) for every entry, irrespective of the size or the characters used. More efficient storage methods use ragged arrays, where usually two things are stored in the array: the memory address of the actual UTF-8 string and the length of the string. Consider the simple example:
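A minimal sketch of such an array (assuming one 1-character string and one 12-character string, consistent with the byte counts below):

>>> import numpy as np
>>> arr = np.array(["a", "abcdefghijkl"])  # one 1-char string, one 12-char string
>>> arr.dtype
dtype('<U12')
>>> arr.nbytes  # 2 entries * 12 characters * 4 bytes (UTF-32) each
96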
In NumPy this array requires 96 bytes for storage (plus overheads). In an efficient ragged encoding this requires something on the order of 1 ("a") + 8 (memory address) + 8 (length, assuming int64) + 12 (the other string) + 8 + 8 = 45 bytes, which is about half. If an array is very sparse (say it has one very long string and the rest short), then the ratio of space required can get really bad. NumPy is working on first-party support for ragged UTF-8 strings in NumPy 2.0(ish), which requires a new way to define dtypes.
Ah, ok, thanks! I knew about that issue, but hadn't realized Arrow strings did something different (which I infer is the case from context). Appreciate the clarification!
After discussing this on Slack, I don't think that the new string dtype should use pyarrow storage with numpy NaN semantics. That may help the internal development transition to 3.0, but it makes for a confusing long-term strategy for our end users. I feel like we are going to pay a heavy price over time constantly clarifying what the return type should be for various algorithms. @phofl pointed out to me in Slack that we would end up with several coexisting string types (object, the python- and pyarrow-backed StringDtype variants, the new "pyarrow_numpy" variant, and ArrowDtype strings).
And with NumPy 2.0 there is the potential for a native NumPy string dtype on top of that. So if we take these and apply an operation that returns a boolean or numeric result, the number of possible return types multiplies quickly.
The only return type I was expecting from the above is whatever the underlying library itself provides. The sheer number of combinations that can be produced from these different variants gets very confusing; I think the simpler story in the long term is that we just have NumPy / Arrow arrays, and any algorithms applied to those yield whatever the underlying library provides.
To be clear, the motivation for this was not internal development ease (no, we actually needed to add yet another ExtensionArray for it), but to help users transition (less of a breaking change for users).
I won't deny that this listing is confusing ... but I would personally argue that the current plan reduces the number of variants for users. Currently, with the planned "pyarrow_numpy" variant as the default string dtype (and without enabling a custom experimental setting), a user will only see one string dtype, only one kind of integer dtype, only one bool dtype, etc. If we would let the default string dtype return pyarrow types, and you have a workflow with some operations that involve string columns, you can end up with a dataframe with both numpy and pyarrow dtypes, while the user never asked for pyarrow columns in the first place. And then you have one numeric column that uses NaN as the missing value, and one numeric column that uses NA as the missing value (and treats NaN as not missing). Or you have one datetime column that has a datetime64[ns] dtype, and another datetime column that uses timestamp[us][pyarrow].
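A small sketch of the mixed-dtype situation being described, assuming the string column returned pyarrow result types (here forced explicitly via ArrowDtype; exact dtype reprs vary by pandas version):

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"n": [1, 2, 3]})                                       # plain numpy int64 column
df["s"] = pd.array(["a", None, "ccc"], dtype=pd.ArrowDtype(pa.string()))  # pyarrow-backed string column
df["s_len"] = df["s"].str.len()   # pyarrow-backed integer column, missing value is <NA>, not NaN

print(df.dtypes)   # "n" stays numpy int64 while "s_len" gets a pyarrow integer dtype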
We are certainly not there (and need to discuss this more), but IMO the simpler story in the long term is that we just have pandas arrays and data types (and the average user doesn't have to care about whether it's numpy or pyarrow under the hood).
Thanks for those clarifications @jorisvandenbossche - very insightful.
I recognize this is not ideal, but I'm also not sure it is that big of a problem given pandas' type system history. Is it that different from:

ser = pd.Series(["abc", None])
ser.str.len()
0    3.0
1    NaN
dtype: float64

giving a different return type than the pyarrow-backed equivalent?
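Presumably the contrast is with a pyarrow-backed string column; a minimal sketch of that (the dtype choice is an assumption, output shown as expected under recent pandas with pyarrow installed):

import pandas as pd
import pyarrow as pa

ser_pa = pd.Series(["abc", None], dtype=pd.ArrowDtype(pa.string()))
ser_pa.str.len()
0       3
1    <NA>
dtype: int64[pyarrow]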
Especially for primitive types I don't see the distinction between pyarrow / numpy array types being all that important, particularly since those can be zero-copy.
The problem I foresee with this is that it limits what users can do to the common denominator of the underlying libraries. If coming from Arrow, you lose streaming support, bitmasking and nullability handling when trying to make a compatibility layer with NumPy. For the inverse, your arrays become limited to 1-D. For types that exist in one library or the other, we would arguably be adding another layer that just isn't necessary. I think doing this prevents us from really utilizing the strengths of either library. If users wanted to stick with NumPy semantics exclusively, I think the new NumPy string dtype should be the right choice in the long term. I don't believe that existed at the time of this original conversation, but it may now negate the need for a separate NaN-based variant.
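For reference, a minimal sketch of that NumPy-native variable-width string dtype (requires NumPy >= 2.0; shown only to illustrate what "NumPy semantics exclusively" could look like):

import numpy as np

# variable-width UTF-8 string dtype, new in NumPy 2.0
arr = np.array(["a", "a much longer string"], dtype=np.dtypes.StringDType())
print(arr.dtype)     # StringDType()
print(arr == "a")    # element-wise comparison returns a plain numpy bool array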
I think the way they treat NA values differently in comparisons is quite important.
I agree. TBH, the decision to pair pyarrow storage with numpy NaN semantics is the part I find confusing. As I understand it, the motivation was to limit the behaviour change for existing users.
If you don't want pyarrow nullability, what is the advantage of using a pyarrow array with numpy semantics versus just a numpy array?
Arrow uses a validity bitmask, whereas numpy doesn't offer anything outside of IEEE 754 floating point arithmetic, which is still applicable within Arrow computations:

>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> import numpy as np
>>> pc.equal(pa.array([1., None, np.nan]), pa.array([1., None, np.nan]))
<pyarrow.lib.BooleanArray object at 0x79189d8ad960>
[
true,
null,
false
]

Though I'm not clear on why this matters for algorithms against a string type?
I don't think we are talking about the same thing. Even if we agree that it doesn't matter for string columns, it matters for all columns that you create from the string columns: e.g. ser.str.len returns int64[pyarrow], and thus you have a column that behaves differently than your neighbouring columns with int64. This is very, very bad UX.
How does an int64[pyarrow] column behave differently from a plain int64 column?
@WillAyd - this code successfully detects NA values with NumPy dtypes, but not pyarrow:
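A minimal sketch of that kind of check (the result frame and its contents are assumptions here), relying on NaN != NaN evaluating to True under numpy semantics:

import numpy as np
import pandas as pd

result = pd.DataFrame({"b": [1.0, 2.0, np.nan]})   # plain numpy float64 column
print(result["b"] != result["b"])                  # True only for the NaN row, so this flags the missing value

result_pa = pd.DataFrame({"b": pd.array([1.0, 2.0, None], dtype="float64[pyarrow]")})
print(result_pa["b"] != result_pa["b"])            # null != null propagates <NA>, so the missing row is not flagged as True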
Might be some confusion over null versus NaN. pyarrow works like what is discussed in #32265: null uses Kleene logic for comparisons, whereas NaN != NaN evaluates to True by definition in IEEE 754. You can use isna() instead:

>>> result["b"].isna()
0    False
1    False
2     True
Name: b, dtype: bool
@WillAyd - sure, but the difference doesn't just come up when you are looking for NA values - it can impact the result of any comparison. I just gave one example.
Overview of work for the future string dtype (PDEP-14).
Main implementation:
- Rename storage="pyarrow_numpy" to storage="pyarrow", na_value=np.nan (na_value keyword in StringDtype()): #59330
- "str" alias for the NaN-variant of the dtype
- dtype=str / astype(str) works as an alias when the future mode is enabled (and any other alias which will currently convert the input to strings, like "U"?)

Testing related:
- Run the test suite with future.infer_string enabled
- Fix tests under future.infer_string (tackle all xfails / TODO(infer_string) tests)

Open design questions / behaviour changes to implement:

Known bugs that need to be fixed:
- testing.assert_frame_equal unhelpful error message for string[pyarrow] (#54190)

Documentation:
[original issue body]
With PDEP-10 (#52711, https://pandas.pydata.org/pdeps/0010-required-pyarrow-dependency.html), we decided to start using pyarrow for the default string data type in pandas 3.0.
For pandas 2.1, an option was added to already enable this future default data type; with it enabled, the various ways to construct a DataFrame (type inference in the constructor, IO methods) will use the new string dtype by default:
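A minimal sketch of enabling that option (assuming pandas >= 2.1 with pyarrow installed):

import pandas as pd

pd.options.future.infer_string = True   # opt in to the future default string dtype

df = pd.DataFrame({"a": ["x", "y"]})
print(df.dtypes)                        # the "a" column now gets the new string dtype instead of object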
This is documented at https://pandas.pydata.org/docs/dev/whatsnew/v2.1.0.html#whatsnew-210-enhancements-infer-strings
One aspect that was discussed after the PDEP (mostly at the sprint, I think; creating this issue for a better public record of it) is that a data type that would become the default in pandas 3.0 (which for the rest still uses all numpy dtypes with numpy NaN missing value semantics) should probably also still use the same default semantics, and result in numpy data types when doing operations on the string column that produce a boolean or numeric result (e.g. .str.startswith(..), .str.len(..), .str.count(), etc., or comparison operators like ==). This way, a user only gets an ArrowDtype column when explicitly asking for one, and not by default through using the default string dtype.
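A minimal sketch of that behaviour (assuming pandas 2.1 with pyarrow installed, where the variant is spelled "string[pyarrow_numpy]"; the spelling was later slated to change per the checklist above):

import pandas as pd

ser = pd.Series(["ab", "abc"], dtype="string[pyarrow_numpy]")

print(ser.str.len().dtype)            # int64 (numpy), not a pyarrow integer dtype
print(ser.str.startswith("a").dtype)  # bool (numpy), not boolean[pyarrow]
print((ser == "ab").dtype)            # comparisons also return plain numpy bool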
To achieve this, @phofl has done several PRs to refactor the current pyarrow-based string dtypes, adding another variant which uses StringDtype(storage="pyarrow_numpy") instead of ArrowDtype("string"). From the updated whatsnew: "This is a new string dtype implementation that follows NumPy semantics in comparison operations and will return np.nan as the missing value indicator". See the main PR, plus some follow-ups (#54720, #54585, #54591).
cc @pandas-dev/pandas-core