GH-39914: [pyarrow] Reorder to_pandas extension dtype mapping #44720
base: main
Conversation
Addresses pandas-dev/pandas#53011. `types_mapper` always had highest priority, as it overrode what was set before. Switching the logical ordering means we don't need to call `_pandas_api.pandas_dtype(dtype)` when using the pyarrow backend, resolving the issue of complex `dtype` with `list` or `struct`.
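A minimal sketch of the failure mode being addressed (the column name and types are illustrative, not from the PR):

```python
import pandas as pd
import pyarrow as pa

# A list-typed column backed by ArrowDtype; the dtype string that pyarrow
# writes into the pandas metadata for such a type is not parseable by
# _pandas_api.pandas_dtype.
df = pd.DataFrame(
    {"col": pd.Series([[1, 2], [3]], dtype=pd.ArrowDtype(pa.list_(pa.int64())))}
)

table = pa.table(df)  # attaches the pandas metadata, including the dtype string
# With types_mapper consulted first, the metadata string no longer needs parsing:
result = table.to_pandas(types_mapper=pd.ArrowDtype)
```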
python/pyarrow/tests/test_pandas.py
```python
# Round trip df0 into df1
with io.BytesIO() as stream:
    df0.to_parquet(stream, schema=schema)
    stream.seek(0)
    df1 = pd.read_parquet(stream, dtype_backend="pyarrow")
```
You might not need the roundtrip to parquet; a `table = pa.table(df); result = table.to_pandas(types_mapper=pd.ArrowDtype)` should be sufficient to test this? I know this doesn't test `pd.read_parquet` in its entirety, but it should test the relevant part on the pyarrow side, and an actual `pd.read_parquet` test can still be added to pandas.
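A sketch of what that suggested test could look like (the test name and data are illustrative):

```python
import pandas as pd
import pyarrow as pa

def test_to_pandas_arrow_dtype_with_list_type():
    df = pd.DataFrame(
        {"x": pd.Series([[1], [2, 3]], dtype=pd.ArrowDtype(pa.list_(pa.int64())))}
    )
    table = pa.table(df)  # adds the pandas metadata that triggered the error
    result = table.to_pandas(types_mapper=pd.ArrowDtype)
    pd.testing.assert_frame_equal(result, df)
```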
The error only gets thrown once the pandas metadata is added to the table. That's why I have used a round-trip test. Is there another way to generate that metadata and set it on the table before calling `to_pandas`?
The metadata gets added on the pyarrow side, so `table = pa.table(df)` will do that.
The pandas `to_parquet` method essentially just does a `table = pa.Table.from_pandas(df)` and then writes that to parquet (and `pa.table(df)` is a shorter, less explicit version of that, but you can also use `Table.from_pandas`).
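For illustration, both constructors attach the pandas metadata in question (a small sketch, not from the PR):

```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"x": [1, 2]})
t1 = pa.table(df)              # shorthand
t2 = pa.Table.from_pandas(df)  # explicit equivalent
# The serialized dtype strings live under the b"pandas" key of the schema metadata.
assert b"pandas" in t1.schema.metadata
assert b"pandas" in t2.schema.metadata
```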
Addressed in 1c23076
Yes, exactly. Priority remains the same, but functions are skipped if the field already has a type, meaning that the code causing the error is no longer called if `types_mapper` is provided.
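A rough sketch of the reordered lookup described here (not the actual pyarrow source; names are illustrative):

```python
def resolve_dtype(field, types_mapper, parse_pandas_metadata_dtype):
    # types_mapper is consulted first; when it yields a dtype, the
    # (possibly unparseable) pandas-metadata dtype string is never touched.
    if types_mapper is not None:
        dtype = types_mapper(field.type)
        if dtype is not None:
            return dtype
    # Only fall back to parsing the dtype string from the pandas metadata,
    # which is where complex list/struct types used to fail.
    return parse_pandas_metadata_dtype(field)
```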
Looks good!
I triggered CI again
Thanks @jorisvandenbossche -- is the process that I can merge this following approval, or is that done by a core maintainer?
A committer will merge, probably @jorisvandenbossche in this specific case, once everything is running and addressed. I've triggered CI for the latest changes.
@github-actions crossbow submit -g python
Revision: e3b9892 Submitted crossbow builds: ursacomputing/crossbow @ actions-e01b93275b
@raulcd it seems something is going wrong with the minimal test builds (e.g. example-python-minimal-build-fedora-conda). The logs indicate "Successfully installed pyarrow-0.1.dev16896+ge3b9892", which then messes up pandas' detection of the pyarrow version (for the pyarrow integration in pandas, pandas checks whether pyarrow is recent enough and otherwise errors), giving some test failures. (But I'm also not entirely sure how this PR causes the issue, since I don't see the nightlies fail for the minimal builds at the moment.)
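Roughly why the mis-tagged dev version breaks that detection (an illustrative check, not the actual pandas source; the minimum version is an assumption):

```python
from packaging.version import Version

installed = Version("0.1.dev16896+ge3b9892")  # what the bad build reports
minimum = Version("10.0.1")                   # assumed lower bound
# A 0.1.dev... version compares below any real release, so pandas concludes
# pyarrow is too old and raises, failing the integration tests.
assert installed < minimum
```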
(the other failures are the known nightly dlpack failures)
From the git checkout I see it is pulling from the remote.
I've opened an issue because we should find a way to not fail if the dev tag is not present: |
Thanks for investigating that! So then to resolve this here, @bretttully should fetch the upstream tags and push them to his fork? Something like
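```sh
# a plausible sketch of the commands in question:
git fetch upstream --tags
git push origin --tags
```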
(assuming upstream is apache/arrow and origin is bretttully/arrow)
I have merged
@github-actions crossbow submit example-python-minimal-build-*
Revision: 685167f Submitted crossbow builds: ursacomputing/crossbow @ actions-524e782c26
Rationale for this change
This is a long-standing pandas ticket with some fairly horrible workarounds, where complex arrow types do not serialise well to pandas because the pandas metadata string is not parseable. However, `types_mapper` always had highest priority, as it overrode what was set before.
What changes are included in this PR?
By switching the logical ordering, we don't need to call `_pandas_api.pandas_dtype(dtype)` when using the pyarrow backend, thus resolving the issue of complex `dtype` with `list` or `struct`. It will likely still fail if the numpy backend is used, but at least this gives a working solution rather than an inability to load files at all.
Are these changes tested?
Existing tests stay unchanged, and a new test for the complex type has been added.
Are there any user-facing changes?
This PR contains a "Critical Fix".
This makes `pd.read_parquet(..., dtype_backend="pyarrow")` work with complex data types where the metadata added by pyarrow during `pd.to_parquet` is not serialisable and currently throws an exception. This issue currently prevents the use of pyarrow as the default backend for pandas.
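An end-to-end sketch of the path this fixes (the struct type and in-memory buffer are illustrative):

```python
import io

import pandas as pd
import pyarrow as pa

df = pd.DataFrame(
    {"s": pd.Series(
        [{"a": 1}, {"a": 2}],
        dtype=pd.ArrowDtype(pa.struct([("a", pa.int64())])),
    )}
)

buf = io.BytesIO()
df.to_parquet(buf)  # pyarrow writes pandas metadata with an unparseable dtype string
buf.seek(0)
# Previously raised while parsing that metadata; with types_mapper consulted
# first, the read now succeeds:
roundtripped = pd.read_parquet(buf, dtype_backend="pyarrow")
```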