
GH-39914: [pyarrow] Reorder to_pandas extension dtype mapping #44720

Open · wants to merge 7 commits into main
Conversation

bretttully

@bretttully bretttully commented Nov 14, 2024

Rationale for this change

This is a long-standing pandas ticket with some fairly horrible workarounds, where complex arrow types do not serialise well to pandas because the pandas metadata string is not parseable. However, types_mapper always had the highest priority, as it overrode whatever was set before.

What changes are included in this PR?

Switching the logical ordering means that we don't need to call _pandas_api.pandas_dtype(dtype) when using the pyarrow backend, thus resolving the issue of complex dtypes with list or struct. It will likely still fail if the numpy backend is used, but at least this gives a working solution rather than an inability to load files at all.

Are these changes tested?

Existing tests stay unchanged, and a new test for the complex type has been added.

Are there any user-facing changes?

This PR contains a "Critical Fix".
This makes pd.read_parquet(..., dtype_backend="pyarrow") work with complex data types where the metadata added by pyarrow during pd.to_parquet is not serialisable and currently throws an exception. This issue currently prevents the use of pyarrow as the default backend for pandas.

Addresses pandas-dev/pandas#53011

`types_mapper` always had the highest priority, as it overrode what was set before. However, switching the logical ordering means that we don't need to call `_pandas_api.pandas_dtype(dtype)` when using the pyarrow backend, resolving the issue of complex `dtype`s with `list` or `struct`.

❌ GitHub issue #53011 could not be retrieved.

@github-actions github-actions bot added the awaiting review Awaiting review label Nov 14, 2024
@bretttully bretttully changed the title GH-53011: [pyarrow] Reorder to_pandas extension dtype mapping GH-39914: [pyarrow] Reorder to_pandas extension dtype mapping Nov 14, 2024

⚠️ GitHub issue #39914 has been automatically assigned in GitHub to PR creator.

@jorisvandenbossche
Member

By switching the logical ordering, it means that we don't need to call _pandas_api.pandas_dtype(dtype) when using the pyarrow backend,

And because you added a `name not in ext_columns` check to the subsequent methods that fill ext_columns, this should preserve the priority of the different methods for determining the pandas dtype? (metadata < pyarrow extension type < types_mapper)

Comment on lines 4432 to 4436
# Round trip df0 into df1
with io.BytesIO() as stream:
df0.to_parquet(stream, schema=schema)
stream.seek(0)
df1 = pd.read_parquet(stream, dtype_backend="pyarrow")
Member


You might not need the roundtrip to parquet, but a table = pa.table(df); result = table.to_pandas(types_mapper=pd.ArrowDtype) should be sufficient to test this?

I know this doesn't test exactly pd.read_parquet in its entirety, but it should test the relevant part on the pyarrow side, and an actual pd.read_parquet test can still be added to pandas

Author


The error only gets thrown once the pandas metadata is added to the table. That's why I have used a round-trip test. Is there another way to generate that metadata and set it on the table before calling to_pandas?

Member


The metadata gets added on the pyarrow side, so table = pa.table(df) will do that

Member


The pandas to_parquet method essentially just does a table = pa.Table.from_pandas(df) and then writes that to parquet (pa.table(df) is a shorter, less explicit version of that, but you can also use Table.from_pandas).

Author


Addressed in 1c23076

@bretttully
Author

By switching the logical ordering, it means that we don't need to call _pandas_api.pandas_dtype(dtype) when using the pyarrow backend,

And because you added a `name not in ext_columns` check to the subsequent methods that fill ext_columns, this should preserve the priority of the different methods for determining the pandas dtype? (metadata < pyarrow extension type < types_mapper)

Yes, exactly. Priority remains the same, but functions are skipped if the field already has a type, meaning that the code causing the error is no longer called if types_mapper is provided.
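The ordering being described can be sketched in plain Python (hypothetical names, not the actual pyarrow internals): each source of dtypes fills `ext_columns` in priority order, and lower-priority sources skip any column already mapped.

```python
def resolve_ext_columns(fields, types_mapper=None, metadata_dtypes=None):
    """Toy model: map column name -> pandas dtype, highest priority first.

    fields: list of (name, arrow_type) pairs
    types_mapper: optional callable arrow_type -> dtype (or None)
    metadata_dtypes: optional dict of dtypes parsed from pandas metadata
    """
    ext_columns = {}
    # Highest priority: types_mapper (e.g. pd.ArrowDtype) runs first.
    if types_mapper is not None:
        for name, arrow_type in fields:
            dtype = types_mapper(arrow_type)
            if dtype is not None:
                ext_columns[name] = dtype
    # Lower priority: metadata-based lookup, skipped for columns already
    # mapped -- so its (possibly failing) dtype parsing is never reached
    # when types_mapper handled the column.
    if metadata_dtypes is not None:
        for name, _ in fields:
            if name not in ext_columns and name in metadata_dtypes:
                ext_columns[name] = metadata_dtypes[name]
    return ext_columns


fields = [("x", "list<item: int64>")]
mapped = resolve_ext_columns(
    fields,
    types_mapper=lambda t: f"ArrowDtype[{t}]",
    metadata_dtypes={"x": "unparseable-metadata-string"},
)
# mapped == {"x": "ArrowDtype[list<item: int64>]"}
```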

@jorisvandenbossche
Member

The test_dlpack failure in the tests you can ignore (#44728)

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Nov 19, 2024
Member

@jorisvandenbossche jorisvandenbossche left a comment


Looks good! I triggered CI again.

python/pyarrow/tests/test_pandas.py (comment resolved)
@bretttully
Author

Thanks @jorisvandenbossche -- is the process that I can merge this following approval, or is that done by a core maintainer?

@raulcd
Member

raulcd commented Nov 20, 2024

is the process that I can merge this following approval, or is that done by a core maintainer?

A committer will merge, probably @jorisvandenbossche in this specific case, once everything is running and addressed. I've triggered CI for the latest changes.

@jorisvandenbossche
Member

@github-actions crossbow submit -g python

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Nov 20, 2024

Revision: e3b9892

Submitted crossbow builds: ursacomputing/crossbow @ actions-e01b93275b

Task Status
example-python-minimal-build-fedora-conda GitHub Actions
example-python-minimal-build-ubuntu-venv GitHub Actions
test-conda-python-3.10 GitHub Actions
test-conda-python-3.10-cython2 GitHub Actions
test-conda-python-3.10-hdfs-2.9.2 GitHub Actions
test-conda-python-3.10-hdfs-3.2.1 GitHub Actions
test-conda-python-3.10-pandas-latest-numpy-latest GitHub Actions
test-conda-python-3.10-substrait GitHub Actions
test-conda-python-3.11 GitHub Actions
test-conda-python-3.11-dask-latest GitHub Actions
test-conda-python-3.11-dask-upstream_devel GitHub Actions
test-conda-python-3.11-hypothesis GitHub Actions
test-conda-python-3.11-pandas-latest-numpy-1.26 GitHub Actions
test-conda-python-3.11-pandas-latest-numpy-latest GitHub Actions
test-conda-python-3.11-pandas-nightly-numpy-nightly GitHub Actions
test-conda-python-3.11-pandas-upstream_devel-numpy-nightly GitHub Actions
test-conda-python-3.11-spark-master GitHub Actions
test-conda-python-3.12 GitHub Actions
test-conda-python-3.12-cpython-debug GitHub Actions
test-conda-python-3.13 GitHub Actions
test-conda-python-3.9 GitHub Actions
test-conda-python-3.9-pandas-1.1.3-numpy-1.19.5 GitHub Actions
test-conda-python-emscripten GitHub Actions
test-cuda-python-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-debian-12-python-3-amd64 GitHub Actions
test-debian-12-python-3-i386 GitHub Actions
test-fedora-39-python-3 GitHub Actions
test-ubuntu-22.04-python-3 GitHub Actions
test-ubuntu-22.04-python-313-freethreading GitHub Actions
test-ubuntu-24.04-python-3 GitHub Actions

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Nov 21, 2024
@jorisvandenbossche
Member

@raulcd it seems something is going wrong with the minimal test builds (e.g. example-python-minimal-build-fedora-conda). The logs indicate "Successfully installed pyarrow-0.1.dev16896+ge3b9892", which then messes up pandas' detection of the pyarrow version (for its pyarrow integration, pandas checks whether pyarrow is recent enough and otherwise errors), giving some test failures.

(but also not entirely sure how this PR causes this issue, since I don't see the nightlies fail for the minimal builds at the moment)

@jorisvandenbossche
Member

(the other failures are the known nightly dlpack failures)

@raulcd
Member

raulcd commented Nov 21, 2024

The logs indicate "Successfully installed pyarrow-0.1.dev16896+ge3b9892"

From the git checkout I see it is pulling from the remote on Syncing repository: bretttully/arrow. I recall an issue where, if dev tags are not present, we are unable to detect the correct version. The remote doesn't seem to have other branches and/or tags.

@raulcd
Member

raulcd commented Nov 21, 2024

I've opened an issue because we should find a way to not fail if the dev tag is not present:

@jorisvandenbossche
Member

Thanks for investigating that!

So then to resolve this here, @bretttully should fetch the upstream tags and push that to his fork? Something like

git fetch upstream
git push origin --tags

(assuming upstream is apache/arrow and origin is bretttully/arrow)

@bretttully
Author

I have merged upstream/main and pushed tags. Let's see if this works...

@raulcd
Member

raulcd commented Nov 21, 2024

@github-actions crossbow submit example-python-minimal-build-*


Revision: 685167f

Submitted crossbow builds: ursacomputing/crossbow @ actions-524e782c26

Task Status
example-python-minimal-build-fedora-conda GitHub Actions
example-python-minimal-build-ubuntu-venv GitHub Actions
