[SPARK-47876][PYTHON][DOCS] Improve docstring of mapInArrow
### What changes were proposed in this pull request?
Improve docstring of mapInArrow:

- "using a Python native function that takes and outputs a PyArrow's RecordBatch" is confusing because the function takes and outputs an ITERATOR of RecordBatches instead.
- "All columns are passed together as an iterator of pyarrow.RecordBatchs" can easily mislead users into thinking the entire DataFrame will be passed at once; "a batch of rows" is used instead.

### Why are the changes needed?
A more accurate and clearer docstring.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Doc change only.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#46088 from xinrong-meng/doc_mapInArrow.

Authored-by: Xinrong Meng <[email protected]>
Signed-off-by: Xinrong Meng <[email protected]>
xinrong-meng committed Apr 17, 2024
1 parent f9ebe1b commit 6c827c1
Showing 1 changed file with 9 additions and 10 deletions.
19 changes: 9 additions & 10 deletions python/pyspark/sql/pandas/map_ops.py
@@ -168,16 +168,14 @@ def mapInArrow(
     ) -> "DataFrame":
         """
         Maps an iterator of batches in the current :class:`DataFrame` using a Python native
-        function that takes and outputs a PyArrow's `RecordBatch`, and returns the result as a
-        :class:`DataFrame`.
+        function that is performed on `pyarrow.RecordBatch`\\s both as input and output,
+        and returns the result as a :class:`DataFrame`.
-        The function should take an iterator of `pyarrow.RecordBatch`\\s and return
-        another iterator of `pyarrow.RecordBatch`\\s. All columns are passed
-        together as an iterator of `pyarrow.RecordBatch`\\s to the function and the
-        returned iterator of `pyarrow.RecordBatch`\\s are combined as a :class:`DataFrame`.
-        Each `pyarrow.RecordBatch` size can be controlled by
-        `spark.sql.execution.arrow.maxRecordsPerBatch`. The size of the function's input and
-        output can be different.
+        This method applies the specified Python function to an iterator of
+        `pyarrow.RecordBatch`\\s, each representing a batch of rows from the original DataFrame.
+        The returned iterator of `pyarrow.RecordBatch`\\s are combined as a :class:`DataFrame`.
+        The size of the function's input and output can be different. Each `pyarrow.RecordBatch`
+        size can be controlled by `spark.sql.execution.arrow.maxRecordsPerBatch`.
         .. versionadded:: 3.3.0

@@ -190,7 +188,8 @@ def mapInArrow(
             the return type of the `func` in PySpark. The value can be either a
             :class:`pyspark.sql.types.DataType` object or a DDL-formatted type string.
         barrier : bool, optional, default False
-            Use barrier mode execution.
+            Use barrier mode execution, ensuring that all Python workers in the stage will be
+            launched concurrently.
            .. versionadded: 3.5.0