perf: Stop performing deep copies when CometScan source is an exchange #1097

Closed
wants to merge 1 commit into from

@andygrove andygrove commented Nov 19, 2024

Which issue does this PR close?

N/A

Rationale for this change

Because our current Parquet decoding logic reuses mutable buffers, we have to be careful to perform a deep copy before calling operators that could cache data. We do this by wrapping ScanExec in a CopyExec.

However, we use ScanExec for reading from Parquet scans and for reading from exchanges (broadcast and shuffle). We only need to perform deep copies in the Parquet case.
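To make the buffer-reuse hazard concrete, here is a minimal standalone sketch. It does not use Comet or Arrow types: `ReusingDecoder` and `first_batch_as_cached` are hypothetical names invented for illustration, and a `Vec<i32>` stands in for a decoded batch. It shows only why an operator that caches a shallow handle can later see overwritten data, while a deep copy stays stable.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Hypothetical stand-in for a decoder that reuses one mutable buffer
/// for every batch it produces (not a Comet type).
struct ReusingDecoder {
    buffer: Rc<RefCell<Vec<i32>>>,
}

impl ReusingDecoder {
    fn new() -> Self {
        Self { buffer: Rc::new(RefCell::new(Vec::new())) }
    }

    /// Overwrites the shared buffer in place and returns another handle to it.
    fn next_batch(&mut self, values: &[i32]) -> Rc<RefCell<Vec<i32>>> {
        {
            let mut buf = self.buffer.borrow_mut();
            buf.clear();
            buf.extend_from_slice(values);
        }
        Rc::clone(&self.buffer)
    }
}

/// Returns what a caching operator would read for the FIRST batch after the
/// decoder has already produced a second one:
/// (contents seen through a shallow handle, contents of a deep copy).
fn first_batch_as_cached() -> (Vec<i32>, Vec<i32>) {
    let mut decoder = ReusingDecoder::new();

    let first = decoder.next_batch(&[1, 2, 3]);
    let shallow = Rc::clone(&first); // cheap: shares the decoder's buffer
    let deep: Vec<i32> = first.borrow().clone(); // costly: owns its own data

    // Producing the next batch overwrites the shared buffer in place.
    let _second = decoder.next_batch(&[7, 8, 9]);

    let seen_through_shallow = shallow.borrow().clone();
    (seen_through_shallow, deep)
}
```

In this sketch the shallow handle ends up reading the second batch's values, which is the failure a deep copy guards against. A source that hands out fresh buffers for each batch and never mutates them afterwards does not have this problem, so a cheap clone is sufficient there.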

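Here is the plan output before and after the change, for a ScanExec that reads from a shuffle exchange. The elapsed_compute reported for CopyExec drops from 432.747µs to 9.006µs: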

Before

CopyExec [UnpackOrDeepCopy], metrics=[output_rows=510412, elapsed_compute=432.747µs]
  ScanExec: source=[ShuffleQueryStage (unknown), Statistics(sizeInBytes=223.2 MiB, ...

After

CopyExec [UnpackOrClone], metrics=[output_rows=510412, elapsed_compute=9.006µs]
  ScanExec: source=[ShuffleQueryStage (unknown), Statistics(sizeInBytes=223.2 MiB, ...

What changes are included in this PR?

Use CopyMode::UnpackOrClone instead of CopyMode::UnpackOrDeepCopy when wrapping a CometScan that is reading from an exchange.
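The selection rule described above can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: the `CopyMode` variant names are taken from the description, but the enum definition, the `ScanSource` enum, and the `copy_mode_for` function are hypothetical.

```rust
/// Copy strategies named in this PR. The definition here is illustrative
/// and is not copied from the Comet source.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CopyMode {
    UnpackOrDeepCopy,
    UnpackOrClone,
}

/// Hypothetical classification of where a scan's batches come from.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ScanSource {
    Parquet,
    ShuffleExchange,
    BroadcastExchange,
}

/// Hypothetical helper: picks the copy strategy for the CopyExec that
/// wraps a scan, based on the scan's source.
fn copy_mode_for(source: ScanSource) -> CopyMode {
    match source {
        // Parquet decoding reuses mutable buffers, so downstream operators
        // that cache data need their own copy.
        ScanSource::Parquet => CopyMode::UnpackOrDeepCopy,
        // Per the rationale, only the Parquet case needs a deep copy;
        // exchange-backed scans can use the cheaper mode.
        ScanSource::ShuffleExchange | ScanSource::BroadcastExchange => {
            CopyMode::UnpackOrClone
        }
    }
}
```

Keeping the decision in one function of the source type would make it easy to confirm that Parquet-backed scans still get a deep copy.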

How are these changes tested?

Manually.

@andygrove andygrove closed this Dec 2, 2024
@andygrove andygrove deleted the reduce-copies branch January 14, 2025 18:37