refactor: Skipping slicing on shuffle arrays in shuffle reader #189

Merged: 2 commits, Mar 11, 2024

@viirya viirya commented Mar 11, 2024

Which issue does this PR close?

Closes #.

Rationale for this change

Because we have set shuffle batch size in shuffle writer, we don't need to slice shuffle arrays at shuffle reader now.

What changes are included in this PR?

How are these changes tested?

@viirya viirya changed the title refactor: Skipping slicing on shuffle arrays refactor: Skipping slicing on shuffle arrays in shuffle reader Mar 11, 2024
@sunchao sunchao left a comment

> Because we have set shuffle batch size in shuffle writer, we don't need to slice shuffle arrays at shuffle reader now.

Do you mean here?

viirya commented Mar 11, 2024

> Because we have set shuffle batch size in shuffle writer, we don't need to slice shuffle arrays at shuffle reader now.
>
> Do you mean here?

Yes. The native shuffle writer sets the batch size too.

sunchao commented Mar 11, 2024

In that particular place, the batch size is configurable through spark.comet.columnar.shuffle.batch.size, which could be different from the batch size in other places. I'm not sure if this works in those cases.

viirya commented Mar 11, 2024

> In that particular place, the batch size is configurable through spark.comet.columnar.shuffle.batch.size, which could be different from the batch size in other places. I'm not sure if this works in those cases.

Although it is a separate config, I don't think it should differ much from the batch size. Actually, I'm wondering whether we can just use the batch size as the shuffle batch size to simplify things.

At the very least, I don't think we would set the shuffle batch size larger than the batch size (given the original purpose of the config, that doesn't make sense to me). And for a shuffle batch size smaller than the batch size, there is no reason to slice at all.
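The argument above can be illustrated with a small sketch. This is not Comet's reader or writer code; `slice_ranges`, `write_batches`, and `reader_slices` are hypothetical helpers that only model row counts, assuming the writer caps each batch at the shuffle batch size and the reader would slice anything larger than the batch size.

```python
from typing import List, Tuple


def slice_ranges(num_rows: int, max_batch_size: int) -> List[Tuple[int, int]]:
    """Split num_rows into (offset, length) chunks of at most max_batch_size rows."""
    if max_batch_size <= 0:
        raise ValueError("max_batch_size must be positive")
    return [
        (offset, min(max_batch_size, num_rows - offset))
        for offset in range(0, num_rows, max_batch_size)
    ]


def write_batches(total_rows: int, shuffle_batch_size: int) -> List[int]:
    """Hypothetical writer: emit batch row counts capped at shuffle_batch_size."""
    return [length for _, length in slice_ranges(total_rows, shuffle_batch_size)]


def reader_slices(written: List[int], batch_size: int) -> List[List[Tuple[int, int]]]:
    """Hypothetical reader: the slices it would produce for each written batch."""
    return [slice_ranges(num_rows, batch_size) for num_rows in written]


# Shuffle batch size <= batch size: every written batch yields exactly one
# slice covering the whole batch, so reader-side slicing is a no-op.
no_op = reader_slices(write_batches(20000, 8192), 8192)

# Shuffle batch size > batch size: the first written batch is split in two,
# which is the case raised in the review.
split = reader_slices(write_batches(20000, 16384), 8192)
```

Under these assumptions, slicing at the reader changes anything only when the shuffle batch size exceeds the batch size.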

sunchao commented Mar 11, 2024

Can we at least add a note to that configuration? At the moment, nothing stops the config from being set to a value larger than the batch size.

viirya commented Mar 11, 2024

> Can we at least add a note to that configuration? At the moment, nothing stops the config from being set to a value larger than the batch size.

Okay. That is better.

viirya commented Mar 11, 2024

Added a note to COMET_COLUMNAR_SHUFFLE_BATCH_SIZE.
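The PR only documents the constraint in a note; nothing in this thread says it is enforced. As a rough sketch of what a user-side check might look like, the helper below compares the two settings in a plain mapping of configuration values. The function name is hypothetical, and the general batch size key `spark.comet.batchSize` is an assumption, since only `spark.comet.columnar.shuffle.batch.size` is named in the discussion.

```python
from typing import Mapping, Optional

# Named in the discussion above.
SHUFFLE_BATCH_SIZE_KEY = "spark.comet.columnar.shuffle.batch.size"
# Assumption: the key for the general batch size is not given in this thread.
BATCH_SIZE_KEY = "spark.comet.batchSize"


def shuffle_batch_size_warning(conf: Mapping[str, str]) -> Optional[str]:
    """Return a warning if the shuffle batch size exceeds the batch size.

    Returns None when either setting is absent, because this sketch does not
    assume any default values.
    """
    if SHUFFLE_BATCH_SIZE_KEY not in conf or BATCH_SIZE_KEY not in conf:
        return None
    shuffle_size = int(conf[SHUFFLE_BATCH_SIZE_KEY])
    batch_size = int(conf[BATCH_SIZE_KEY])
    if shuffle_size > batch_size:
        return (
            f"{SHUFFLE_BATCH_SIZE_KEY}={shuffle_size} is larger than "
            f"{BATCH_SIZE_KEY}={batch_size}"
        )
    return None


ok = shuffle_batch_size_warning(
    {SHUFFLE_BATCH_SIZE_KEY: "4096", BATCH_SIZE_KEY: "8192"}
)
too_large = shuffle_batch_size_warning(
    {SHUFFLE_BATCH_SIZE_KEY: "16384", BATCH_SIZE_KEY: "8192"}
)
```

Here `ok` is `None` and `too_large` carries a message naming both settings.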

@codecov-commenter

Codecov Report

Attention: Patch coverage is 25.00000%, with 3 lines in your changes missing coverage. Please review.

Project coverage is 33.30%. Comparing base (488c523) to head (7c0f0da).
Report is 2 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| .../comet/execution/shuffle/ArrowReaderIterator.scala | 0.00% | 3 Missing ⚠️ |

Additional details and impacted files
@@             Coverage Diff              @@
##               main     #189      +/-   ##
============================================
+ Coverage     33.29%   33.30%   +0.01%     
- Complexity      766      767       +1     
============================================
  Files           107      107              
  Lines         35385    35372      -13     
  Branches       7658     7657       -1     
============================================
+ Hits          11781    11782       +1     
+ Misses        21157    21144      -13     
+ Partials       2447     2446       -1     

@viirya viirya merged commit 4fec40e into apache:main Mar 11, 2024
19 checks passed
viirya commented Mar 11, 2024

Merged. Thanks.

@viirya viirya deleted the remove_jvm_shuffle_slice branch March 11, 2024 21:20