Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Shuffle write time metric is wrong #5731

Closed
hahazyb201 opened this issue May 13, 2024 · 4 comments · Fixed by #5742
Closed

Shuffle write time metric is wrong #5731

hahazyb201 opened this issue May 13, 2024 · 4 comments · Fixed by #5742
Labels
bug Something isn't working triage

Comments

@hahazyb201
Copy link
Contributor

Backend

VL (Velox)

Bug description

In the DAG, when I observe the "shuffle write time total" metric, I found it was much bigger than I expected. So I dive deep into the gluten code and found that the writeTime_ was added twice into the final metric by writeMetrics.incWriteTime.
截屏2024-05-13 17 54 33

In the VeloxCelebornHashBasedColumnarShuffleWriter.scala file, write time was calculated as the sum of splitResult.getTotalWriteTime + splitResult.getTotalPushTime. And the totalWriteTime is accumulated here by this line . The totalPushTime is accumulated here by the spillTime_ variable. And it's obvious that the spillTime_ includes writeTime_ which means writeTime_ was added twice in the final write time metric.

In order to fix it, I propose moving the ScopedTimer line a few lines down.
截屏2024-05-13 19 07 51

Let me know if you want me to open a PR. Thanks.

Spark version

Spark-3.2.x

Spark configurations

No response

System information

No response

Relevant logs

No response

@hahazyb201 hahazyb201 added bug Something isn't working triage labels May 13, 2024
@Yohahaha
Copy link
Contributor

RssPartitionWriter::stop does not count Payload#writeTime_, so I think write time does not count twice in celeborn.

@hahazyb201
Copy link
Contributor Author

Hi, I think Payload#writeTime_ is actually counted in RssPartitionWriter::stop at this line. And writeTime_ is introduced by line 20 "#include "shuffle/Payload.h"" . So it would be counted twice.

@Yohahaha
Copy link
Contributor

cc @kerwin-zk could you double check that?

@kerwin-zk
Copy link
Contributor

@hahazyb201 You can open a pr. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working triage
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants