Shuffle write time metric is wrong #5731

hahazyb201 · 2024-05-13T11:13:17Z

Backend

VL (Velox)

Bug description

While looking at the "shuffle write time total" metric in the DAG, I found that it was much larger than I expected. I dug into the Gluten code and found that writeTime_ is added twice to the final metric reported through writeMetrics.incWriteTime.

In VeloxCelebornHashBasedColumnarShuffleWriter.scala, write time is calculated as splitResult.getTotalWriteTime + splitResult.getTotalPushTime. totalWriteTime is accumulated here by this line. totalPushTime is accumulated here from the spillTime_ variable. Since spillTime_ includes writeTime_, writeTime_ ends up being added twice to the final write time metric.

In order to fix it, I propose moving the ScopedTimer line a few lines down.
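The double counting described above can be reproduced with a minimal sketch. Everything in it is an assumption made for illustration: the ScopedTimer below is a simplified stand-in rather than Gluten's actual class, the Times struct and the doWrite/doPush helpers are hypothetical, and a manually advanced tick counter replaces the real clock so the numbers are deterministic.

```cpp
#include <cstdint>

// Hypothetical clock: a manually advanced tick counter (not a real timer).
static int64_t g_now = 0;

// Simplified RAII timer: adds the elapsed ticks to *target on destruction.
class ScopedTimer {
 public:
  explicit ScopedTimer(int64_t* target) : target_(target), start_(g_now) {}
  ~ScopedTimer() { *target_ += g_now - start_; }
  ScopedTimer(const ScopedTimer&) = delete;
  ScopedTimer& operator=(const ScopedTimer&) = delete;

 private:
  int64_t* target_;
  int64_t start_;
};

// Hypothetical metric holder mirroring the two values that get summed.
struct Times {
  int64_t writeTime = 0;
  int64_t spillTime = 0;
  // The reported metric is the sum of both counters.
  int64_t reported() const { return writeTime + spillTime; }
};

// Simulated write: 30 ticks, measured by its own inner timer.
void doWrite(Times& t) {
  ScopedTimer timer(&t.writeTime);
  g_now += 30;
}

// Simulated push: 70 ticks, with no timer of its own.
void doPush() { g_now += 70; }

// Outer timer opened before the write: spillTime also covers the write.
Times timerBeforeWrite() {
  Times t;
  {
    ScopedTimer timer(&t.spillTime);
    doWrite(t);
    doPush();
  }
  return t;
}

// Outer timer moved a few lines down: spillTime covers only the push.
Times timerAfterWrite() {
  Times t;
  doWrite(t);
  {
    ScopedTimer timer(&t.spillTime);
    doPush();
  }
  return t;
}
```

In this sketch the real work is 100 ticks. timerBeforeWrite() reports 130 because the 30 write ticks are counted in both counters, while timerAfterWrite() reports 100.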

Let me know if you want me to open a PR. Thanks.

Spark version

Spark-3.2.x

Spark configurations

No response

System information

No response

Relevant logs

No response

Yohahaha · 2024-05-13T14:53:30Z

RssPartitionWriter::stop does not count Payload#writeTime_, so I don't think write time is counted twice with Celeborn.

hahazyb201 · 2024-05-14T03:00:08Z

Hi, I think Payload#writeTime_ is actually counted in RssPartitionWriter::stop at this line. writeTime_ is brought in by line 20 (#include "shuffle/Payload.h"), so it would be counted twice.

Yohahaha · 2024-05-14T06:09:17Z

cc @kerwin-zk could you double check that?

kerwin-zk · 2024-05-14T06:45:07Z

@hahazyb201 You can open a PR. Thanks!

hahazyb201 added the bug and triage labels on May 13, 2024

hahazyb201 mentioned this issue May 14, 2024

[GLUTEN-5731][CORE] Fix the logic to calculate rss shuffle write time #5742

Merged

marin-ma closed this as completed in #5742 May 15, 2024
