
fix: Reuse vector in LocalPartition #12002

Status: Closed · wants to merge 1 commit
Conversation

@Yuhta (Contributor) commented Jan 2, 2025

Summary:
More than 10% of CPU time is spent destroying the local partition output when the load is high.

Also add some optimizations for serialization.

Differential Revision: D67742489
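The vector-reuse idea can be sketched outside of Velox as a minimal free-list pool (all names here are hypothetical illustrations, not the Velox API): instead of destroying output vectors after each batch, the operator parks them and hands them back out for the next batch, amortizing allocation and destruction cost.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical sketch, not the Velox implementation: a tiny pool that
// keeps recently released vectors alive so the next batch can reuse
// their capacity instead of reallocating.
class VectorPool {
 public:
  // Returns a reused vector if one is available, otherwise a fresh one.
  std::unique_ptr<std::vector<int>> pop() {
    if (!pool_.empty()) {
      auto v = std::move(pool_.back());
      pool_.pop_back();
      v->clear();  // Drop old contents but keep the allocated capacity.
      return v;
    }
    return std::make_unique<std::vector<int>>();
  }

  // Returns a vector to the pool instead of destroying it.
  void push(std::unique_ptr<std::vector<int>> v) {
    if (pool_.size() < kMaxPooled) {
      pool_.push_back(std::move(v));
    }  // Otherwise let the unique_ptr destroy it on scope exit.
  }

 private:
  static constexpr size_t kMaxPooled = 8;
  std::vector<std::unique_ptr<std::vector<int>>> pool_;
};
```

Bounding the pool size (`kMaxPooled` here) keeps the reuse win without letting peak memory grow unbounded under bursty load.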

@facebook-github-bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Jan 2, 2025.
netlify bot commented Jan 2, 2025

Deploy Preview for meta-velox canceled.
Latest commit: 0891cb0
Latest deploy log: https://app.netlify.com/sites/meta-velox/deploys/6781817ed0a26e00089f2c3b

@facebook-github-bot (Contributor) commented:

This pull request was exported from Phabricator. Differential Revision: D67742489

@xiaoxmeng (Contributor) left a comment:

@Yuhta LGTM, and thanks for the optimization, modulo nits. It might be better to remove the current_ handling in ByteStream as discussed offline, since it could cause tricky bugs in the future.

```cpp
}

void appendBool(bool value, int32_t count);

/// Fast path used when appending one null in vector serialization.
template <bool kValue>
void appendOneBool() {
```
Contributor:

Can we have a unit test for this? Thanks!

Contributor (Author):

I will remove the specialization in the new version; just making sure it's inlined should be enough.
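A minimal sketch of why the templated single-bit fast path helps (hypothetical `BitWriter`, not the Velox implementation): with the value as a compile-time template parameter, `if constexpr` eliminates both the runtime branch on the value and the general-purpose count loop of `appendBool(value, count)`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical illustration of the appendOneBool idea: append a single
// bit whose value is known at compile time, so the set/skip decision
// costs nothing at runtime.
class BitWriter {
 public:
  template <bool kValue>
  void appendOneBool() {
    const uint32_t byte = numBits_ / 8;
    if (byte >= bits_.size()) {
      bits_.push_back(0);  // New bytes start zeroed.
    }
    if constexpr (kValue) {
      bits_[byte] |= uint8_t(1) << (numBits_ % 8);
    }
    // For kValue == false there is nothing to set.
    ++numBits_;
  }

  bool bitAt(uint32_t i) const {
    return (bits_[i / 8] >> (i % 8)) & 1;
  }

 private:
  std::vector<uint8_t> bits_;
  uint32_t numBits_ = 0;
};
```

As the author notes above, a plain non-templated version can be just as fast once inlined, since the value becomes a constant at the call site anyway.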

```
@@ -411,8 +430,10 @@ class ByteOutputStream {
  // The total number of bytes allocated from 'arena_' in 'ranges_'.
  int64_t allocatedBytes_{0};

  // Pointer to the current element of 'ranges_'.
  ByteRange* current_{nullptr};
  // Copy of the current element in 'ranges_'. This is copied to avoid memory
```
Contributor:

I am not sure we want this optimization until we see it cause a noticeable regression on an actual query. Thanks!

Contributor (Author):

We do see a few percent (~3%) improvement in the E2E query from removing the 2 extra hops.
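The pointer-to-copy change can be illustrated with a simplified stream (hypothetical types, not the actual ByteOutputStream): keeping the active `ByteRange` by value means the hot append path reads fields directly off the stream object instead of chasing a pointer into `ranges_`, at the cost of writing the copy back before `ranges_` is read elsewhere.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch of caching the current range by value.
struct ByteRange {
  uint8_t* buffer;
  int32_t size;      // Writable bytes in 'buffer'.
  int32_t position;  // Next write offset.
};

class Stream {
 public:
  Stream(uint8_t* buf, int32_t size) {
    ranges_.push_back({buf, size, 0});
    current_ = ranges_.back();  // Copy, not a pointer into ranges_.
  }

  void appendByte(uint8_t b) {
    // Hot path: no dependent load through a ByteRange* into ranges_.
    current_.buffer[current_.position++] = b;
  }

  // The copy must be written back before ranges_ is consulted elsewhere.
  void flush() {
    ranges_.back() = current_;
  }

  int32_t bytesWritten() const {
    return ranges_.back().position;
  }

 private:
  std::vector<ByteRange> ranges_;
  ByteRange current_;
};
```

The write-back requirement is exactly the "tricky bug in the future" risk the reviewer flags above: any code path that reads `ranges_` while `current_` is stale sees the wrong position.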

Additional review threads (now resolved) on velox/common/memory/ByteStream.h and velox/exec/LocalPartition.cpp.
Yuhta added a commit to Yuhta/velox that referenced this pull request Jan 10, 2025
Summary:

X-link: facebookincubator/nimble#122

More than 10% of CPU time is spent destroying the local partition output when the load is high.

Also add some optimizations for serialization. The optimization to `ByteOutputStream::appendBool` does not show a significant gain in the example query (because there are a lot of small batches), but it is a net gain and would be significant for large batches, so I leave it in the code.

Differential Revision: D67742489
@xiaoxmeng (Contributor) left a comment:


@Yuhta LGTM. thanks!

```cpp
vectorPool.push(makeVector(), 3);
vectorPool.push(makeVector(), 1);
auto vector = vectorPool.pop();
ASSERT_TRUE(vector);
```
Contributor:

nit: ASSERT_TRUE(vector != nullptr);

```cpp
ASSERT_TRUE(vector);
ASSERT_EQ(vector.get(), vectors[0]);
vector = vectorPool.pop();
ASSERT_TRUE(vector);
```
Contributor:

ditto

facebook-github-bot pushed a commit to facebookincubator/nimble that referenced this pull request Jan 10, 2025
Summary:
X-link: facebookincubator/velox#12002

Pull Request resolved: #122

More than 10% of CPU time is spent destroying the local partition output when the load is high.

Also add some optimizations for serialization. The optimization to `ByteOutputStream::appendBool` does not show a significant gain in the example query (because there are a lot of small batches), but it is a net gain and would be significant for large batches, so I leave it in the code.

Reviewed By: xiaoxmeng

Differential Revision: D67742489

fbshipit-source-id: 8e70dd128f31caa7909ed7c1e2b4ac1e59d7c87d
@facebook-github-bot commented:

This pull request has been merged in 9dcfd39.

Labels: CLA Signed · fb-exported · Merged
3 participants