fix: Reuse vector in LocalPartition #12002
This pull request was exported from Phabricator. Differential Revision: D67742489
Force-pushed from 232444d to d4fbc7e.
@Yuhta LGTM modulo nits, and thanks for the optimization. As discussed offline, it might be better to remove the `current_` handling in ByteStream, since it seems likely to cause tricky bugs in the future.
velox/common/memory/ByteStream.h (outdated)
  }

  void appendBool(bool value, int32_t count);

  /// Fast path used by appending one null in vector serialization.
  template <bool kValue>
  void appendOneBool() {
Can we have a unit test for this? Thanks!
I will remove the specialization in the new version; just making sure it's inlined should be enough.
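For readers following this thread, here is a minimal sketch of the trade-off being discussed: a general `appendBool(value, count)` next to a single-bit path that is defined in the class body (and is therefore implicitly inline) rather than specialized on the value through a template parameter. `BitWriter`, `appendOne`, `bitAt`, and `numBits` are hypothetical names for illustration; this is not the Velox `ByteOutputStream` implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical bit writer for illustration only; not Velox's ByteOutputStream.
class BitWriter {
 public:
  // General path: appends 'count' copies of 'value'.
  void appendBool(bool value, int32_t count) {
    assert(count >= 0);
    for (int32_t i = 0; i < count; ++i) {
      appendOne(value);
    }
  }

  // Single-bit path. Defined in the class body, so it is implicitly inline
  // and needs no template specialization on the value.
  void appendOne(bool value) {
    const size_t byteIndex = numBits_ / 8;
    if (byteIndex == bytes_.size()) {
      bytes_.push_back(0); // Start a new zero-filled byte.
    }
    if (value) {
      bytes_[byteIndex] |= static_cast<uint8_t>(1u << (numBits_ % 8));
    }
    ++numBits_;
  }

  // Returns the bit stored at 'index' (bits are packed LSB first).
  bool bitAt(size_t index) const {
    assert(index < numBits_);
    return ((bytes_[index / 8] >> (index % 8)) & 1) != 0;
  }

  size_t numBits() const {
    return numBits_;
  }

 private:
  std::vector<uint8_t> bytes_;
  size_t numBits_{0};
};
```

A caller that serializes one null at a time would call `appendOne(true)` or `appendOne(false)` directly; whether the compiler actually inlines the call still has to be checked in the generated code or a profile.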
velox/common/memory/ByteStream.h (outdated)
@@ -411,8 +430,10 @@ class ByteOutputStream {
  // The total number of bytes allocated from 'arena_' in 'ranges_'.
  int64_t allocatedBytes_{0};

  // Pointer to the current element of 'ranges_'.
  ByteRange* current_{nullptr};
  // Copy of the current element in 'ranges_'. This is copied to avoid memory
I am not very sure we want this optimization until we see it cause a noticeable regression on an actual query. Thanks!
We do see an improvement of a few percent (~3%) in the E2E query from removing the 2 extra hops.
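To illustrate the idea under review, the sketch below keeps a by-value copy of the active range so the hot write path does not dereference a pointer into the range list, and writes the copy back only when switching ranges. It assumes the "extra hops" are pointer indirections, which the thread does not spell out. `Range`, `CachedRangeWriter`, `flush`, and `positionOf` are hypothetical names, not the Velox definitions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a byte range; not Velox's ByteRange.
struct Range {
  uint8_t* buffer;
  int32_t size;
  int32_t position;
};

// Hypothetical writer that caches the active range by value.
class CachedRangeWriter {
 public:
  void addRange(uint8_t* buffer, int32_t size) {
    ranges_.push_back(Range{buffer, size, 0});
    if (ranges_.size() == 1) {
      current_ = ranges_[0];
    }
  }

  // Hot path: reads and writes only the cached copy 'current_'.
  bool appendByte(uint8_t value) {
    while (current_.position == current_.size) {
      if (!advance()) {
        return false; // No space left in any range.
      }
    }
    current_.buffer[current_.position++] = value;
    return true;
  }

  // Writes the cached state back into 'ranges_'. Until this runs, the stored
  // range is stale, which is the kind of subtle state a copy introduces.
  void flush() {
    if (!ranges_.empty()) {
      ranges_[index_] = current_;
    }
  }

  int32_t positionOf(size_t i) const {
    return ranges_[i].position;
  }

 private:
  // Saves the cached range and loads the next one, if any.
  bool advance() {
    flush();
    if (index_ + 1 >= ranges_.size()) {
      return false;
    }
    current_ = ranges_[++index_];
    return true;
  }

  std::vector<Range> ranges_;
  size_t index_{0};
  Range current_{nullptr, 0, 0};
};
```

The cost of the copy is that `ranges_` and `current_` can disagree between flushes, so every code path that reads `ranges_` must flush first. That is one way the concern about tricky future bugs could materialize.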
Summary: X-link: facebookincubator/nimble#122
More than 10% of the CPU is spent on the destruction of local partition output when the load is high.
Also add some optimizations for serialization.
The optimization on `ByteOutputStream::appendBool` does not show a significant gain in the example query (because there are a lot of small batches), but it is a net gain and would be significant for large batches, so I leave it in the code.
Differential Revision: D67742489
Force-pushed from d4fbc7e to d377ecb.
@Yuhta LGTM. thanks!
vectorPool.push(makeVector(), 3);
vectorPool.push(makeVector(), 1);
auto vector = vectorPool.pop();
ASSERT_TRUE(vector);
nit: ASSERT_TRUE(vector != nullptr);
ASSERT_TRUE(vector);
ASSERT_EQ(vector.get(), vectors[0]);
vector = vectorPool.pop();
ASSERT_TRUE(vector);
ditto
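For context on the test snippets above, here is a rough sketch of a size-bounded pool that retains finished batches for reuse instead of destroying them. It assumes the second argument to `push` is a size counted against a fixed capacity and that `pop` returns entries in FIFO order; neither is confirmed by this thread. `BoundedPool`, `Batch`, and `BatchPtr` are hypothetical names, and the sketch is single-threaded with no locking.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-ins for a row vector and its shared pointer.
using Batch = std::vector<int64_t>;
using BatchPtr = std::shared_ptr<Batch>;

// Hypothetical size-bounded reuse pool; not thread-safe.
class BoundedPool {
 public:
  explicit BoundedPool(int64_t capacity) : capacity_(capacity) {}

  // Retains 'batch' for reuse if nobody else references it and it fits in
  // the remaining capacity. Returns true if the batch was retained.
  bool push(BatchPtr batch, int64_t size) {
    if (batch == nullptr || batch.use_count() != 1 || size < 0 ||
        totalSize_ + size > capacity_) {
      return false; // The batch is released when 'batch' goes out of scope.
    }
    totalSize_ += size;
    entries_.emplace_back(std::move(batch), size);
    return true;
  }

  // Returns the oldest retained batch, or nullptr if the pool is empty.
  BatchPtr pop() {
    if (entries_.empty()) {
      return nullptr;
    }
    auto entry = std::move(entries_.front());
    entries_.pop_front();
    totalSize_ -= entry.second;
    return std::move(entry.first);
  }

 private:
  const int64_t capacity_;
  int64_t totalSize_{0};
  std::deque<std::pair<BatchPtr, int64_t>> entries_;
};
```

Because `pop` can return `nullptr`, callers should compare against `nullptr` explicitly, as suggested in the nit above, and fall back to allocating a fresh batch when the pool is empty.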
Force-pushed from d377ecb to 0891cb0.
Summary: X-link: facebookincubator/velox#12002
Pull Request resolved: #122
More than 10% of the CPU is spent on the destruction of local partition output when the load is high.
Also add some optimizations for serialization.
The optimization on `ByteOutputStream::appendBool` does not show a significant gain in the example query (because there are a lot of small batches), but it is a net gain and would be significant for large batches, so I leave it in the code.
Reviewed By: xiaoxmeng
Differential Revision: D67742489
fbshipit-source-id: 8e70dd128f31caa7909ed7c1e2b4ac1e59d7c87d
This pull request has been merged in 9dcfd39.
Summary:
More than 10% of the CPU is spent on the destruction of local partition output when the load is high.
Also add some optimizations for serialization.
Differential Revision: D67742489