-
Notifications
You must be signed in to change notification settings - Fork 124
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[NATIVECPU][perf] less copies and allocations #2699
base: main
Are you sure you want to change the base?
Conversation
// TODO: avoid calling clear() here. | ||
hKernel->_localArgInfo.clear(); | ||
}); | ||
event->set_callback([event]() { event->tick_end(); }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why was the _localArgInfo.clear();
here? Was it a workaround for another issue? Is whatever reason it had for being there being addressed in a different way after your PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not clear why that call was needed, there doesn't seem to be a test that is enabled for it, and I'm not aware of any regression in any of the test pipelines because of this change.
.
I removed it also because it was not in the right place as the callback is called after the kernel was released (which may be wrong), and this just happened to not crash because the kernel pointer still seems to be valid likely due to an issue with reference counting. I've taken a note to investigate.
The call was likely related to hierarchical which doesn't seem to work properly yet anyway on this adapter.
if (groupsPerThread) { | ||
for (unsigned thread = 0; thread < numParallelThreads; thread++) { | ||
futures.emplace_back( | ||
tp.schedule_task([groups, thread, groupsPerThread, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While we're cleaning this up, it doesn't make sense that each task gets its own independent copy of groups
, it causes the lambda to be more expensive than it should be. We should be able to create groups
once, have each lambda maintain a pointer to it, and add its deletion to the cleanup code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a known issue and this PR was not going to address this. We are looking into removing the groups
variable altogether, but this would be for subsequent PR.
threadpool.schedule([=](size_t threadId) { (*workerTask)(threadId); }); | ||
template <class T> auto schedule_task(T &&task) { | ||
auto workerTask = | ||
std::make_shared<std::packaged_task<void(size_t)>>(std::move(task)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should forward. If T
is inferred as an lvalue reference, it's important to not move
because the caller isn't going to be expecting it to be changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, this is intended to always move and all argument tasks should come through as rvalues. I've added an enable_if
to this function to ensure it cannot be called if T
deduces to an lvalue reference.
@@ -63,6 +63,23 @@ static native_cpu::state getResizedState(const native_cpu::NDRDescT &ndr, | |||
} | |||
#endif | |||
|
|||
class LaunchInfoLocalArgs { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Naming wise, it might be better to have something like LaunchInfoGroup
, that's closer to what it is, I think. The fact that it calls handleLocalArgs()
is less what this class is for and more just something that happens to be needed to be done for each group.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed as suggested.
const size_t numParallelThreads; | ||
|
||
public: | ||
LaunchInfoLocalArgs(const native_cpu::state &state_, unsigned g0_, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for the _
suffixes, there is no conflict between the parameter and field names, I think we don't do that elsewhere either? For instance, in kernel.hpp
:
ur_kernel_handle_t_(ur_program_handle_t hProgram, const char *name,
nativecpu_task_t subhandler)
: hProgram(hProgram), _name{name}, _subhandler{std::move(subhandler)} {}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've removed the suffixes. It required prefixing the access to state
with this->
which is probably better for readability.
std::vector<LaunchInfoLocalArgs> groups; | ||
const auto numWG0 = ndr.GlobalSize[0] / ndr.LocalSize[0]; | ||
const auto numWG1 = ndr.GlobalSize[1] / ndr.LocalSize[1]; | ||
const auto numWG2 = ndr.GlobalSize[2] / ndr.LocalSize[2]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not directly related to this PR, but do we require global size to be a multiple of local size (for each dimension)? If so, an assert would be useful. If not, this looks suspicious. Either way, no need to change that as part of this PR unless you want to, if changes are needed we can do that in a new PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've taken a note to investigate in another PR.
This PR removes some unnecessary copies and allocations from kernel launches in the NativeCPU adapter.
Corresponding DPC++ PR: intel/llvm#16990