Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NATIVECPU][perf] less copies and allocations #2699

Open
wants to merge 20 commits into
base: main
Choose a base branch
from

Conversation

uwedolinsky
Copy link
Contributor

@uwedolinsky uwedolinsky commented Feb 12, 2025

This PR removes some unnecessary copies and allocations from kernel launches in the NativeCPU adapter.

Corresponding DPC++ PR: intel/llvm#16990

@uwedolinsky uwedolinsky requested a review from a team as a code owner February 12, 2025 17:16
@github-actions github-actions bot added the native-cpu Native CPU adapter specific issues label Feb 12, 2025
// TODO: avoid calling clear() here.
hKernel->_localArgInfo.clear();
});
event->set_callback([event]() { event->tick_end(); });
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was the _localArgInfo.clear(); here? Was it a workaround for another issue? Is whatever reason it had for being there being addressed in a different way after your PR?

Copy link
Contributor Author

@uwedolinsky uwedolinsky Feb 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear why that call was needed, there doesn't seem to be a test that is enabled for it, and I'm not aware of any regression in any of the test pipelines because of this change.
.
I removed it also because it was not in the right place as the callback is called after the kernel was released (which may be wrong), and this just happened to not crash because the kernel pointer still seems to be valid likely due to an issue with reference counting. I've taken a note to investigate.

The call was likely related to hierarchical which doesn't seem to work properly yet anyway on this adapter.

if (groupsPerThread) {
for (unsigned thread = 0; thread < numParallelThreads; thread++) {
futures.emplace_back(
tp.schedule_task([groups, thread, groupsPerThread,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we're cleaning this up, it doesn't make sense that each task gets its own independent copy of groups, it causes the lambda to be more expensive than it should be. We should be able to create groups once, have each lambda maintain a pointer to it, and add its deletion to the cleanup code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a known issue and this PR was not going to address this. We are looking into removing the groups variable altogether, but this would be for subsequent PR.

threadpool.schedule([=](size_t threadId) { (*workerTask)(threadId); });
template <class T> auto schedule_task(T &&task) {
auto workerTask =
std::make_shared<std::packaged_task<void(size_t)>>(std::move(task));
Copy link
Contributor

@hvdijk hvdijk Feb 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should forward. If T is inferred as an lvalue reference, it's important to not move because the caller isn't going to be expecting it to be changed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this is intended to always move and all argument tasks should come through as rvalues. I've added an enable_if to this function to ensure it cannot be called if T deduces to an lvalue reference.

@@ -63,6 +63,23 @@ static native_cpu::state getResizedState(const native_cpu::NDRDescT &ndr,
}
#endif

class LaunchInfoLocalArgs {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming wise, it might be better to have something like LaunchInfoGroup, that's closer to what it is, I think. The fact that it calls handleLocalArgs() is less what this class is for and more just something that happens to be needed to be done for each group.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed as suggested.

const size_t numParallelThreads;

public:
LaunchInfoLocalArgs(const native_cpu::state &state_, unsigned g0_,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for the _ suffixes, there is no conflict between the parameter and field names, I think we don't do that elsewhere either? For instance, in kernel.hpp:

  ur_kernel_handle_t_(ur_program_handle_t hProgram, const char *name,
                      nativecpu_task_t subhandler)
      : hProgram(hProgram), _name{name}, _subhandler{std::move(subhandler)} {}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed the suffixes. It required prefixing the access to state with this-> which is probably better for readability.

std::vector<LaunchInfoLocalArgs> groups;
const auto numWG0 = ndr.GlobalSize[0] / ndr.LocalSize[0];
const auto numWG1 = ndr.GlobalSize[1] / ndr.LocalSize[1];
const auto numWG2 = ndr.GlobalSize[2] / ndr.LocalSize[2];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not directly related to this PR, but do we require global size to be a multiple of local size (for each dimension)? If so, an assert would be useful. If not, this looks suspicious. Either way, no need to change that as part of this PR unless you want to, if changes are needed we can do that in a new PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've taken a note to investigate in another PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
native-cpu Native CPU adapter specific issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants