[NATIVECPU][perf] less copies and allocations #2699

uwedolinsky · 2025-02-12T17:16:21Z

This PR removes some unnecessary copies and allocations from kernel launches in the NativeCPU adapter.

Corresponding DPC++ PR: intel/llvm#16990

hvdijk · 2025-02-12T18:05:11Z

source/adapters/native_cpu/enqueue.cpp

-    // TODO: avoid calling clear() here.
-    hKernel->_localArgInfo.clear();
-  });
+  event->set_callback([event]() { event->tick_end(); });


Why was the _localArgInfo.clear(); here? Was it a workaround for another issue? Is whatever reason it had for being there being addressed in a different way after your PR?

It's not clear why that call was needed. No enabled test appears to depend on it, and I'm not aware of any regression in any of the test pipelines because of this change.

I also removed it because it was in the wrong place: the callback is called after the kernel has been released (which may itself be wrong), and it only happened not to crash because the kernel pointer still seems to be valid, likely due to an issue with reference counting. I've taken a note to investigate.

The call was likely related to hierarchical, which doesn't seem to work properly on this adapter yet anyway.

hvdijk · 2025-02-12T18:07:16Z

source/adapters/native_cpu/enqueue.cpp

+      if (groupsPerThread) {
+        for (unsigned thread = 0; thread < numParallelThreads; thread++) {
+          futures.emplace_back(
+              tp.schedule_task([groups, thread, groupsPerThread,


While we're cleaning this up: it doesn't make sense for each task to get its own independent copy of groups, which makes the lambda more expensive than it should be. We should be able to create groups once, have each lambda hold a pointer to it, and add its deletion to the cleanup code.

This is a known issue that this PR was not going to address. We are looking into removing the groups variable altogether, but that would be for a subsequent PR.
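To illustrate the suggestion above, here is a minimal sketch of building the group list once and letting every task capture a pointer to it. It is not the adapter's code: `Group` and `sumGroupIds` are hypothetical names, `std::shared_ptr` is swapped in for the raw pointer plus explicit cleanup that the review proposes, and the tasks run sequentially in place of a thread pool.

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for the adapter's per-work-group launch data.
struct Group {
  std::size_t id;
};

// Builds the group list once, then creates `numTasks` tasks that each capture
// a shared_ptr to it (a pointer copy) instead of a copy of the whole vector.
// The tasks run sequentially here; a real thread pool would run them on
// worker threads. Returns the sum of all group ids visited.
std::size_t sumGroupIds(std::size_t numGroups, std::size_t numTasks) {
  auto owned = std::make_shared<std::vector<Group>>();
  owned->reserve(numGroups);
  for (std::size_t i = 0; i < numGroups; ++i)
    owned->push_back(Group{i});
  // Read-only view shared by all tasks; freed once the last holder goes away.
  std::shared_ptr<const std::vector<Group>> groups = std::move(owned);

  std::vector<std::function<std::size_t()>> tasks;
  for (std::size_t t = 0; t < numTasks; ++t) {
    // Task t visits groups t, t + numTasks, t + 2 * numTasks, ...
    tasks.emplace_back([groups, t, numTasks] {
      std::size_t sum = 0;
      for (std::size_t i = t; i < groups->size(); i += numTasks)
        sum += (*groups)[i].id;
      return sum;
    });
  }

  std::size_t total = 0;
  for (auto &task : tasks)
    total += task();
  return total;
}
```

With this shape the per-task capture cost no longer grows with the number of groups. The trade-off is one atomic reference-count update per capture, which the raw-pointer variant from the review would avoid.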

hvdijk · 2025-02-12T18:15:47Z

source/adapters/native_cpu/threadpool.hpp

-    threadpool.schedule([=](size_t threadId) { (*workerTask)(threadId); });
+  template <class T> auto schedule_task(T &&task) {
+    auto workerTask =
+        std::make_shared<std::packaged_task<void(size_t)>>(std::move(task));


This should forward. If T is deduced as an lvalue reference, it's important not to move, because the caller won't expect its object to be modified.

No, this is intended to always move and all argument tasks should come through as rvalues. I've added an enable_if to this function to ensure it cannot be called if T deduces to an lvalue reference.
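A minimal sketch of the constraint described in this reply, assuming a simplified pool: `MiniPool`, `run_all`, and `can_schedule` are hypothetical names, and the queue uses `std::function` only to keep the example short (the PR itself moves away from it). The `enable_if` removes `schedule_task` from overload resolution whenever `T` deduces to an lvalue reference, so the unconditional `std::move` can never modify a caller's named object by surprise.

```cpp
#include <cstddef>
#include <functional>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical miniature pool; tasks are queued and later run on the caller.
class MiniPool {
  std::vector<std::function<void(std::size_t)>> queue;

public:
  // Only participates in overload resolution for rvalue arguments, so the
  // std::move below is always safe from the caller's point of view.
  template <class T,
            class = std::enable_if_t<!std::is_lvalue_reference_v<T>>>
  void schedule_task(T &&task) {
    queue.emplace_back(std::move(task));
  }

  // Runs every queued task, passing its queue position as a fake thread id.
  // Returns the number of tasks that ran.
  std::size_t run_all() {
    std::size_t ran = 0;
    for (auto &task : queue) {
      task(ran);
      ++ran;
    }
    queue.clear();
    return ran;
  }
};

// Detects whether pool.schedule_task(arg) compiles for an argument of type
// Task (use `Task &` to model passing an lvalue).
template <class Pool, class Task, class = void>
struct can_schedule : std::false_type {};

template <class Pool, class Task>
struct can_schedule<Pool, Task,
                    std::void_t<decltype(std::declval<Pool &>().schedule_task(
                        std::declval<Task>()))>> : std::true_type {};
```

A caller holding a named task has to write `pool.schedule_task(std::move(task))`, which makes the transfer of ownership visible at the call site. Passing `task` directly fails to compile.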

hvdijk · 2025-02-12T18:38:03Z

source/adapters/native_cpu/enqueue.cpp

@@ -63,6 +63,23 @@ static native_cpu::state getResizedState(const native_cpu::NDRDescT &ndr,
 }
 #endif

+class LaunchInfoLocalArgs {


Naming-wise, it might be better to have something like LaunchInfoGroup; that's closer to what it is, I think. Calling handleLocalArgs() is less what this class is for and more something that happens to be needed for each group.

Renamed as suggested.

hvdijk · 2025-02-12T18:40:22Z

source/adapters/native_cpu/enqueue.cpp

+  const size_t numParallelThreads;
+
+public:
+  LaunchInfoLocalArgs(const native_cpu::state &state_, unsigned g0_,


No need for the _ suffixes: there is no conflict between the parameter and field names, and I think we don't do that elsewhere either? For instance, in kernel.hpp:

ur_kernel_handle_t_(ur_program_handle_t hProgram, const char *name, nativecpu_task_t subhandler) : hProgram(hProgram), _name{name}, _subhandler{std::move(subhandler)} {}

I've removed the suffixes. It required prefixing the access to state with this-> which is probably better for readability.
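A short sketch of why `this->` becomes necessary once parameters and fields share names. The class below borrows only the name `LaunchInfoGroup` from the review; `State`, its `offset` field, and the accessors are hypothetical. In the member-initializer list, `state(state)` initializes the field from the parameter, but inside the constructor body the bare name refers to the parameter.

```cpp
#include <cstddef>

// Hypothetical stand-in for native_cpu::state.
struct State {
  std::size_t offset;
};

class LaunchInfoGroup {
  State state;
  unsigned g0;

public:
  // Parameters deliberately share the field names, as in the review comment.
  LaunchInfoGroup(const State &state, unsigned g0) : state(state), g0(g0) {
    // Here `state` is the const parameter; the writable member copy must be
    // reached through this->.
    this->state.offset = g0;
  }

  std::size_t offset() const { return state.offset; }
  unsigned group() const { return g0; }
};
```

Writing `state.offset = g0;` in the body would not compile in this sketch, because it would try to assign through the `const` parameter.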

hvdijk · 2025-02-12T18:42:44Z

source/adapters/native_cpu/enqueue.cpp

+  std::vector<LaunchInfoLocalArgs> groups;
+  const auto numWG0 = ndr.GlobalSize[0] / ndr.LocalSize[0];
+  const auto numWG1 = ndr.GlobalSize[1] / ndr.LocalSize[1];
+  const auto numWG2 = ndr.GlobalSize[2] / ndr.LocalSize[2];


Not directly related to this PR, but do we require global size to be a multiple of local size (for each dimension)? If so, an assert would be useful. If not, this looks suspicious. Either way, there's no need to change that as part of this PR unless you want to; if changes are needed, we can do that in a new PR.

I've taken a note to investigate in another PR.
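For reference, a minimal sketch of the assert the reviewer has in mind. The thread leaves open whether the adapter actually requires the global size to be a multiple of the local size, so this assumes that it does. `NDRange`, `isUniform`, and `numWorkGroups` are hypothetical names that mirror only the `GlobalSize` and `LocalSize` fields seen in the diff.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for native_cpu::NDRDescT.
struct NDRange {
  std::array<std::size_t, 3> GlobalSize;
  std::array<std::size_t, 3> LocalSize;
};

// True if every local size is non-zero and divides its global size exactly.
bool isUniform(const NDRange &ndr) {
  for (std::size_t d = 0; d < 3; ++d) {
    if (ndr.LocalSize[d] == 0 || ndr.GlobalSize[d] % ndr.LocalSize[d] != 0)
      return false;
  }
  return true;
}

// Number of work-groups per dimension. The assert documents the assumption
// that plain integer division loses no work-items.
std::array<std::size_t, 3> numWorkGroups(const NDRange &ndr) {
  assert(isUniform(ndr) && "global size must be a multiple of local size");
  std::array<std::size_t, 3> n{};
  for (std::size_t d = 0; d < 3; ++d)
    n[d] = ndr.GlobalSize[d] / ndr.LocalSize[d];
  return n;
}
```

Without the check, a global size of 10 with a local size of 4 would yield 2 work-groups and silently drop the last 2 work-items, which is the behaviour the reviewer flags as suspicious.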

uwedolinsky added 16 commits January 28, 2025 19:55

[NATIVECPU] remove some std::function usage

c7636df

[NATIVECPU] added LaunchInfo struct to reduced allocations

0e464f1

[NATIVECPU] file mode fix

44adfdd

[NATIVECPU] removed MS extension

6b370d8

[NATIVECPU] removed std::function from threadpool

a27648f

[NATIVECPU] don't enqeue lambdas without work

44bf01a

[NATIVECPU] removed lambda

bc68659

[NATIVECPU] removed _localArgInfo.clear

696733b

[NATIVECPU] replaced copy with move

8928164

[NATIVECPU] clang-format

c692aa2

[NATIVECPU] some const variables

97f5041

Merge commit '9d6542bac16db69d907002cb0f8c724c7a50d2dc' into uwe/nati…

b6656f4

…vecpu_perf

[NATIVECPU] added separate if statement for clarity

c5dc020

[NATIVECPU] clang-format

23436e8

Merge commit '064d3560cbcc24a40380b57adb3690b2bc256d6e' into uwe/nati…

9a08de3

…vecpu_perf

[NATIVECPU] removed old comment and unneeded capture

a4bda02

uwedolinsky requested a review from a team as a code owner February 12, 2025 17:16

github-actions bot added the native-cpu Native CPU adapter specific issues label Feb 12, 2025

[NATIVECPU] clang-format

68eb4e6

uwedolinsky mentioned this pull request Feb 12, 2025

[SYCL][NATIVECPU] less copies and allocations intel/llvm#16990

Open

hvdijk reviewed Feb 12, 2025

hvdijk reviewed Feb 12, 2025

uwedolinsky added 3 commits February 12, 2025 19:10

[NATIVECPU] variable naming

c0463b1

[NATIVECPU] variable naming

58bb64e

[NATIVECPU] removed shared_ptr, enforce always moving

a4c8d4e
