Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Non-blocking clEnqueueWriteBuffer does not work in the runtime #204

Open
Bastacyclop opened this issue Aug 20, 2021 · 6 comments
Open

Non-blocking clEnqueueWriteBuffer does not work in the runtime #204

Bastacyclop opened this issue Aug 20, 2021 · 6 comments
Labels
bug Something isn't working

Comments

@Bastacyclop
Copy link
Member

In the "one copy" buffer runtime, deviceBufferSync uses clEnqueueWriteBuffer.
I would have thought that we could use a non-blocking call there, but that produces bugs (e.g. in #203, output is fixed by changing to a blocking call 3eb189a).
The idea was that before accessing the ptr on the host, there should be a blocking call to clEnqueueReadBuffer via hostBufferSync (which should be ordered after the clEnqueueWriteBuffer and most likely a kernel call on the command queue).

Any idea why the non-blocking call does not work and how it could be fixed? @fedepiz @michel-steuwer

The code where things go wrong:

#include "ocl/ocl.h"
struct foo_t {
  Kernel k0;
};

typedef struct foo_t foo_t;

void foo_init(Context ctx, foo_t* self){
  (*self).k0 = loadKernel(ctx, k0);
}

void foo_destroy(Context ctx, foo_t* self){
  destroyKernel(ctx, (*self).k0);
}

void foo_run(Context ctx, foo_t* self, Buffer moutput, int n1, Buffer me2){
  {
    Buffer mx101 = createBuffer(ctx, n1 * sizeof(int32_t), HOST_WRITE | DEVICE_READ);
    {
      int32_t* x101 = (int32_t*)hostBufferSync(ctx, mx101, n1 * sizeof(int32_t), HOST_WRITE);
      int32_t* e2 = (int32_t*)hostBufferSync(ctx, me2, n1 * sizeof(int32_t), HOST_READ);
      /* mapSeq */
      for (int i_110 = 0; i_110 < n1; i_110 = 1 + i_110) {
        x101[i_110] = ((int32_t)2) + e2[i_110];
      }
      
    }
    
    {
      DeviceBuffer b0 = deviceBufferSync(ctx, moutput, n1 * sizeof(int32_t), DEVICE_WRITE);
      DeviceBuffer b2 = deviceBufferSync(ctx, mx101, n1 * sizeof(int32_t), DEVICE_READ);
      const size_t global_size[3] = (const size_t[3]){n1 / 2, 1, 1};
      const size_t local_size[3] = (const size_t[3]){2, 1, 1};
      const KernelArg args[3] = (const KernelArg[3]){KARG(b0), KARG(n1), KARG(b2)};
      launchKernel(ctx, (*self).k0, global_size, local_size, 3, args);
    }
    
    destroyBuffer(ctx, mx101);
  }
  
}

void foo_init_run(Context ctx, Buffer moutput, int n1, Buffer me2){
  foo_t foo;
  foo_init(ctx, &foo);
  foo_run(ctx, &foo, moutput, n1, me2);
  foo_destroy(ctx, &foo);
}

const int N = 64;
int main(int argc, char** argv) {
  Context ctx = createDefaultContext();
  Buffer input = createBuffer(ctx, N * sizeof(int32_t), HOST_READ | HOST_WRITE | DEVICE_READ);
  Buffer output = createBuffer(ctx, N * sizeof(int32_t), HOST_READ | HOST_WRITE | DEVICE_WRITE);

  int32_t* in = hostBufferSync(ctx, input, N * sizeof(int32_t), HOST_WRITE);
  for (int i = 0; i < N; i++) {
    in[i] = 0;
  }

  foo_init_run(ctx, output, N, input);

  int32_t* out = hostBufferSync(ctx, output, N * sizeof(int32_t), HOST_READ);

  for (int i = 0; i < N; i++) {
    if (out[i] != 3) {
      fprintf(stderr, "wrong output: %i\n", out[i]);
      exit(EXIT_FAILURE);
    }
  }

  destroyBuffer(ctx, input);
  destroyBuffer(ctx, output);
  destroyContext(ctx);
  return EXIT_SUCCESS;
}
@Bastacyclop Bastacyclop added the bug Something isn't working label Aug 20, 2021
@Bastacyclop
Copy link
Member Author

buffer_one_copy.c

@fedepiz
Copy link
Contributor

fedepiz commented Aug 20, 2021

I suspect this may be because the clEnqueueWriteBuffer may not be blocking for the device as well as the host. So even enqueing
WRITE -> KERNEL -> READ_BLOCKING does not mean that KERNEL starts after WRITE ends.
I suspect what we want is rather
WRITE -> WAIT(event from the write) -> KERNEL -> WAIT(event from the kernel) -> READ_BLOCKING

In general, the specification claims that the application cannot use the target pointer of a non-blocking write before waiting on the event. I originally took it to only be relevant for host-pointers, but the spec doesn't specify this. It jus says "the application" and "the memory pointed"

Link to relevant docs:
https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/clEnqueueWriteBuffer.html

@Bastacyclop
Copy link
Member Author

@michel-steuwer
Copy link
Member

If I understand your speculations here correctly, you think there might be an issue with the synchronization of the memory operation and the kernel execution.

If you are using different command queues for memory operations and kernels, then you must use events to synchronize them. If you use only a single command queue for both (and you configure it to be in order), then every command will wait until all prior commands have finished before execution.

@fedepiz
Copy link
Contributor

fedepiz commented Aug 20, 2021

Mh. I am getting contradictory statements from the documentation.
One one hand, the https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/clCreateCommandQueue.html docs agree with what you say

If the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property of a command-queue is not set, the commands enqueued to a command-queue execute in order. For example, if an application calls clEnqueueNDRangeKernel to execute kernel A followed by a clEnqueueNDRangeKernel to execute kernel B, the application can assume that kernel A finishes first and then kernel B is executed. If the memory objects output by kernel A are inputs to kernel B then kernel B will see the correct data in memory objects produced by execution of kernel A.@

On the other hand, the documentations for https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/clEnqueueWriteBuffer.html states prosaically

If blocking_write is CL_FALSE, the OpenCL implementation will use ptr to perform a nonblocking write. As the write is non-blocking the implementation can return immediately. The memory pointed to by ptr cannot be reused by the application after the call returns. The event argument returns an event object which can be used to query the execution status of the write command. When the write command has completed, the memory pointed to by ptr can then be reused by the application.

Could it be that for kernel launches an in-order queue guarantees sequential execution, but the same is not true for a non-blocking clEnqueueWriteBuffer?

I think I may experiment on the side and see

EDIT:

A bit later, the docs on the command queue state

Similarly, commands to read, write, copy or map memory objects that are enqueued after clEnqueueNDRangeKernel, clEnqueueTask or clEnqueueNativeKernel commands are not guaranteed to wait for kernels scheduled for execution to have completed (if the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property is set). To ensure correct ordering of commands, the event object returned by clEnqueueNDRangeKernel, clEnqueueTask or clEnqueueNativeKernel can be used to enqueue a wait for event or a barrier command can be enqueued that must complete before reads or writes to the memory object(s) occur.

Which is a bit vague (strictly speaking, it doesn't't tell me what happens when OUT_OF_ORDER is disabled, or what happens if the events are BEFORE the kernel, as it said "after" the enqueuing of a kernel) but seems to imply @michel-steuwer a bit more. If that's the case, I am really not sure.

@Bastacyclop
Copy link
Member Author

Bastacyclop commented Aug 23, 2021

@fedepiz @michel-steuwer Here is a log of the runtime and associated OpenCL calls for this example:

hostBufferSync:    (none)                      // for input write
// host code: in[i] = 0;
hostBufferSync:    (none)                      // for temporary write
hostBufferSync:    (none)                      // for input read
// host code: x101[i] = 2 + e2[i];
deviceBufferSync:  (none)                      // for output write
deviceBufferSync:  clEnqueueWriteBuffer        // for temporary read
launchKernel:      clEnqueueNDRangeKernel
hostBufferSync:    clEnqueueReadBuffer         // for output read

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants