Operation-specific APIs #2
This proposal is on the 2021-01-07 call agenda. |
Is it possible to share some example function prototypes of the proposed atomic operations? Given that the proposal calls for independent functions that are not tied to the graph, it would be informative to understand the nature of the data types that flow through such functions and how they would be consumed by the callers.
Here's a little more detail, after talking with Ping. A simple JavaScript API could accept WebGL TextureData as input/output tensors. // create op Texture data definition for WebGL export interface TextureData { We can discuss in the next conference call. |
Background
As described in this issue for the WebML community group, the proposal aims to define a small set of compute-intensive operations (like convolution 2D and matrix multiplication) that are often the target of hardware acceleration. These are atomic APIs, and would not be tied to a graph or model loader implementation.
Since these APIs do not target full graph execution, they are useful for JavaScript ML frameworks like TensorFlow.js, allowing them to access OS- or hardware-level acceleration that cannot be achieved through the existing Web APIs (WebAssembly, WebGL or WebGPU).
The typical way for frameworks to use the op-level API is some kind of delegation mechanism. The ML framework is responsible for loading and interpreting the ML model. During the model execution phase, the model executor iterates through the ops and polls all backends for each op. The backends that support the op are sorted by priority, and the one with the highest priority is selected to execute the op.
This mechanism maximizes the number of supported ops overall, but it faces data IO efficiency issues when switching from one backend to another. The latency of transferring data between backends could potentially outweigh the acceleration benefit provided by the op API.
In reality, users construct data pipelines for ML-related tasks, and model execution is only one of the steps. For example, in an AR task (lipstick try-on), the data pipeline contains many pre- and post-processing steps, which typically take place on the GPU. Model execution should then also happen on the GPU, to avoid unnecessary copying between GPU and CPU.
Solution
A graph-level API is one way to avoid the IO bottleneck, by executing as large a portion of the model graph as possible with a single backend. The other way is to lock the op API to a single accelerator (CPU or GPU), in order to reduce IO transfers between accelerators.
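The delegation mechanism described in the Background can be sketched roughly as follows. This is only an illustration under stated assumptions: the backend names, priorities, supported-op sets, and the `selectBackend` / `planExecution` helpers are all hypothetical and do not come from TensorFlow.js or any other framework. The sketch also counts how often consecutive ops land on different backends, which is where the data transfer cost would show up.

```javascript
// Hypothetical backend registry. Names, priorities, and supported ops are
// illustrative assumptions, not a real framework API.
const backends = [
  {name: 'webnn-op', priority: 3, supportedOps: new Set(['conv2d', 'matmul'])},
  {name: 'webgl', priority: 2, supportedOps: new Set(['conv2d', 'matmul', 'relu'])},
  {name: 'wasm', priority: 1, supportedOps: new Set(['conv2d', 'matmul', 'relu', 'softmax'])},
];

// Poll all backends for one op and pick the supporting one with the
// highest priority.
function selectBackend(opType, candidates) {
  const supporting = candidates.filter((b) => b.supportedOps.has(opType));
  if (supporting.length === 0) {
    throw new Error(`No backend supports op "${opType}"`);
  }
  // Sort a copy in descending priority order and take the first entry.
  return [...supporting].sort((a, b) => b.priority - a.priority)[0];
}

// Walk the ops in execution order, assign each one a backend, and count
// how many times consecutive ops end up on different backends.
function planExecution(ops, candidates) {
  let previous = null;
  let switches = 0;
  const plan = ops.map((opType) => {
    const backend = selectBackend(opType, candidates);
    if (previous !== null && previous !== backend.name) {
      switches += 1;
    }
    previous = backend.name;
    return {op: opType, backend: backend.name};
  });
  return {plan, switches};
}

const {plan, switches} = planExecution(['conv2d', 'relu', 'matmul', 'softmax'], backends);
console.log(plan.map((step) => `${step.op} -> ${step.backend}`).join(', '));
// conv2d -> webnn-op, relu -> webgl, matmul -> webnn-op, softmax -> wasm
console.log(`backend switches: ${switches}`);
// backend switches: 3
```

With these made-up priorities, every op is supported, but a four-op model crosses a backend boundary three times. Each crossing implies a tensor transfer, which is the IO concern raised above.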
Most JavaScript ML frameworks try to access hardware acceleration through existing Web APIs, for example WebAssembly for CPU acceleration (SIMD, multi-threading), and WebGL or WebGPU for GPU acceleration. For example, TensorFlow.js has three backends (WebGL, WebAssembly and WebGPU), and models are typically executed within one of the backends. If the WebML op-level API targets existing Web APIs for its data binding mechanism, it can be easily incorporated into the different backends that JavaScript ML platforms provide.
CPU - WebAssembly
For example, in WebAssembly, a Tensor can be represented as follows:
The WebML op API will need to provide a C++ header file that exposes kernel compilation and execution APIs.
It would likely need to copy the data in and out of the WebAssembly memory heap to the place where the acceleration happens (e.g. Intel SIMD instructions in OpenVINO). Since we are only targeting CPU acceleration with WebAssembly, the data transfer is bound within the CPU. Even with the added overhead, the performance should still be much faster than a pure WebAssembly implementation. This would be similar to how TensorFlow.js currently utilizes the XNNPACK library from TFLite in WebAssembly. XNNPACK provides CPU acceleration utilizing WebAssembly SIMD for around 20 kernels.
GPU
When the data is already on the GPU, the Tensor data could be stored as a texture for WebGL or a memory buffer for WebGPU. Supporting these input types could be valuable.
Texture data definition for WebGL
The WebML op API would provide a JS API that accepts WebGL TextureData, similar to the above, as input/output tensors.
Drawbacks
|
@pyu10055, thanks much for sharing the details. It is very helpful to understand the op delegation mechanism of JavaScript ML frameworks, like TensorFlow.js. I suppose WebNN is able to support this mechanism by single-op graphs. And thanks to the support of pre-allocated buffers, the memories represented by
async function webml_create_conv2d(input_tensor_info, filter_tensor_info, output_tensor_info, params) {
  const nn = navigator.ml.getNeuralNetworkContext();
  const builder = nn.createModelBuilder();
  const input = builder.input('input', {type: 'float32', dimensions: input_tensor_info.shape});
  const filter = builder.constant({type: 'float32', dimensions: filter_tensor_info.shape}, filter_tensor_info.f32);
  const output = builder.conv2d(input, filter, params);
  const op = builder.createModel({output});
  return {
    type: 'conv2d',
    compiledOp: await op.compile(),
    inputs: {'input': {buffer: input_tensor_info.f32}},
    outputs: {'output': {buffer: output_tensor_info.f32_write}}
  };
}
async function webml_run_kernel(op) {
  // op.type === 'conv2d'
  await op.compiledOp.compute(op.inputs, op.outputs);
}
// Emulate the heap of WebAssembly code
const heap = new WebAssembly.Memory({initial: 1}).buffer;
// Emulate the `TensorInfo` struct by JS object:
// - add `shape` field that describes the tensor shape
// - only support `f32` and `f32_write` for sake of simplicity
const input_tensor_info = {'shape': [1, 1, 5, 5], 'f32': new Float32Array(heap, 0, 25), 'f32_write': new Float32Array(heap, 0, 25)};
const filter_tensor_info = {'shape': [1, 1, 3, 3], 'f32': new Float32Array(heap, 25 * 4, 9), 'f32_write': new Float32Array(heap, 25 * 4, 9)};
const output_tensor_info = {'shape': [1, 1, 3, 3], 'f32': new Float32Array(heap, 34 * 4, 9), 'f32_write': new Float32Array(heap, 34 * 4, 9)};
// create op
filter_tensor_info.f32_write.fill(1);
const conv2d_op = await webml_create_conv2d(input_tensor_info, filter_tensor_info, output_tensor_info);
// execute the op
input_tensor_info.f32_write.fill(1);
await webml_run_kernel(conv2d_op);
console.log(`output values: ${output_tensor_info.f32}`);
// output values: 9,9,9,9,9,9,9,9,9
// execute the op with different input
input_tensor_info.f32_write.fill(2);
await webml_run_kernel(conv2d_op);
console.log(`output values: ${output_tensor_info.f32}`);
// output values: 18,18,18,18,18,18,18,18,18
You can copy and paste the above sample code into the WebNN code editor and try it. Please click the Edit button before pasting.
Probably, we could incorporate the op delegation mechanism into the WebNN framework use cases and explainer, and support it better with some enhancements, for example:
Actually, 2 and 3 are not specific to single-op graph execution; they would also benefit multi-op graph execution. Any other thoughts? @wchao1115 @anssiko
Thanks @jbingham and @pyu10055 for bringing forward this discussion. It touches on a few issues, which I'll try to summarize here:
I think a key question is whether addressing these issues would warrant defining a new set of APIs altogether. As @huningxin pointed out in his reply, #1 can be addressed simply by allowing the framework direct access to the WebNN convolution operation, and #2 by extending the WebNN API to support native tensor data types, something we would need to consider anyway regardless of this discussion in order to avoid excessive copying in high-bandwidth visual scenarios. Arguably issue #3 is a framework's policy, which may vary among different framework implementations. But by defining WebNN as a graph API, we implicitly influence a single-device design from the get-go, thus side-stepping the issue altogether. We adopted the same mentality when we designed the DirectML graph API to allow for additional graph transforms, to avoid device stalling, and to reduce internal data transfers. What I want to add here is that we might also need to consider eager execution in WebNN. That way a graph's compile step could also become optional.
Thanks @wchao1115 for a great summary.
I'd like to add that performance-critical tasks may involve some kind of operation fusion. For example, convolution + bias add + activation are normally fused for performance optimization in native ML APIs, e.g. Convolution of oneDNN. The fused ops are also used by JS ML frameworks, e.g. the FusedConv2d op of TensorFlow.js. For such fused ops, the JS ML frameworks may still need to create a small WebNN graph that wires conv2d, element-wise add and one of the activations, so the implementation could compile that graph into a fused native convolution op.
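To make the fusion idea concrete, here is a minimal plain-JavaScript sketch, assuming a 1-D convolution with a scalar bias and a ReLU activation for brevity. The function names are hypothetical and this is not the WebNN, oneDNN, or TensorFlow.js implementation. It only shows that a fused kernel produces the same result as three separate ops while making a single pass and allocating no intermediate tensors.

```javascript
// 1-D "valid" convolution (cross-correlation, as ML frameworks usually define it).
function conv1d(input, filter) {
  const outLength = input.length - filter.length + 1;
  const out = new Float32Array(outLength);
  for (let i = 0; i < outLength; i++) {
    let acc = 0;
    for (let k = 0; k < filter.length; k++) {
      acc += input[i + k] * filter[k];
    }
    out[i] = acc;
  }
  return out;
}

// Separate element-wise ops; each one allocates a new intermediate tensor.
const addBias = (x, bias) => x.map((v) => v + bias);
const relu = (x) => x.map((v) => Math.max(0, v));

// Fused version: convolution, bias add, and ReLU in one pass over the output.
function fusedConv1dBiasRelu(input, filter, bias) {
  const outLength = input.length - filter.length + 1;
  const out = new Float32Array(outLength);
  for (let i = 0; i < outLength; i++) {
    let acc = bias;
    for (let k = 0; k < filter.length; k++) {
      acc += input[i + k] * filter[k];
    }
    out[i] = Math.max(0, acc);
  }
  return out;
}

const input = new Float32Array([1, -2, 3, -4, 5]);
const filter = new Float32Array([1, 0, -1]);
const bias = 0.5;

const unfused = relu(addBias(conv1d(input, filter), bias));
const fused = fusedConv1dBiasRelu(input, filter, bias);
console.log(`unfused: ${unfused}`); // unfused: 0,2.5,0
console.log(`fused:   ${fused}`);   // fused:   0,2.5,0
```

A graph compiler that sees conv2d followed by add and an activation can, in principle, apply the same rewrite, which is why a small three-node graph is enough for the implementation to pick a fused native kernel.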
In the Operation-specific APIs discussion on the 18 March 2021 call, we agreed to address the requirements laid out in this proposal in the WebNN API. @huningxin, @wchao1115 and @RafaelCintron will open issues in the webnn repo to track the remaining work identified. First changes landed in webmachinelearning/webnn#149 already. We'll keep this issue open until the requirements have been satisfied. Thanks @pyu10055 and @jbingham for explaining this important use case. @huningxin feel free to open an issue to update the WebNN API framework use cases accordingly.
Done. webmachinelearning/webnn#154. @pyu10055 , @jbingham @wchao1115 please take a look. Thanks! |
I opened webmachinelearning/webnn#157 in response to @RafaelCintron's comment on our call. @wchao1115 @pyu10055 please check all requirements derived from this issue have a corresponding webnn issue: https://github.com/webmachinelearning/webnn/issues |
Operation-specific APIs
This is a proposal to define and implement a small number of standalone APIs for individual compute-intensive operations (like convolution 2D and matrix multiplication) that are often the target of hardware acceleration. The APIs would be atomic, and would not be tied to a graph or model loader implementation. It would be up to JavaScript libraries or WASM to call into these low-level APIs.
Short description
Across many common machine learning models, there are a handful of compute-intensive operations that may account for 90-99% of inference time, based on the benchmarking done for WebNN. If these few operations were offered as standalone APIs, hardware acceleration could give much of the performance benefit with a small, simple API surface, without needing to define all of the many other instructions and graph topology needed for a higher-level API like a graph or model loader. As a benefit, it ought to be faster to get this handful of APIs shipped.
JavaScript ML libraries would need to be updated to take advantage of the APIs, just like they can take advantage of WebGL today.
Example use cases
Image classification typically uses convolution and matrix multiplication. With hardware accelerated versions of these two operations, the performance boost would be close to the optimal that could be achieved with a complete graph or model execution API.
A rough idea or two about implementation
Maybe the closest example is WebGL compute shaders, except that these operations would be much simpler.