Should I expect this to support "zero-copy" data loading in some way? #1109

Closed
amn opened this issue Mar 8, 2021 · 2 comments

Comments

amn commented Mar 8, 2021

As someone who wrote a WebAssembly (WASM) module to process data which in practice may wholly be contained in very large files, I stand in front of a problem where the most practical solution would seem to be embracing the Streams API to avoid having the user agent allocate as much memory as an entire [large] file, instead relying on streams providing the data in successive chunks for WASM code to consume. Copying each chunk from the array buffer returned by reading the chunk, to WASM module memory, would, however, seem unavoidable.
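For readers following along, the copy-per-chunk approach described above might look roughly like the sketch below. It assumes the stream yields Uint8Array chunks and that the caller has already made sure the memory is large enough; `copyStreamToMemory` and `destOffset` are made-up names for illustration, and the small in-memory stream merely stands in for a large file.

```javascript
// Sketch: drain a stream chunk by chunk and copy each chunk into WASM memory.
// `copyStreamToMemory` and `destOffset` are hypothetical names, not part of any API.
async function copyStreamToMemory(stream, memory, destOffset) {
  const reader = stream.getReader();
  let written = 0;
  for (;;) {
    const { value, done } = await reader.read();
    if (done) break;
    // Re-create the view on every iteration: memory.buffer is replaced when the memory grows.
    const target = new Uint8Array(memory.buffer, destOffset + written, value.byteLength);
    target.set(value); // the one copy per chunk that the comment calls unavoidable
    written += value.byteLength;
  }
  return written;
}

// Usage: a one-page (64 KiB) memory and a tiny in-memory stream standing in for a file.
const memory = new WebAssembly.Memory({ initial: 1 });
const chunks = new ReadableStream({
  start(controller) {
    controller.enqueue(new Uint8Array([1, 2, 3]));
    controller.enqueue(new Uint8Array([4, 5]));
    controller.close();
  },
});
const copied = copyStreamToMemory(chunks, memory, 16);
```

Only one chunk is held in JavaScript at a time, so peak memory stays near the chunk size rather than the file size, at the cost of one copy per chunk.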

Some point to the WASM working group as the one that should amend its programming model to facilitate efficient data processing in these kinds of cases, for example by allowing WASM modules to access multiple WASM memories (on the horizon for WASM, evidently) or by allowing the script host invoking WASM to juggle such memories in and out of reach of WASM, including outright constructing memory objects out of existing ArrayBuffer buffers and then offering these as memory to WASM code.

One counterargument that has been mentioned is that WASM has requirements with regard to alignment and sizing of its memory objects, which go beyond requirements imposed by the user agent on say, ArrayBuffer objects, making the aforementioned feature requests impractical.

The thing is, I agree with the above counterargument -- I think WASM memory is an object of a class best suited to be controlled from inside the module; after all, relinquishing ownership of its memories brings with it additional complexity for future WASM design and none of the rest of the script host -- meaning JavaScript -- benefits directly, unless it uses WASM.

Where does the Streams API come in here, and why am I bringing this up here?

Well, apart from WASM, it stands to reason that JavaScript applications would also benefit from APIs that use views, as opposed to mandating the return of a new ArrayBuffer object every time a data loading operation is done.
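To make the distinction between a view and a new buffer concrete, here is a small sketch using only standard typed array calls. The variable names are arbitrary.

```javascript
// Views share the underlying ArrayBuffer; slice() allocates a new one.
const buffer = new ArrayBuffer(8);
const whole = new Uint8Array(buffer);
const tail = new Uint8Array(buffer, 4, 4); // view over bytes 4..7, no copy
const sub = whole.subarray(4, 8);          // another view over the same bytes, no copy
const copy = whole.slice(4, 8);            // copies the bytes into a new ArrayBuffer

whole[4] = 42; // visible through `tail` and `sub`, but not through `copy`
```

An API that accepts a view can write into memory the caller already owns, whereas an API that always returns a fresh buffer forces an allocation per call.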

Does the Streams API facilitate this -- loading data into views, to save on copy operations? I am not sure, having learned about "BYOB" readers, it would seem these were the solution here, but why can't I do this then:

new Blob([ "foobar" ]).stream().getReader({ mode: "byob" });

Maybe I have misunderstood BYOB in the context of streams, but what I think would be beneficial is being able to read data from opaque blobs (among other opaque sources) into a much more tangible array buffer that already exists, to save on a future copy operation in the script -- using the read(view) of the obtained reader above would be just the thing, wouldn't it? Except it doesn't work -- apparently streams vended by blobs are not "byte streams". Forgive my ignorance, and the spec may have penetrated too deep into practical application here -- but shouldn't the above be a perfect use-case for zero-copy loading of file data into memory available to both the script and any WASM module it may run (which could use Memory.prototype.buffer to make a view on the memory and hand it to a BYOB reader's read call)?
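As a rough illustration of the failure mode: requesting a BYOB reader on a stream that is not a byte stream throws a TypeError, so it can be feature-detected. The helper name `supportsByob` is invented here, and the two hand-built streams stand in for whatever a given platform API returns; whether Blob.prototype.stream() yields a byte stream depends on the implementation.

```javascript
// Hypothetical helper: reports whether a BYOB reader can be acquired for a stream.
// Note that getReader() also throws a TypeError for a locked stream,
// so a locked byte stream is reported as `false` too.
function supportsByob(stream) {
  try {
    const reader = stream.getReader({ mode: "byob" });
    reader.releaseLock(); // leave the stream usable for the caller
    return true;
  } catch (error) {
    if (error instanceof TypeError) return false;
    throw error;
  }
}

const defaultStream = new ReadableStream();               // "regular" readable stream
const byteStream = new ReadableStream({ type: "bytes" }); // readable byte stream

const defaultSupportsByob = supportsByob(defaultStream);
const byteSupportsByob = supportsByob(byteStream);
```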

But perhaps the Streams API is the wrong API to change or extend to make scenarios like the above work?

I've read about a dozen issues related to the same "zero copy" umbrella feature request peeking in through the details (zero copy -- an order of magnitude less overhead), but these either focus on WebAssembly -- as if without it there isn't much need to shift to relying on views, where possible -- or appear to chase a rabbit hole of OOP abstractions since around 2014.

What part do you think this specification will play in shifting an entire portfolio of current approaches, which create a new array buffer for every data loading operation, into something fundamentally relying on views? We don't even necessarily have to consider multi-threading beyond what it already relies upon -- object transfer. If we can transfer the same buffer between threads to make a safe programming model on the Web, I don't see what complications multiple views on the same buffer add to that?
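For context on how transfer interacts with views, the sketch below shows one observable consequence: transferring an ArrayBuffer detaches it, and every view created on it becomes empty at the same time. It assumes a runtime that provides the global `structuredClone` with the `transfer` option; the same transfer list is what `postMessage` uses between threads.

```javascript
// Two views over different halves of the same buffer.
const shared = new ArrayBuffer(16);
const viewA = new Uint8Array(shared, 0, 8);
const viewB = new Uint8Array(shared, 8, 8);
viewA[0] = 7;

// Transfer ownership of the buffer (same mechanism as postMessage's transfer list).
const moved = structuredClone(shared, { transfer: [shared] });

// After the transfer, `shared` is detached: it and both views report a length of zero,
// while `moved` holds the original 16 bytes.
const lengths = [shared.byteLength, viewA.byteLength, viewB.byteLength, moved.byteLength];
```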

I hope I am making sense with this -- I guess I am frustrated that there are so many APIs that rely on buffers, yet there is next to nothing to avoid excessive, fundamentally unnecessary copying, and neither WebAssembly nor threads appear in my limited understanding to be standing in the way. Yes, we have TypedArray.prototype.set, which is a little gem buried deep in the APIs. The streams API, to my understanding, was motivated by needing better consumption of big data -- to that end, zero copy operations where possible are a continuation of the same direction, so perhaps this is the API to amend?

Of course I might as well ask authors of the File API whether they can add a load(view) method to the Blob class, but I honestly don't know which thread is best to pull. Fixing it in one place probably makes fixing it elsewhere unnecessary and saves on work effort.

@amn amn closed this as completed Mar 8, 2021
Author

amn commented Mar 8, 2021

I figured this was too general to address here and that I should think more about exactly where (in which APIs) the fundamental issue stems from. I don't think streams alone are it, although something the Streams API comes up with may help alleviate current concerns.

@MattiasBuelens
Collaborator

Your question is absolutely valid, and the Streams API can and should play an important role in this.

> Does the Streams API facilitate this -- loading data into views, to save on copy operations? I am not sure, having learned about "BYOB" readers, it would seem these were the solution here, but why can't I do this then:
>
> new Blob([ "foobar" ]).stream().getReader({ mode: "byob" });

The short answer: we're not there yet. Although the specification for readable byte streams has existed for a while, the first implementation has only started shipping very recently with Chrome 89. And right now, they aren't yet integrated into the rest of the Web platform:

> Forgive my ignorance, and the spec may have penetrated too deep into practical application here -- but shouldn't above be a perfect use-case for zero-copy loading of file data into memory available to both the script and any WASM module it may run (which could use Memory.prototype.buffer to make a view on the memory and hand it to a BYOB reader's read call)?

I agree that this should work. You should be able to "reserve" a portion of your WASM memory to hold the received data, create a Uint8Array view on that portion and let the readable byte stream write data directly into that view, without needing to pass through the JavaScript heap.

Unfortunately, that doesn't work. ReadableStreamBYOBReader.read(view) transfers the view's backing ArrayBuffer, so that the stream has exclusive access to the buffer while it's being filled. (Eventually, you get access to the buffer back through the fulfillment value of the read(view) promise.) However, the ArrayBuffer of a WebAssembly.Memory object is not transferable, so you can't actually pass a view that is backed by such a buffer to read(view). 😞
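The transfer behaviour can be observed with an ordinary ArrayBuffer. The following is only a sketch against a hand-built byte stream; `readOnceByob` is a made-up helper, and the in-memory source stands in for a real one.

```javascript
// Hypothetical helper: performs a single BYOB read and records what happens to the buffer.
function readOnceByob(bytes) {
  const stream = new ReadableStream({
    type: "bytes",
    start(controller) {
      controller.enqueue(bytes.slice()); // enqueue a copy; the stream takes ownership of it
      controller.close();
    },
  });
  const reader = stream.getReader({ mode: "byob" });
  const original = new ArrayBuffer(8);
  const pending = reader.read(new Uint8Array(original));
  // read(view) transfers `original` before returning, so it is already detached here.
  return { original, lengthAfterRead: original.byteLength, pending };
}

const demo = readOnceByob(new Uint8Array([1, 2, 3]));
demo.pending.then(({ value }) => {
  // `value` is a view on a different ArrayBuffer object that carries the same memory;
  // this is how the caller gets the buffer back.
  console.log(demo.lengthAfterRead, value.byteLength, value.buffer === demo.original);
});
```

Because the original buffer object is detached by the call, the next read has to use `value.buffer` rather than the buffer that was passed in.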

I don't know if there's any intention to make this work, or if it's even possible to support this? I suppose things could get complicated very quickly, for example if the WebAssembly memory needs to grow while a readable byte stream is still read()ing into it...
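To illustrate the growth concern: growing a (non-shared) WebAssembly.Memory detaches its previous ArrayBuffer, so any view created before the call becomes empty. A minimal sketch, with arbitrary variable names:

```javascript
// One page is 64 KiB; allow growth up to four pages.
const wasmMemory = new WebAssembly.Memory({ initial: 1, maximum: 4 });
const before = wasmMemory.buffer;
const staleView = new Uint8Array(before, 0, 16);

wasmMemory.grow(1); // add one page

// `before` and `staleView` are now detached; fresh views must be made on `after`.
const after = wasmMemory.buffer;
```

A stream holding on to a view of the old buffer while this happens would be left writing into nothing, which is presumably part of why the interaction is hard to specify.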

Right now, the best you can do is allocate a separate ArrayBuffer in JavaScript, read(view) into that buffer and then copy the data to the desired location in your WebAssembly.Memory. This is still better than using a "regular" readable stream (with ReadableStreamDefaultReader.read()), since you only allocate once and then re-use the buffer indefinitely for all future calls. With regular readable streams, every read() returns a new Uint8Array with a newly allocated ArrayBuffer. But yes, this is still one copy, not zero copy...
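A minimal sketch of that one-copy workaround, assuming the source is a readable byte stream and the memory already has room for the data. `pumpIntoWasm`, `destOffset` and `stagingSize` are hypothetical names, and the in-memory source below stands in for a real byte stream.

```javascript
// Sketch: read into a single reusable staging buffer, then copy into WASM memory.
async function pumpIntoWasm(stream, memory, destOffset, stagingSize = 64 * 1024) {
  const reader = stream.getReader({ mode: "byob" }); // throws TypeError if not a byte stream
  let staging = new ArrayBuffer(stagingSize);        // allocated once
  let written = 0;
  try {
    for (;;) {
      const { value, done } = await reader.read(new Uint8Array(staging));
      if (done) break;
      // The single copy: staging buffer -> WebAssembly.Memory.
      new Uint8Array(memory.buffer, destOffset + written, value.byteLength).set(value);
      written += value.byteLength;
      staging = value.buffer; // take the transferred buffer back for the next read
    }
  } finally {
    reader.releaseLock();
  }
  return written;
}

// Usage: 10 bytes pulled through a 4-byte staging buffer into a one-page memory.
const targetMemory = new WebAssembly.Memory({ initial: 1 });
const source = new ReadableStream({
  type: "bytes",
  start(controller) {
    controller.enqueue(new Uint8Array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]));
    controller.close();
  },
});
const total = pumpIntoWasm(source, targetMemory, 32, 4);
```

The destination view is re-created on every iteration so that the loop keeps working if the memory grows between reads.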

(By the way: if you happen to be using Rust for your "streaming to WebAssembly" use case, you may be interested in wasm-streams. 😉 No support for readable byte streams just yet, but I may have a go at it now that they're available in Chrome. 😄)
