feat: implement 0 dependency streaming multipart/form-data parser #1851
Conversation
Codecov Report: Base 90.43%, Head 90.26% (decreases project coverage by 0.17%).

Additional details and impacted files:
@@ Coverage Diff @@
## main #1851 +/- ##
==========================================
- Coverage   90.43%   90.26%   -0.17%
==========================================
  Files          71       76       +5
  Lines        6137     6463     +326
==========================================
+ Hits         5550     5834     +284
- Misses        587      629      +42
==========================================
☔ View full report at Codecov.
@KhafraDev How does the perf compare?
Needs benchmarks, but it's probably slower in some areas and the same in others. Busboy uses streamsearch (which implements the Boyer-Moore-Horspool algorithm) to search for boundaries after bodies etc., while this implementation is much... lazier/easier. Rather than doing that, I join chunks together until one of them contains the boundary, and then split the chunks accordingly. Of course, without benchmarks, most of that is just speculation about what's faster.
multipart/form-data is really old and historical; it isn't the best solution for parsing and sending files. I asked if we could at least add … you know what would be awesome? Having …
I was actually wondering why there wasn't a content-length header -- it's a really poor design overall. What makes websocket frame parsing so much easier is that you know the exact payload length!
// 1. Let result be the empty string.
let result = ''

// 2. While position doesn’t point past the end of input and the
// code point at position within input meets the condition condition:
while (position.position < input.length && condition(input[position.position])) {
  // 1. Append that code point to the end of result.
  if (inputIsString) {
    result += input[position.position]
If inputIsString, you might skip the while loop and provide a fast path for this function.
We can't skip the loop entirely; we still need to increase the index and check that each character matches the condition.
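For context, the collector under discussion looks roughly like this. This is a hedged sketch following the spec-style comments above; the real undici helper differs in its details:

```javascript
// Sketch of a "collect a sequence of code points" helper. `position` is a
// mutable { position } holder so the caller sees how far the scan advanced.
function collectASequenceOfCodePoints (condition, input, position) {
  // 1. Let result be the empty string.
  let result = ''

  // 2. While position doesn't point past the end of input and the code
  // point at position meets condition, append it and advance. The loop
  // can't be skipped even for strings: the index must still advance and
  // each character must still be tested against condition.
  while (position.position < input.length && condition(input[position.position])) {
    result += input[position.position]
    position.position++
  }

  // 3. Return result.
  return result
}

const pos = { position: 0 }
const token = collectASequenceOfCodePoints((c) => c !== ';', 'name=foo;rest', pos)
console.log(token) // 'name=foo'
console.log(pos.position) // 8
```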
  }
}

return false
Would chunk.find be faster? Inlined functions are faster most of the time due to aggressive V8 optimizations.
I doubt it; you can't specify a position to start at, so I'd have to split the buffer first.
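Worth noting for the buffer path: unlike `Array.prototype.find`, `Buffer#indexOf` does accept a `byteOffset` argument, so a scan can resume mid-buffer without slicing first. A small demonstration:

```javascript
// Buffer#indexOf takes an optional byteOffset, which lets a boundary scan
// continue from a previous match without creating intermediate slices.
const buf = Buffer.from('aa--boundary--bb--boundary--cc')

const first = buf.indexOf('--boundary')            // scan from the start
const second = buf.indexOf('--boundary', first + 1) // resume past the first hit

console.log(first, second) // 2 16
```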
I'm much more interested in replacing the asynchronous parsing with a synchronous parser. Not only is it actually spec compliant, but it will finally clean up the body consumers. It doesn't make sense to parse a multipart/form-data body asynchronously if it's already in memory. It's also many times slower than Deno/bun (https://twitter.com/jarredsumner/status/1625067963656335361). cc @jimmywarting I know you disagree with this; do you have any counterarguments before I replace busboy with the proposed alternative?
I don't think a synchronous multipart parser would work for us long term.
This doesn't seem to be an issue with Deno/bun (or any of the browsers, although their use case is different). There is also a proposal to add a limit to the max body size a FormData can have (whatwg/fetch#1592), and it's already an option on undici's Agent. There are alternative APIs in undici that are much better suited for large bodies. The main users of .formData IIRC are library authors who want cross-env compatibility; I wonder how many users are actively using it? One more note: it's incredibly slow. https://twitter.com/jarredsumner/status/1625067963656335361
There's also the issue of how a multipart/form-data body should be parsed: there's no actual spec for it. The WPT coverage sucks as well, while it's really good for every other mixin. Personally I'd like to follow the spec and the other platforms, which have all implemented the parsing synchronously. See #1694 for proof.
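As a rough illustration of why synchronous parsing is simple once the body is fully buffered, here is a minimal, deliberately incomplete boundary-splitting sketch. It is not undici's parser: it skips part-header parsing, quoting, preamble handling, and error cases, and the name `splitParts` is invented:

```javascript
// Split an already-buffered multipart body into its raw parts, synchronously.
function splitParts (body, boundary) {
  const delim = Buffer.from(`--${boundary}`)
  const parts = []
  let start = body.indexOf(delim)

  while (start !== -1) {
    const next = body.indexOf(delim, start + delim.length)
    if (next === -1) break // closing delimiter reached, nothing more to emit
    // Skip the delimiter plus the CRLF after it; trim the CRLF before the
    // next delimiter.
    parts.push(body.subarray(start + delim.length + 2, next - 2))
    start = next
  }

  return parts
}

const body = Buffer.from('--B\r\npart one\r\n--B\r\npart two\r\n--B--\r\n')
console.log(splitParts(body, 'B').map((p) => p.toString()))
// [ 'part one', 'part two' ]
```

Because the whole body is in memory, there is no waiting on chunk arrival, no partial-boundary bookkeeping, and no asynchronous state machine.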
I don't think we could ever be in agreement on this one. Crashing with out-of-memory errors when receiving files is not safe. Last time I checked, it seemed that Chrome used disk-based Blobs for at least some of those cases. The difference is that Node.js has just landed support for disk-backed Blobs, so we will soon be able to use this ;).
I agree that disk-based blobs/files would be the best option and that async streaming is best for not hitting the RAM limit. I would rather have something that works and runs a bit slower than (as @mcollina puts it) "not ship a feature at all" with the risk of out-of-memory issues.

It would also be technically possible to write the whole formdata payload to one single file first and then slice it (virtually, by just changing the offset + size of where it should start/stop reading from):

const blob = fs.openAsBlob(path) // could include all formdata entries
const file1 = new File([blob.slice(100, 1024)], entry.name, { type: entry.mimeType })

So if you want to write all the data to disk first and then synchronously search the file after each boundary to look up where each file entry begins/ends, that could also be an option. Currently undici's url-encoded formdata decoder is synchronous, so I'm wondering whether this (https://twitter.com/jarredsumner/status/1625067963656335361) is measuring a URL-encoded payload or a multipart-encoded payload, because those are two completely different paths to solving the issue at hand.
If we are going to document the downsides of formdata, then we shouldn't provide them with an alternative solution for how they should do it themselves using busboy and streams; it should just work as it's intended to work. So if we should document anything at all, then it should be: …
I agree that we probably won't find middle ground for this 😄.
That's partially the issue: it doesn't work as intended, if by "intended" we're referring to the spec and/or user expectations. There are other slightly less noticeable issues and pretty crappy workarounds we're doing already to match the spec (see lines 461 to 462 in 06f77a9).
Anyways, this branch is mostly done: it passes all of busboy's tests and the same set of WPTs that are enabled. I'll finish up the docs eventually, and I think we're good to go? Although I've been holding off because I know there will inevitably be issues that I won't have motivation to fix... Sticking with busboy is also problematic because it doesn't seem well maintained; there are a number of issues and pull requests open. Plus I've already spent many hours working on this lol
dd02fe5 to e21de08 (compare)
@KhafraDev What's the status on this? Are we using file-backed Blobs?
It's mostly done, I just need to spend a couple hours adding docs and cleaning stuff up. Is there a way to opt in to using the file-backed Blobs, or is it automatic?
Perhaps if … EDIT: I tested, and if the response was over a certain size it would be offloaded to the disk. I suppose the same logic could be done with FormData: if it encounters a file, start reading n bytes; if it's larger than that, then dump it to the file system and pipe the rest of the data to that file.
@KhafraDev did you give up on this?
Yes, the parsing should be done synchronously to match the spec.
This PR implements a multipart/form-data parser with a 1:1 match to busboy's API. This took me about a week and many, many hours to complete. I believe docs and types are still needed, but I don't feel like adding them right now.
Tests from busboy are included, and every single test works without any modifications!
c8/nyc was used for coverage in the busboy tests (busboy uses vanilla node for testing).

Bug(s):
- path.basename returns different values on Windows & Linux (C:\\files\\1k_b.dat) (fixed in 56052ec)
- reading headers is extremely slow (fixed in b7b9645; parsing headers should now be O(n))
- in certain circumstances FileStream will not emit the 'end' event (fixed in dd02fe5)