feat: implement 0 dependency streaming multipart/form-data parser #1851
Conversation
Codecov Report: Base 90.43%, Head 90.26% (decreases project coverage by 0.17%).

Additional details and impacted files:
@@ Coverage Diff @@
## main #1851 +/- ##
==========================================
- Coverage   90.43%   90.26%   -0.17%
==========================================
  Files          71       76       +5
  Lines        6137     6463     +326
==========================================
+ Hits         5550     5834     +284
- Misses        587      629      +42
==========================================
☔ View full report at Codecov.
@KhafraDev How does the perf compare?
Needs benchmarks, but it's probably slower in some areas and the same in others. Busboy uses streamsearch (which implements the Boyer-Moore-Horspool algorithm) to search for boundaries after bodies etc., while this implementation is much... lazier/easier. Rather than doing that, I join chunks together until one of them contains the boundary, and then split the chunks accordingly. Of course, without benchmarks, most of that is just speculation about what's faster.
multipart/form-data is really old and historical; it isn't the best solution for parsing and sending files. I asked if we could at least add … you know what would be awesome? Having …
I was actually wondering why there wasn't a content-length header -- it's a really poor design overall. What makes websocket frame parsing so much easier is that you know the exact payload length!
// 1. Let result be the empty string.
let result = ''

// 2. While position doesn’t point past the end of input and the
// code point at position within input meets the condition condition:
while (position.position < input.length && condition(input[position.position])) {
  // 1. Append that code point to the end of result.
  if (inputIsString) {
    result += input[position.position]
If inputIsString, you might skip the while loop and provide a fast path for this function.
We can't skip the loop entirely; we still need to increase the index and check that each character matches the condition.
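For context, the collector under discussion looks roughly like this. This is a hedged sketch following the spec-style comments above; the real undici helper differs in its details:

```javascript
// Sketch of a "collect a sequence of code points" helper. `position` is a
// mutable { position } holder so the caller sees how far the scan advanced.
function collectASequenceOfCodePoints (condition, input, position) {
  // 1. Let result be the empty string.
  let result = ''

  // 2. While position doesn't point past the end of input and the code
  // point at position meets condition, append it and advance. The loop
  // can't be skipped even for strings: the index must still advance and
  // each character must still be tested against condition.
  while (position.position < input.length && condition(input[position.position])) {
    result += input[position.position]
    position.position++
  }

  // 3. Return result.
  return result
}

const pos = { position: 0 }
const token = collectASequenceOfCodePoints((c) => c !== ';', 'name=foo;rest', pos)
console.log(token) // 'name=foo'
console.log(pos.position) // 8
```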
  }
}

return false
Would chunk.find be faster? Inlined functions are faster most of the time due to aggressive V8 optimizations.
I doubt it; you can't specify a position to start at, so I'd have to split the buffer first.
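Worth noting for the buffer path: unlike `Array.prototype.find`, `Buffer#indexOf` does accept a `byteOffset` argument, so a scan can resume mid-buffer without slicing first. A small demonstration:

```javascript
// Buffer#indexOf takes an optional byteOffset, which lets a boundary scan
// continue from a previous match without creating intermediate slices.
const buf = Buffer.from('aa--boundary--bb--boundary--cc')

const first = buf.indexOf('--boundary')            // scan from the start
const second = buf.indexOf('--boundary', first + 1) // resume past the first hit

console.log(first, second) // 2 16
```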
I'm much more interested in replacing the asynchronous parsing with a synchronous parser. Not only is it actually spec compliant, but it will finally clean up the body consumers. It doesn't make sense to parse a multipart/form-data body asynchronously if it's already in memory. It's also many times slower than Deno/bun (https://twitter.com/jarredsumner/status/1625067963656335361). cc @jimmywarting I know you disagree with this; do you have any counterarguments before I replace busboy with the proposed alternative?
I don't think a synchronous multipart parser would work for us long term.
This doesn't seem to be an issue with Deno/bun (or any of the browsers, although their use case is different). There is also a proposal to add a limit to the max body size a FormData can have (whatwg/fetch#1592), and it's already an option on undici's Agent. There are alternative APIs in undici that are much better suited for large bodies. The main users of .formData IIRC are library authors who want cross-env compatibility; I wonder how many users are actively using it? One more note: it's incredibly slow. https://twitter.com/jarredsumner/status/1625067963656335361
There's also the issue of how a multipart/form-data body should be parsed: there's no actual spec for it. The WPT coverage sucks as well, while it's really good for every other mixin. Personally I'd like to follow the spec and the other platforms, which have all implemented the parsing synchronously. See #1694 for proof.
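As a rough illustration of why synchronous parsing is simple once the body is fully buffered, here is a minimal, deliberately incomplete boundary-splitting sketch. It is not undici's parser: it skips part-header parsing, quoting, preamble handling, and error cases, and the name `splitParts` is invented:

```javascript
// Split an already-buffered multipart body into its raw parts, synchronously.
function splitParts (body, boundary) {
  const delim = Buffer.from(`--${boundary}`)
  const parts = []
  let start = body.indexOf(delim)

  while (start !== -1) {
    const next = body.indexOf(delim, start + delim.length)
    if (next === -1) break // closing delimiter reached, nothing more to emit
    // Skip the delimiter plus the CRLF after it; trim the CRLF before the
    // next delimiter.
    parts.push(body.subarray(start + delim.length + 2, next - 2))
    start = next
  }

  return parts
}

const body = Buffer.from('--B\r\npart one\r\n--B\r\npart two\r\n--B--\r\n')
console.log(splitParts(body, 'B').map((p) => p.toString()))
// [ 'part one', 'part two' ]
```

Because the whole body is in memory, there is no waiting on chunk arrival, no partial-boundary bookkeeping, and no asynchronous state machine.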
I don't think we could ever be in agreement on this one. Crashing with out-of-memory errors when receiving files is not safe. Last time I checked, it seemed that Chrome used disk-based Blobs for at least some of those cases. The difference is that Node.js has just landed support for disk-backed Blobs, so we will soon be able to use this ;).
I agree that disk-based blobs/files would be the best option and that async streaming is best for not hitting the RAM limit. I would rather have something that works and runs a bit slower than (as @mcollina puts it) "not ship a feature at all" with the risk of out-of-memory issues.

It would also be technically possible to write the whole formdata payload to one single file first and then slice it (virtually, by just changing the offset + size of where it should start/stop reading from):

const blob = fs.openAsBlob(path) // could include all formdata entries
const file1 = new File([blob.slice(100, 1024)], entry.name, { type: entry.mimeType })

So if you want to write all the data to disk first and then synchronously search the file after each boundary to look up where each file entry begins/ends, that could also be an option. Currently undici's url-encoded formdata decoder is synchronous, so I'm wondering whether this (https://twitter.com/jarredsumner/status/1625067963656335361) is measuring a URL-encoded payload or a multipart-encoded payload, because those are two completely different paths to solving the issue at hand.
If we are going to document the downsides of formdata, then we shouldn't provide them with an alternative solution for how they should do it themselves using busboy and streams; it should just work as it's intended to work. So if we should document anything at all, then it should be: …
I agree that we probably won't find middle ground for this 😄.
That's partially the issue: it doesn't work as intended, if by "intended" we're referring to the spec and/or user expectations. There are other slightly less noticeable issues and pretty crappy workarounds we're doing already to match the spec (see lines 461 to 462 in 06f77a9).
Anyways, this branch is mostly done: it passes all of busboy's tests and the same set of WPTs that are enabled. I'll finish up the docs eventually, and I think we're good to go? Although I've been holding off because I know there will inevitably be issues that I won't have motivation to fix... Sticking with busboy is also problematic because it doesn't seem well maintained; there are a number of issues and pull requests open. Plus I've already spent many hours working on this lol
dd02fe5 to e21de08 (compare)
@KhafraDev What's the status on this? Are we using file-backed Blobs?
It's mostly done, I just need to spend a couple hours adding docs and cleaning stuff up. Is there a way to opt in to using the file-backed Blobs, or is it automatic?
Perhaps if … EDIT: I tested, and if the response was over a certain size it would be offloaded to the disk. I suppose the same logic could be done with FormData: if it encounters a file, start reading n bytes; if it's larger than that, then dump it to the file system and pipe the rest of the data to that file.
@KhafraDev did you give up on this?
Yes, the parsing should be done synchronously to match the spec.
This PR implements a multipart/form-data parser with a 1:1 match to busboy's API. This took me about a week and many, many hours to complete. I believe docs and types are still needed, but I don't feel like adding them right now.
Tests from busboy are included, and every single test works without any modifications!
c8/nyc was used for coverage in the busboy tests (busboy uses vanilla node for testing).

Bug(s):
- path.basename returns different values on Windows & Linux (C:\\files\\1k_b.dat) (fixed in 56052ec)
- reading headers is extremely slow (fixed in b7b9645; parsing headers should now be O(n))
- in certain circumstances FileStream will not emit the 'end' event (fixed in dd02fe5)