Document that stream API needs polyfill to be used as async iterable #307

Open
kylebarron opened this issue Sep 4, 2023 · 5 comments
@H-Plus-Time
Contributor

Ah, drat. I do wonder how useful (and hopefully unobtrusive) it would be to run some of the test suite through Playwright. Happy to contribute if it's desirable (potentially side benefits for the web examples).

@kylebarron
Owner Author

Another thing I noticed was that making lots of separate requests is quite annoying, and you get latency between chunks.

One crazy idea I just had is whether it would be possible to mix the stream and async approaches... e.g. first do a range request against the end of the file for the metadata, but then do a full-file streaming request. In theory, you could make the streaming read work because you know the byte ranges of every chunk in the file. But that's probably totally incompatible with the existing Parquet Rust APIs.
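For concreteness, a very rough sketch of that mixed approach over plain `fetch` (not the actual parquet-wasm / parquet2 APIs; the footer-size guess and the way chunks get regrouped downstream are both assumptions):

```ts
// Sketch only: not the real parquet-wasm API.
// (1) Suffix range request for the footer/metadata at the end of the file.
async function fetchFooterBytes(url: string, footerGuess = 64 * 1024): Promise<ArrayBuffer> {
  // "bytes=-N" asks for the last N bytes, which should cover the Parquet
  // footer plus the 8-byte length/magic trailer for most files.
  const resp = await fetch(url, { headers: { Range: `bytes=-${footerGuess}` } });
  return resp.arrayBuffer();
}

// (2) One streaming request for the whole file; the consumer would carve this
// stream into row groups / column chunks using the byte offsets from the footer.
async function* streamFile(url: string): AsyncGenerator<Uint8Array> {
  const resp = await fetch(url);
  const reader = resp.body!.getReader();
  while (true) {
    const { done, value } = await reader.read();
    if (done) return;
    yield value;
  }
}
```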

@H-Plus-Time
Contributor

Yeah, I noticed that too - if I'm reading parquet2's source correctly, it does $n_{\text{row groups}} \cdot m_{\text{flattened fields}}$ requests (struct and fixed size list fields appear to get their own requests, and presumably so do the dictionary blocks). Toning that down to just one request per row group might be worthwhile. One thing - I noticed in that observable that UScounties.parquet ends up being served over HTTP/1.1, which is probably worsening the problem (HTTP/2 would still be subject to the inter-row-group latency).

@kylebarron
Owner Author

it does $n_{\text{row groups}} \cdot m_{\text{flattened fields}}$ requests (struct and fixed size list fields appear to get their own requests, and presumably so do the dictionary blocks).

Oh yikes! I didn't notice that before. Indeed, looking at that observable example, I see 58 total requests to the parquet file! I'm guessing the first two are for the metadata, and there are 7 row groups in the file, so that adds up to 8 requests per row group, where there are 6 columns plus the geometry column (the eighth request per row group presumably being one of the extra dictionary or nested-field reads mentioned above).

Presumably parquet2 was designed for a cloud environment where latency is assumed to be quite cheap.

Toning that down to just one request per row group might be worthwhile.

Yeah that seems ideal.

@H-Plus-Time
Contributor

H-Plus-Time commented Sep 15, 2023

Update on the HTTP/1.1 vs HTTP/2 point, and Cloudflare's range requests:

So it looks like HTTP/2 does help a fair bit (about 9±2 seconds vs 20±2 seconds), but it's still roughly 3x slower than one big request (~2.2±0.5 seconds on a 50 Mb/s link, more or less full saturation).

Those requests are also serial at the column level (in addition to the row-group-serialized behaviour we expect, given this is a pull stream). Calling readRowGroupAsync a bunch of times without awaiting (i.e. so you get an array of Promises) gets you row-group-parallel, column-serial behaviour, which might improve things a bit (wrap that in an async generator and loop through your array of promises, awaiting each in sequence, and you have a rough approximation of a push-style stream).
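A minimal sketch of that push-style approximation, assuming you already hold an array of un-awaited readRowGroupAsync promises (the surrounding API is not shown):

```ts
// All row-group requests are already in flight; this only serializes the
// order in which results are consumed.
async function* toOrderedStream<T>(pending: Array<Promise<T>>): AsyncGenerator<T> {
  for (const promise of pending) {
    yield await promise;
  }
}

// Hypothetical usage: kick off every read first, then consume in row-group order.
// const pending = rowGroups.map((rg) => readRowGroupAsync(url, rg)); // no await here
// for await (const table of toOrderedStream(pending)) { /* ... */ }
```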

I think, given the intention is there to do column-selective reads, it would probably be worthwhile seeing whether we could rework things to dispatch a row group's column requests concurrently, or coalesce them into a single multi-range request (that requires a preflight for CORS requests, but likely that'd just be one per file).
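For reference, the coalescing idea would amount to something like the following sketch over plain fetch (parsing the multipart/byteranges response body is omitted, and the offsets are assumed to come from the footer metadata):

```ts
// Coalesce a row group's column chunks into one multi-range request.
// `ranges` are inclusive byte offsets for each column chunk in the row group.
function fetchRowGroupColumns(
  url: string,
  ranges: Array<{ start: number; end: number }>,
): Promise<Response> {
  const rangeHeader = "bytes=" + ranges.map((r) => `${r.start}-${r.end}`).join(", ");
  // A non-safelisted Range header triggers a CORS preflight, but the preflight
  // result is typically cached, so in practice it's roughly one per file rather
  // than one request per column.
  return fetch(url, { headers: { Range: rangeHeader } });
}
```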

Cloudflare R2's range requests also appear to be inordinately slow in several ways:

  1. large (say, 20% of the file) slices toward the end of the file frequently took as long or longer than the full request :-/.
  2. ~300ms server-initiated waits every 4th request.
  3. Glacial transmission rates on very small slices (300ms for ~2kB - so around 6kB/s)

Unusual, given Cloudflare's usually stellar perf - I'm inclined to believe the situation hasn't changed since 2021 (see The Impossibility of Perfectly Caching HTTP Range Requests for an interesting write-up and comparison of the major CDN players' approaches), and that they're effectively rounding range requests up to the entire file.

Doing the same requests in parallel (all columns, all row groups) brings the total time down to that of the full-size request (because each request hits a different Cloudflare CDN worker), though the lack of ordering means the first large column is usually the last to finish (presumably something can be done with the fetch priority flag).
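A sketch of that parallel dispatch, including the fetch priority hint (a Chromium-only Priority Hints feature, ignored elsewhere); the "first range gets high priority" scheme here is purely illustrative:

```ts
// `priority` isn't in every lib.dom typing yet, so widen RequestInit locally.
type PriorityInit = RequestInit & { priority?: "high" | "low" | "auto" };

// Fire every range request at once; hint that the first (typically largest)
// column chunk should be fetched first.
function fetchAllRangesInParallel(
  url: string,
  ranges: Array<{ start: number; end: number }>,
): Array<Promise<Response>> {
  return ranges.map((r, i) => {
    const init: PriorityInit = {
      headers: { Range: `bytes=${r.start}-${r.end}` },
      priority: i === 0 ? "high" : "auto",
    };
    return fetch(url, init);
  });
}
```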
