Document that stream API needs polyfill to be used as async iterable #307
Comments
Ah, drat. I do wonder how useful (and hopefully unobtrusive) running some of the test suite through Playwright would be. Happy to contribute if it's desirable (with potential side benefits for the web examples).
Another thing I noticed was that making lots of separate requests is quite annoying and you get latency between chunks. One crazy idea I just had is whether it would be possible to mix the stream and async approaches, e.g. first do an end-of-file range request for the metadata, then do a full-file streaming request. In theory, you could make the streaming read work because you know the byte ranges of every chunk in the file. But that's probably totally incompatible with the existing parquet Rust APIs.
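To make that hybrid idea a bit more concrete, here's a minimal TypeScript sketch. It assumes a hypothetical `parseFooter` helper, a simplified `ChunkRange` metadata shape, and a 64 KiB footer suffix; none of these are part of the actual library API, they're just stand-ins for illustration.

```ts
// Sketch only: parseFooter, ChunkRange, and the 64 KiB suffix size are all
// assumptions for illustration, not the real parquet-wasm/parquet2 API.
interface ChunkRange {
  offset: number; // absolute byte offset of the column chunk in the file
  length: number; // compressed length of the chunk in bytes
}

declare function parseFooter(footer: Uint8Array): ChunkRange[];

async function* hybridRead(url: string): AsyncGenerator<number> {
  // 1. One suffix range request for the footer/metadata.
  const tail = await fetch(url, { headers: { Range: "bytes=-65536" } });
  const ranges = parseFooter(new Uint8Array(await tail.arrayBuffer()))
    .sort((a, b) => a.offset - b.offset);

  // 2. One streaming request for the full file.
  const resp = await fetch(url);
  const reader = resp.body!.getReader();

  // 3. Track how many bytes have arrived and yield the index of each chunk
  //    as soon as its byte range has been fully downloaded.
  let received = 0;
  let next = 0;
  while (next < ranges.length) {
    const { done, value } = await reader.read();
    if (done) break;
    received += value.length;
    while (
      next < ranges.length &&
      received >= ranges[next].offset + ranges[next].length
    ) {
      yield next++; // in practice you'd hand the buffered bytes to the decoder here
    }
  }
}
```

The appeal is that there's only one streaming request plus one small range request for the metadata, at the cost of downloading every column whether or not it's selected.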
Yeah, I noticed that too - if I'm reading parquet2's source correctly, it does
Oh yikes! I didn't notice that before. Indeed, looking at that Observable example, I see 58 total requests to the parquet file. I'm guessing the first two are for the metadata, and there are 7 row groups in the file, so that works out to (58 − 2) / 7 = 8 requests per row group, where there are 6 columns plus the geometry column. Presumably parquet2 was designed for a cloud environment where latency is assumed to be quite cheap.
Yeah, that seems ideal.
Update on the HTTP/1.1 vs HTTP/2 point, and Cloudflare's range requests: so it looks like HTTP/2 does help a fair bit (about 9±2 seconds vs 20±2 seconds), but it's still ~3x slower than one big request (~2.2±0.5 seconds on a 50 Mb/s link, i.e. more or less full saturation).

Those requests are also serial at the column level (in addition to the row-group-serialized behaviour we expect, given this is a pull stream). Calling readRowGroupAsync a bunch of times without awaiting (i.e. you get an array of Promises) gets you row-group parallel, column serial, which might improve things a bit; wrap that in an async generator and loop through your array of promises, awaiting each in sequence, and you have a rough approximation of a push-style stream (rough sketch below).

Given the intention is there to do column-selective reads, I think it would probably be worthwhile seeing if we could rework things to dispatch a row group's column requests concurrently, or coalesce them into a single multi-range byte request (that requires a preflight for CORS requests, but likely that'd just be one per file).

Cloudflare R2's range requests also appear to be inordinately slow in several ways.
Unusual, given Cloudflare's usually stellar perf - I'm inclined to believe the situation hasn't changed since 2021 (see "The Impossibility of Perfectly Caching HTTP Range Requests" for an interesting write-up and comparison of the major CDN players' approaches), and that they're rounding up to the entire file. Doing the same requests in parallel (all columns, all row groups) brings the total time down to that of the single full-size request (because every request hits a different Cloudflare CDN worker), though the lack of ordering means the first large column is usually the last to finish (presumably something can be done with the fetch priority flag).
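A rough sketch of the "dispatch everything, await in order" trick described above; the readRowGroupAsync signature here is illustrative only, the real API's arguments will differ.

```ts
// Illustrative signature; the actual readRowGroupAsync takes different arguments.
declare function readRowGroupAsync(url: string, rowGroup: number): Promise<Uint8Array>;

async function* rowGroupStream(
  url: string,
  numRowGroups: number
): AsyncGenerator<Uint8Array> {
  // Fire off every row-group request up front: row-group parallel, column serial.
  const pending = Array.from({ length: numRowGroups }, (_, i) =>
    readRowGroupAsync(url, i)
  );

  // Await them in order so consumers still see row groups in sequence -
  // the rough approximation of a push-style stream described above.
  for (const p of pending) {
    yield await p;
  }
}

// Usage, e.g. for the 7-row-group file discussed above:
// for await (const batch of rowGroupStream(url, 7)) { /* hand batch to Arrow */ }
```

The fetch priority flag mentioned above could plausibly be layered on top by passing priority hints to the underlying requests, though browser support for the fetch priority option currently varies.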
Ref https://observablehq.com/d/f5723cea6661fb71, https://developer.mozilla.org/en-US/docs/Web/API/ReadableStream#browser_compatibility, https://jakearchibald.com/2017/async-iterators-and-generators/#a-shorter-implementation
cc @H-Plus-Time
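For the documentation ask itself, the kind of shim described in the linked Jake Archibald article looks roughly like this; a sketch of the pattern, not necessarily the exact snippet the docs should recommend. It patches Symbol.asyncIterator onto ReadableStream.prototype so `for await` works in browsers that haven't shipped native async iteration of streams.

```ts
// Minimal async-iteration shim for ReadableStream, adapted from the pattern in
// the linked article. Only installed when the browser lacks native support.
if (!(Symbol.asyncIterator in ReadableStream.prototype)) {
  (ReadableStream.prototype as any)[Symbol.asyncIterator] = async function* (
    this: ReadableStream<unknown>
  ) {
    const reader = this.getReader();
    try {
      while (true) {
        const { done, value } = await reader.read();
        if (done) return;
        yield value;
      }
    } finally {
      // Release the lock so the stream can still be cancelled or re-read on early exit.
      reader.releaseLock();
    }
  };
}

// With the shim in place, the stream API can be consumed as an async iterable:
// for await (const chunk of someRecordBatchStream) { /* ... */ }
```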