Write to parquet file? #6
Comments
No plans for writing parquet files at this time. I could be convinced otherwise, but generally I feel that if you are creating parquet files, you are more likely to be in a backend environment, so it makes sense to use existing parquet libraries in Python, C++, or Rust. What I really want with this library is to make it easy to view parquet data in the browser, since there was no good library for decoding parquet files in JavaScript that was lightweight and could handle remote files efficiently. You might like the work of @kylebarron on parquet-wasm. Hope you find what you need!
Generally agree that "webassembly" and "lightweight" are not synonyms, but there's no technical blocker to handling remote files efficiently in parquet-wasm. In the latest release you're able to fetch individual row groups or columns from a Parquet file without downloading the entire file. And we could implement something like pyarrow's |
I'm creating an offline application (a local JS server and a web application, using Electron) that stores transactional data locally in the backend/server and then uploads it to S3 to be analyzed with cloud-native tools such as Athena and QuickSight. I'm looking for a lightweight library to read and write Parquet files, and your library ticks all the boxes except for the write function.
You can use WebAssembly in Electron, so parquet-wasm should work out of the box.
I know. All the data handling happens in the server, which will take care of persisting files and such. The client will be very "stupid" and just display stuff.
From: Kyle Barron ***@***.***>
Sent: Monday, April 29, 2024 4:39:59 PM
To: hyparam/hyparquet ***@***.***>
Cc: Erik Norman ***@***.***>; Author ***@***.***>
Subject: Re: [hyparam/hyparquet] Write to parquet file? (Issue #6)
parquet-wasm ships a 5+ megabyte wasm file; hyparquet is under 100k of JavaScript. Loading can be much faster, especially for time to first render. Because hyparquet is not a compiled wasm blob, there is no need to transfer data across the wasm boundary, and no cold-start time for loading the wasm VM. I've also done some optimizations for the web: for example, if you are fetching a bunch of columns in a row group, it will fetch the data in just one HTTP request instead of multiple round trips. I'm guessing that parquet-wasm, if you can implement ranged GETs, probably doesn't coalesce the requests to save round-trip time? Huge respect for your work Kyle, I love reading your blog about parquet stuff. Definitely not knocking parquet-wasm! Just pointing out the reasons I built hyparquet. :)
That's very fair! I think it's valuable to have a pure-JavaScript implementation! My own bias is that Parquet is an absolutely perfect place for WebAssembly, because Parquet is such a complex spec with such a long tail of complexities. It's not that I don't want a pure-JS implementation; rather my own conclusion was that implementing a stable pure-JS Parquet implementation that supports all encodings and compressions would be an absolutely massive engineering effort. Most previous JS Parquet implementations were eventually abandoned. Whereas there are a ton of people building databases in Rust, so the Parquet implementation is stable, fast, and loads into a binary representation. Perhaps it's a use case where the benefits of WebAssembly outweigh the costs. So take encouragement with a hint of skepticism 🙂. If you're able to implement a stable pure-JS Parquet reader, it'll be really impressive!
1.2MB brotli-compressed 😉, but yes. We might have different use cases: you might care more about time to first render, whereas I'm more focused on handling large datasets, where 1.2MB is very small compared to the data savings from Parquet.
It does. Multiple ranges are coalesced by default. The coalesce size is currently 1MB and not configurable though.
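For readers following along, here is a rough sketch of what coalescing ranged requests can look like. It is not taken from hyparquet or parquet-wasm: `ByteRange`, `coalesceRanges`, and the sample offsets are hypothetical names and values chosen for illustration, and treating the 1MB figure as a maximum gap between ranges is an assumption.

```typescript
// Hypothetical types and names for illustration; not the API of either library.
interface ByteRange {
  start: number // inclusive byte offset
  end: number   // exclusive byte offset
}

/**
 * Merge byte ranges whose gaps are at most `maxGap` bytes, so that
 * nearby column chunks can be fetched with a single ranged request.
 */
function coalesceRanges(ranges: ByteRange[], maxGap: number): ByteRange[] {
  // Sort a copy by start offset so neighbouring ranges sit next to each other.
  const sorted = [...ranges].sort((a, b) => a.start - b.start)
  const merged: ByteRange[] = []
  for (const range of sorted) {
    const last = merged[merged.length - 1]
    if (last !== undefined && range.start - last.end <= maxGap) {
      // Close enough: extend the previous range (this also covers overlaps).
      last.end = Math.max(last.end, range.end)
    } else {
      merged.push({ ...range })
    }
  }
  return merged
}

// Three made-up column chunks: the first two are 100 bytes apart, the third is far away.
const columnChunks: ByteRange[] = [
  { start: 4, end: 1_000 },
  { start: 1_100, end: 5_000 },
  { start: 9_000_000, end: 9_500_000 },
]
// Assumed maximum gap of 1MB, used here only as an example value.
const requests = coalesceRanges(columnChunks, 1_000_000)
// One HTTP Range header value per merged range (the header's end offset is inclusive).
const headers = requests.map(r => `bytes=${r.start}-${r.end - 1}`)
```

The trade-off is that a larger gap means fewer round trips but more unneeded bytes downloaded between the chunks.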
Also note that the people in loaders.gl are also building a pure-TypeScript Parquet implementation, which I think was forked from parquets. It might be worth reaching out to them.
It's particularly valuable when we're interested only in reading the metadata.
Your use case involves reading the metadata only... but not the data?
Oh if we're talking compressed size, then hyparquet is 24.1kb compressed 😉
Yes, we just launched a Parquet metadata viewer: https://huggingface.co/datasets/HuggingFaceFW/fineweb/tree/main/data/CC-MAIN-2013-20?show_file_info=data%2FCC-MAIN-2013-20%2F000_00000.parquet It's powered by hyparquet!
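As a side note on why metadata-only reads are cheap: a Parquet file ends with a 4-byte little-endian metadata length followed by the magic bytes "PAR1", so a reader only needs the tail of the file to find the footer. Below is a minimal sketch of that step. `locateFooter` and `FooterLocation` are hypothetical names, not functions from hyparquet or parquet-wasm, and the tail bytes are synthetic.

```typescript
// Hypothetical helper for illustration; not the API of hyparquet or parquet-wasm.
interface FooterLocation {
  metadataStart: number  // offset of the first metadata byte
  metadataLength: number // length of the serialized metadata in bytes
}

/**
 * Given the last 8 bytes of a Parquet file and the total file size,
 * compute where the footer metadata lives.
 */
function locateFooter(tail: Uint8Array, fileSize: number): FooterLocation {
  if (tail.length !== 8) throw new Error('expected exactly 8 tail bytes')
  const view = new DataView(tail.buffer, tail.byteOffset, tail.byteLength)
  // Bytes 4..7 must be "PAR1"; read as a little-endian uint32 that is 0x31524150.
  if (view.getUint32(4, true) !== 0x31524150) throw new Error('not a parquet file')
  // Bytes 0..3 hold the metadata length as a little-endian uint32.
  const metadataLength = view.getUint32(0, true)
  const metadataStart = fileSize - 8 - metadataLength
  // The file also starts with a 4-byte magic, so metadata cannot begin before offset 4.
  if (metadataStart < 4) throw new Error('metadata length exceeds file size')
  return { metadataStart, metadataLength }
}

// Synthetic tail: metadata length 1000 (0x03E8, little-endian), then "PAR1".
const tail = new Uint8Array([0xe8, 0x03, 0x00, 0x00, 0x50, 0x41, 0x52, 0x31])
const footer = locateFooter(tail, 10_000)
// A reader would then request bytes metadataStart .. metadataStart + metadataLength - 1.
```

With the footer located, one further ranged request fetches the metadata itself, and no row data has to be downloaded.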
Is it possible to write to a parquet file using this library? (I quickly checked the code and didn't see any write function.)