Example of writing a parquet file #232

Open
the80srobot opened this issue Dec 11, 2023 · 1 comment

Comments

@the80srobot

I'm trying to use this crate to write a parquet file, which seems like the road less traveled. None of the examples in this codebase actually generate a row group. I wonder if anyone's doing that, or if I'm the first one trying?

The API definitely seems like it hasn't been used much for writing, and there are a lot of sharp edges. For example, it looks like the crate wants to enable recycling of page buffers: the Compressor reuses buffers from the pages passed to it, and it can theoretically return them again from into_inner, but the data is moved into FileWriter::write and can't be recovered.

Does anyone have examples of successfully using this crate to write a parquet file without doing a lot of allocations?

@the80srobot
Author

I have code structured like this:

  • Build a Page by appending primitive values of NativeType or str.
  • Build a Vec of those. (Buffers get moved.)
  • Compress them using Compressor to get an iterator of CompressedPage, i.e. a column chunk. (Page buffers get reused.)
  • Build a DynIterator over the columns to build a row group.
  • Pass the row group to the FileWriter.
  • THIS IS WHERE THE INEFFICIENCY IS: there is no way to recover the original buffers after passing the row group to the FileWriter. To build the next row group you have to start over with fresh allocations.
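
To make the ask concrete, here is a minimal sketch of the buffer-recycling shape I'd like to end up with. It uses only the standard library and does not call this crate at all: BufferPool, build_pages and write_row_group are hypothetical names I made up for illustration, and the "pages" are plain Vec<u8> rather than real Page or CompressedPage values. The point is only that the write step takes the buffers by value and then hands them back instead of dropping them.

```rust
use std::io::{self, Write};

/// Hypothetical pool of reusable byte buffers (not part of any parquet crate).
#[derive(Default)]
pub struct BufferPool {
    free: Vec<Vec<u8>>,
}

impl BufferPool {
    /// Hand out an empty buffer, reusing an earlier allocation when one is available.
    pub fn take(&mut self) -> Vec<u8> {
        self.free.pop().unwrap_or_default()
    }

    /// Return a buffer to the pool. `clear` drops the contents but keeps the capacity.
    pub fn give_back(&mut self, mut buf: Vec<u8>) {
        buf.clear();
        self.free.push(buf);
    }
}

/// Stand-in for the page-building step: one byte buffer per column,
/// filled with the little-endian encoding of the values.
pub fn build_pages(columns: &[&[i32]], pool: &mut BufferPool) -> Vec<Vec<u8>> {
    columns
        .iter()
        .map(|values| {
            let mut buf = pool.take();
            for v in values.iter() {
                buf.extend_from_slice(&v.to_le_bytes());
            }
            buf
        })
        .collect()
}

/// Stand-in for the file-writing step: takes the pages by value, writes them,
/// then gives every buffer back to the pool instead of dropping it.
pub fn write_row_group<W: Write>(
    out: &mut W,
    pages: Vec<Vec<u8>>,
    pool: &mut BufferPool,
) -> io::Result<()> {
    for page in pages {
        out.write_all(&page)?;
        pool.give_back(page);
    }
    Ok(())
}
```

With this shape, calling build_pages and then write_row_group in a loop over row groups only allocates page buffers for the first row group (as long as later pages fit in the existing capacity). The outer Vec of pages is still allocated each time in this sketch; that could be pooled the same way.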
