-
Notifications
You must be signed in to change notification settings - Fork 847
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
improve: reuse Arc<dyn Array>
in parquet record batch reader.
#4864
Comments
In my opinion, we can store the primitive arrays once we read in the parquet record batch reader, and combine them into a |
The challenge is bounding the memory usage of the reader and balancing this against the cost of decoding, which for primitives is relatively low. I don't think it is as simple as "keep the arrays around", it will probably require some sort of heuristic |
We can only keep the arrays we need to output and release the others.
Besides decoding there is also decompression. And it may become a large cost for variable-length type like string.
Yes. Such as all the output arrays are contained in predicate arrays. |
That is still potentially a non-trivial amount of data, it could theoretically be an entire column chunk worth, which could easily blow your memory budget 😅 Perhaps we could add a configurable threshold of the maximum number of bytes to keep around in this way, and fallback to decoding columns again if this threshold is exceeded? |
🎣 -- I believe @XiangpengHao was thinking about working on this (I know you have a paper to write too, etc lol) |
Both in
arrow_reader
andasync_reader
, if there are predicates, the reader will first read arrays (wrapped inRecordBatch
) to evaluate the predicates and get a row selection. And then the reader will use the row selection to output the final needed arrays.If some arrays in the final output are contained in the prefetched arrays, we also have to deserialize them again. This is quite wasteful.
We should have a reasonable way to reuse the arrays.
The text was updated successfully, but these errors were encountered: