It looks like while reading a Parquet file using VariablePageSizeColumnChunkPageStore, to reach a particular page, we’re materializing every prior page sequentially. This makes the reading process extremely serial, since pages cannot be read independently in parallel.
We should check if we can avoid the materialization for pages before the target page by simply reading headers to determine the page size boundaries instead of materializing whole pages prematurely.
We can also look at widening the lock so we have less switching, too… right now it’s in extendOnePage.
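The header-only scan suggested above could look roughly like the sketch below. It is not the Deephaven implementation: it assumes a made-up fixed 8-byte page header (`payloadSize`, `numRows`) so the example stays self-contained, whereas real Parquet page headers are Thrift-encoded and variable-length. The class and method names (`HeaderScanSketch`, `scanHeaders`, `buildChunk`) are hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

class HeaderScanSketch {
    // ASSUMPTION: toy header layout [int payloadSize][int numRows].
    // Real Parquet page headers are Thrift structs, not fixed-size.
    static final int HEADER_BYTES = 8;

    /** Walks headers only; returns {payloadOffset, payloadSize, firstRow} per page. */
    static List<long[]> scanHeaders(ByteBuffer chunk) {
        List<long[]> pages = new ArrayList<>();
        int pos = 0;
        long firstRow = 0;
        while (pos + HEADER_BYTES <= chunk.limit()) {
            int payloadSize = chunk.getInt(pos);   // absolute read, payload untouched
            int numRows = chunk.getInt(pos + 4);
            pages.add(new long[] {pos + HEADER_BYTES, payloadSize, firstRow});
            firstRow += numRows;
            pos += HEADER_BYTES + payloadSize;     // skip over the payload bytes
        }
        return pages;
    }

    /** Builds a toy column chunk with zero-filled payloads, for demonstration only. */
    static ByteBuffer buildChunk(int[] payloadSizes, int[] rowCounts) {
        int total = 0;
        for (int size : payloadSizes) {
            total += HEADER_BYTES + size;
        }
        ByteBuffer chunk = ByteBuffer.allocate(total);
        for (int i = 0; i < payloadSizes.length; i++) {
            chunk.putInt(payloadSizes[i]);
            chunk.putInt(rowCounts[i]);
            chunk.position(chunk.position() + payloadSizes[i]);
        }
        return chunk;
    }

    public static void main(String[] args) {
        ByteBuffer chunk = buildChunk(new int[] {10, 20, 5}, new int[] {100, 50, 25});
        for (long[] page : scanHeaders(chunk)) {
            System.out.println("offset=" + page[0] + " size=" + page[1] + " firstRow=" + page[2]);
        }
    }
}
```

Once the boundaries are known, each page's byte range can be handed to a separate reader, so decoding no longer has to proceed strictly in order. The scan itself is still sequential, because each header's position depends on the previous page's size.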
On further inspection, it's clear that we don't materialize whole pages to reach any particular page; we just iterate over all the headers to calculate the number of rows. We are fixing this issue by introducing a new offset-index-based page store for faster page access as part of #4844.
As noted earlier, VariablePageSizeColumnChunkPageStore does not materialize full pages but just the page headers.
Also, going forward, we are using OffsetIndexBasedColumnChunkPageStore whenever an offset index is present in the file, since it provides faster page access.
So we don't have any immediate need or clear path to optimize VariablePageSizeColumnChunkPageStore any further.
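To illustrate why an offset index gives faster page access: the Parquet offset index records, for each page, its file offset, compressed size, and first row index, so the page holding a given row can be found by binary search instead of walking every header. The sketch below shows that lookup only. It is an assumption about the general technique, not the code in OffsetIndexBasedColumnChunkPageStore, and `OffsetIndexLookupSketch` and `pageForRow` are hypothetical names.

```java
import java.util.Arrays;

class OffsetIndexLookupSketch {
    /**
     * Returns the index of the page containing {@code row}.
     * ASSUMPTION: firstRowIndex is ascending and starts at 0, mirroring the
     * first_row_index values stored per page in a Parquet offset index.
     */
    static int pageForRow(long[] firstRowIndex, long row) {
        if (firstRowIndex.length == 0 || row < 0) {
            throw new IllegalArgumentException("row " + row + " is not covered by the index");
        }
        int i = Arrays.binarySearch(firstRowIndex, row);
        // Exact hit: row is the first row of page i.
        // Miss: binarySearch returns -(insertionPoint) - 1, and the containing
        // page is the one just before the insertion point.
        return i >= 0 ? i : -i - 2;
    }

    public static void main(String[] args) {
        long[] firstRowIndex = {0, 100, 150};
        System.out.println(pageForRow(firstRowIndex, 120)); // row 120 falls in page 1
    }
}
```

With the page index in hand, the matching offset and compressed size identify the exact byte range to read, so any page can be fetched independently of the ones before it.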