This is where we'll discuss adding a Query API to allow users to efficiently retrieve the data they're looking for. Users can perform column chunk/page-level predicate pushdown manually using column and offset indexes, but this is quite labor-intensive and requires some knowledge of how parquet files work.
Ideally, users should be able to write simple queries that are analyzed and used to construct efficient query plans/predicate functions which evaluate column chunk and page statistics. We could also try to optimize data fetching, especially over the network, by making multirange queries (when supported or specified via a flag) and concatenating requests for (nearly) adjacent byte ranges.
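To make the range-concatenation idea concrete, here is a minimal sketch of how (nearly) adjacent byte ranges could be merged before fetching. The `{ start, end }` shape (with `end` exclusive), the `maxGap` parameter, and both function names are assumptions for illustration, not part of any existing API.

```javascript
/**
 * Merge byte ranges whose gap is at most maxGap bytes.
 * Ranges are { start, end } with end exclusive (hypothetical shape).
 */
function coalesceRanges(ranges, maxGap = 0) {
  const sorted = [...ranges].sort((a, b) => a.start - b.start)
  const merged = []
  for (const range of sorted) {
    const last = merged[merged.length - 1]
    if (last && range.start - last.end <= maxGap) {
      // Close enough: extend the previous request instead of issuing a new one
      last.end = Math.max(last.end, range.end)
    } else {
      merged.push({ start: range.start, end: range.end })
    }
  }
  return merged
}

/**
 * Build an HTTP Range header value for a multi-range request.
 * HTTP byte positions are inclusive, so the exclusive end is decremented.
 */
function toRangeHeader(ranges) {
  return 'bytes=' + ranges.map(r => `${r.start}-${r.end - 1}`).join(',')
}
```

With `maxGap = 16`, the ranges `[100, 200)`, `[0, 50)` and `[210, 300)` would collapse to two requests, `[0, 50)` and `[100, 300)`. A larger `maxGap` trades a few wasted bytes for fewer round trips, which is usually the better deal over a network.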
User-defined predicate functions are another option (potentially just a lower-level API), and they can be implemented with fairly little effort. They would allow users to define arbitrary predicate logic and would also be a valid way of implementing different query frontends (e.g., the simple structured queries discussed earlier).
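As a rough illustration of a query frontend layered on predicate functions, the sketch below compiles a simple structured query into a predicate over statistics. The query shape `{ column, op, value }`, the per-column `{ min, max }` statistics shape, and the name `compileQuery` are all hypothetical.

```javascript
/**
 * Compile a structured query into a statistics predicate.
 * The predicate returns true when a chunk MIGHT contain matching rows,
 * and false only when the statistics rule it out.
 * Query and statistics shapes are assumptions for illustration.
 */
function compileQuery({ column, op, value }) {
  return (stats) => {
    const s = stats[column]
    // Without statistics the chunk cannot be skipped safely
    if (!s || s.min === undefined || s.max === undefined) return true
    switch (op) {
      case '=': return s.min <= value && value <= s.max
      case '<': return s.min < value
      case '<=': return s.min <= value
      case '>': return s.max > value
      case '>=': return s.max >= value
      default: throw new Error(`unsupported operator: ${op}`)
    }
  }
}

// Usage: keep only chunks that may contain rows with age > 30
const olderThan30 = compileQuery({ column: 'age', op: '>', value: 30 })
```

Because the frontend only produces an ordinary function, users who need logic the query syntax cannot express could still pass their own predicate to the same lower-level API.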
Sorry for the delay here @park-brian. I'm ready to start revisiting pushdown predicates in hyparquet, so I have merged the parse-indicies branch into master. It will be included in the next published version.
I know you had some thoughts and maybe some additional experiments that you did. I would be very interested in ideas and pull requests related to more sophisticated query support.
Thanks @platypii, this is great news! I've also been quite busy these days, so I apologize for not getting to follow up on this effort.
I can start by creating some utility functions which use predicate functions to select row groups based on their statistics. Since row group statistics and page statistics are similar, we should be able to use the same predicate to filter both.
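A minimal sketch of what such a utility might look like, assuming statistics have already been decoded into plain `{ min, max }` objects. The function names and data shapes are hypothetical; `min_values` and `max_values` mirror the per-page arrays of the Parquet column index, simplified here to plain numbers.

```javascript
/** Return the indexes of the units (row groups or pages) whose stats pass the predicate. */
function selectByStats(statsList, predicate) {
  const selected = []
  statsList.forEach((stats, index) => {
    if (predicate(stats)) selected.push(index)
  })
  return selected
}

/** Normalize a column index (parallel per-page arrays) into per-page { min, max } objects. */
function pageStats(columnIndex) {
  return columnIndex.min_values.map((min, i) => ({ min, max: columnIndex.max_values[i] }))
}

// One predicate, used at both levels: could this unit contain the value 42?
const mayContain42 = ({ min, max }) => min <= 42 && 42 <= max

// Row group level (illustrative statistics for a single column)
const rowGroupStats = [{ min: 0, max: 9 }, { min: 10, max: 99 }, { min: 100, max: 500 }]
const rowGroups = selectByStats(rowGroupStats, mayContain42)

// Page level, inside the surviving row group
const columnIndex = { min_values: [10, 40, 70], max_values: [39, 69, 99] }
const pages = selectByStats(pageStats(columnIndex), mayContain42)
```

Here only row group 1 survives, and within it only page 1. Normalizing both levels to the same `{ min, max }` shape is what lets a single predicate serve both.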
Of course, I'll also want to fetch adjacent row groups/pages, depending on which columns the user requests. Row groups/pages are not always aligned across columns, so I'll need to take that into account too.
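One way to handle the alignment problem is to translate matching pages into row ranges and then look up which pages of each requested column overlap those rows. The sketch below assumes page locations carry a `first_row_index` (as in the Parquet offset index), simplified here to plain numbers; the function names and the `numRows` parameter are hypothetical.

```javascript
/** Row span [firstRow, endRow) covered by each page of a column within a row group. */
function pageRowSpans(pageLocations, numRows) {
  return pageLocations.map((page, i) => ({
    firstRow: page.first_row_index,
    // A page ends where the next one starts; the last page ends at the row group's end
    endRow: i + 1 < pageLocations.length ? pageLocations[i + 1].first_row_index : numRows,
  }))
}

/** Indexes of the pages that overlap the row range [rowStart, rowEnd). */
function pagesOverlapping(pageLocations, numRows, rowStart, rowEnd) {
  const overlapping = []
  pageRowSpans(pageLocations, numRows).forEach((span, i) => {
    if (span.firstRow < rowEnd && rowStart < span.endRow) overlapping.push(i)
  })
  return overlapping
}
```

For example, if the filtered column matches rows `[100, 200)` and a requested column has pages starting at rows 0, 150 and 250 in a 300-row row group, then pages 0 and 1 of the requested column need to be fetched, even though neither lines up with the matching page of the filtered column.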