Projection Expressions #1782
Comments
One thing we currently lack is the ability to extract the referenced fields, in order, from an expression so that children can be pruned. This will require a new transformation that extracts them from every expression. |
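To make the idea concrete, here is a minimal sketch of such a transformation over a toy expression tree. The `Expr` enum, `FieldRefs`, and `referenced_fields` are hypothetical names for illustration only and are not the vortex-expr API. The walk records each top-level field in first-reference order, and reports that everything is needed when the expression uses `Identity` directly.

```rust
/// Hypothetical expression tree (an assumption, not the actual vortex-expr types).
#[derive(Debug, Clone)]
enum Expr {
    /// The array currently in scope.
    Identity,
    /// Access a named field of a struct-typed child expression.
    GetItem(Box<Expr>, String),
    /// Assemble a struct from named child expressions.
    Pack(Vec<(String, Expr)>),
    /// A binary comparison, standing in for any non-struct operator.
    Gt(Box<Expr>, Box<Expr>),
    /// A constant that references no fields.
    Literal(i64),
}

/// What an expression needs from the array in scope.
#[derive(Debug, PartialEq)]
enum FieldRefs {
    /// The whole array is used, so no child can be pruned.
    All,
    /// Only these top-level fields are used, in first-reference order.
    Named(Vec<String>),
}

fn referenced_fields(expr: &Expr) -> FieldRefs {
    let mut fields = Vec::new();
    if collect(expr, &mut fields) {
        FieldRefs::All
    } else {
        FieldRefs::Named(fields)
    }
}

/// Appends referenced top-level fields to `out`.
/// Returns true when the expression uses a bare `Identity`.
fn collect(expr: &Expr, out: &mut Vec<String>) -> bool {
    match expr {
        Expr::Identity => true,
        // `GetItem(Identity, name)` touches exactly one top-level field.
        Expr::GetItem(child, name) if matches!(**child, Expr::Identity) => {
            if !out.contains(name) {
                out.push(name.clone());
            }
            false
        }
        // A nested access only needs whatever its child needs.
        Expr::GetItem(child, _) => collect(child, out),
        Expr::Pack(fields) => {
            let mut needs_all = false;
            for (_, child) in fields {
                needs_all |= collect(child, out);
            }
            needs_all
        }
        Expr::Gt(lhs, rhs) => {
            let left = collect(lhs, out);
            let right = collect(rhs, out);
            left || right
        }
        Expr::Literal(_) => false,
    }
}
```

A reader that owns one child per field could keep only the children named in `FieldRefs::Named` and fall back to reading everything on `FieldRefs::All`.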
Vortex expressions currently lack Pack, GetItem, etc. They do have Currently, only the chunked layout reader prunes. It receives an already-projected expression from its parent (currently always a columnar layout reader). I don't think we have any tests or benchmarks with struct arrays that have struct-typed fields, so, IIUC, the pruner never sees expressions with AFAICT, when writing
Maybe useful: I think the layout readers were already written to support this use case. Layouts are created from LayoutDescriptors using this function:
A
The |
It seems a little surprising as a user to have an array with the following structure but which doesn't support column-projection on the nested columns. The inner struct is written as a flat layout.
I think we could add a flattened-columnar layout with not much effort. It stores the true dtype. It delegates to columnar layout for reading. It passes down an expression like Due to #710 this would lose the nullability on both the outer struct and the |
As for (2), Marko had a working prototype of this from our hackathon project: I'm not sure how the GROUP BY clause is handled by DF. The only GROUP BY clause I see that is not a plain column identifier is in ClickBench query 42: |
This isn't quite true of vortex-file. It only supports struct arrays and even then drops their validity (#710). It's maybe less effort to change the writer once to support all three of: (1) nested column-projection, (2) struct validity, and (3) non-struct arrays. |
Here's a worklist proposal:
|
Just before we dive in, we should also think about how the behavior of this works as part of scanning in a vortex-layout world. I'll try to push up some scaffolding so we have something concrete to discuss on that front. |
I think there are a few things missing from the work list. a. Provide a way to find the result type of an expression. This could be a default method that creates an empty array and pushes it through the system, or it could be a new method Also, it might be time to review the Vortex expression definition; for example, we might want an enum definition like DataFusion's logical Expr. |
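As a rough sketch of the second option in (a), a dedicated method could compute the result type directly from the dtype in scope, without building an empty array. Everything here is hypothetical: the toy `DType` and `Expr` enums and the `return_dtype` name are assumptions for illustration, not existing Vortex APIs.

```rust
/// Toy dtype (an assumption; the real Vortex DType is richer and tracks nullability).
#[derive(Debug, Clone, PartialEq)]
enum DType {
    I64,
    Bool,
    Struct(Vec<(String, DType)>),
}

/// Hypothetical enum-style expression definition.
#[derive(Debug, Clone)]
enum Expr {
    Identity,
    GetItem(Box<Expr>, String),
    Pack(Vec<(String, Expr)>),
    Gt(Box<Expr>, Box<Expr>),
}

/// Computes the dtype an expression would produce for an input of dtype `scope`.
fn return_dtype(expr: &Expr, scope: &DType) -> Result<DType, String> {
    match expr {
        Expr::Identity => Ok(scope.clone()),
        Expr::GetItem(child, name) => match return_dtype(child, scope)? {
            DType::Struct(fields) => fields
                .into_iter()
                .find(|(field_name, _)| field_name == name)
                .map(|(_, dtype)| dtype)
                .ok_or_else(|| format!("no field named {name}")),
            other => Err(format!("{other:?} has no fields")),
        },
        Expr::Pack(fields) => {
            let mut out = Vec::with_capacity(fields.len());
            for (name, child) in fields {
                out.push((name.clone(), return_dtype(child, scope)?));
            }
            Ok(DType::Struct(out))
        }
        Expr::Gt(lhs, rhs) => {
            let left = return_dtype(lhs, scope)?;
            let right = return_dtype(rhs, scope)?;
            if left == right {
                Ok(DType::Bool)
            } else {
                Err(format!("cannot compare {left:?} with {right:?}"))
            }
        }
    }
}
```

Matching on an enum keeps the type rules for every expression kind in one place, which is one argument for the enum-style definition mentioned above.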
I think we will be able to push casts down into decoding. Casts happen in many places, e.g. a cast before avg. |
Initial implementation of the new structure of vortex layouts per #1676

* Only flat layout works.
* I'm not 100% sure on the trait APIs, these will evolve as we pad out the implementation.
* StructLayout will be worked on as part of #1782 so will probably come last.

Up next:

* Implementation of ChunkedLayout

Open Questions:

* What is the API that e.g. Python users have to precisely configure layout strategies? Can I override a layout writer for a specific field?
* Similarly, how can we configure the layout scanners? Can I configure a level 0 chunked layout differently from level 2 in a chunk-of-struct-of-chunk world?
Vortex doesn't have the concept of tables, just arrays that can but don't need to be of a struct dtype.
Therefore representing the projection of a scan in the traditional way (integer indexing of possibly flattened columns) doesn't make much sense.
Instead, Vortex scans should support arbitrary projection expressions. The Vortex expression library includes an "Identity" expression that refers to the array currently in scope, and then `getitem`, `pack`, and other such struct expressions for accessing or assembling struct fields.

A `select *` projection would turn into `Identity`, and a `select a, b` would turn into `Pack{a: GetItem(Identity, "a"), b: GetItem(Identity, "b")}`.
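The mapping above can be sketched with a toy model. The `Value` type stands in for an array, and `Expr`, `get_item`, `select`, and `evaluate` are hypothetical names chosen for illustration; none of this is the actual vortex-expr API.

```rust
use std::collections::BTreeMap;

/// Toy stand-in for an array: a primitive column or a struct of named fields.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Ints(Vec<i64>),
    Struct(BTreeMap<String, Value>),
}

/// Hypothetical projection expression with the three shapes named above.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    /// The array currently in scope.
    Identity,
    /// Access a named field of a struct-typed child expression.
    GetItem(Box<Expr>, String),
    /// Assemble a new struct from named child expressions.
    Pack(Vec<(String, Expr)>),
}

fn get_item(child: Expr, field: &str) -> Expr {
    Expr::GetItem(Box::new(child), field.to_string())
}

/// Builds the `select a, b` style projection: a Pack of GetItem(Identity, name).
fn select(fields: &[&str]) -> Expr {
    Expr::Pack(
        fields
            .iter()
            .map(|field| (field.to_string(), get_item(Expr::Identity, field)))
            .collect(),
    )
}

/// Evaluates an expression against the value currently in scope.
fn evaluate(expr: &Expr, scope: &Value) -> Result<Value, String> {
    match expr {
        Expr::Identity => Ok(scope.clone()),
        Expr::GetItem(child, field) => match evaluate(child, scope)? {
            Value::Struct(mut fields) => fields
                .remove(field)
                .ok_or_else(|| format!("no field named {field}")),
            other => Err(format!("cannot get field {field} from {other:?}")),
        },
        Expr::Pack(fields) => {
            let mut out = BTreeMap::new();
            for (name, child) in fields {
                out.insert(name.clone(), evaluate(child, scope)?);
            }
            Ok(Value::Struct(out))
        }
    }
}
```

In this model, evaluating `Expr::Identity` returns the scope unchanged, which is the `select *` case, while `select(&["a", "b"])` yields a struct containing only those two fields.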
A couple of nice properties of this: