Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Projection Expressions #1782

Open
gatesn opened this issue Jan 2, 2025 · 9 comments
Open

Projection Expressions #1782

gatesn opened this issue Jan 2, 2025 · 9 comments
Assignees
Labels

Comments

@gatesn
Copy link
Contributor

gatesn commented Jan 2, 2025

Vortex doesn't have the concept of tables, just arrays that can but don't need to be of a struct dtype.

Therefore representing the projection of a scan in the traditional way (integer indexing of possibly flattened columns) doesn't make much sense.

Instead, Vortex scans should support arbitrary projection expressions. The vortex expression library includes an "Identity" expression that refers to the array currently in scope, and then getitem, pack, and other such struct expressions for accessing or assembling struct fields.

A select * projection would turn into Identity, and a select a, b would turn into Pack{a: GetItem(Identity, "a"), b: GetItem(Identity, "b")}

A couple of nice properties of this:

  1. We can support pruning arbitrarily complex nested data types.
  2. We can support push-down of arbitrary expressions. For instance, if DataFusion supported projection expressions we could evaluate scalar functions over compressed vortex arrays before returning Arrow to DataFusion.
@gatesn gatesn added the epic label Jan 2, 2025
@gatesn gatesn changed the title Non-tabular Projections Non-tabular projections Jan 2, 2025
@gatesn gatesn changed the title Non-tabular projections Projection Expressions Jan 2, 2025
@robert3005
Copy link
Member

One thing that we currently lack is to be able to extract refernced fields in order out of expression for the sake of pruning children. This will require a new transformation to extract them out of all expression

@danking
Copy link
Member

danking commented Jan 2, 2025

Vortex expressions currently lack Pack, GetItem, etc. They do have Column(f) which should get replaced by GetItem(Identity, f).

Currently, only the chunked layout reader prunes. It receives an already-projected expression from its parent (currently always a columnar layout reader). I don't think we have any tests or benchmarks with struct arrays that have struct-typed fields, so, IIUC, the pruner never sees expressions with Column. It only sees expressions with Identity.

AFAICT, when writing struct{a: struct{b: u32, c: u32}}, the writer will write the inner struct as an IPC message, rather than flattening/recursively splitting the columns.

Select::Include and Select::Exclude should take a struct-typed argument and project the (top-level) fields of that argument.

Select: struct -> struct. GetItem: struct -> non-struct. Pack: non-struct ... -> struct.

Maybe useful: Merge or Union which combines two structs (perhaps error'ing if the field names overlap). I don't see such a function in BigQuery.


I think the layout readers were already written to support this use case. Layouts are created from LayoutDescriptors using this function:

    pub fn read_layout(
        &self,
        path: LayoutPath,
        layout: fb::Layout,
        scan: Scan,
        dtype: Arc<LazyDType>,
    ) -> VortexResult<Arc<dyn LayoutReader>>

A Scan is just an optional expression:

pub struct Scan {
    expr: Option<ExprRef>,
}

The VortexReadBuilder needs to accept an expression (what is that method called? with_projection_expression?). Probably good to keep supporting with_projection because, as a client I'd expect at least that simple API to exist.

@danking
Copy link
Member

danking commented Jan 2, 2025

It seems a little surprising as a user to have an array with the following structure but which doesn't support column-projection on the nested columns. The inner struct is written as a flat layout.

struct {
  a: chunked { struct {
     x: u32
     y: u32
     z: u32
  } }
  b: u64
}

I think we could add a flattened-columnar layout with not much effort. It stores the true dtype. It delegates to columnar layout for reading. It passes down an expression like Pack(a: Pack(x: GetItem(Identity, "a_x"), y: ..., ...). That shouldn't copy any data buffers, but it will construct the dtype. Constructing a dtype seem unavoidable currently: calling dtype on a VortexReadArrayStream also constructs the dtype from the lazy dtype bytes.

Due to #710 this would lose the nullability on both the outer struct and the a struct.

@danking
Copy link
Member

danking commented Jan 2, 2025

As for (2), Marko had a working prototype of this from our hackathon project: dk/vortex-genomics branch. He had to implement it at the physical level (using physical optimizers on physical execution plans). It only works on the select ... clause of a DF SQL statement. I think we can (or already do?) handle the where clause via filter pushdown.

I'm not sure how the group by clause is handled by DF. The only non-column-identifier group by clause I see is click bench 42: GROUP BY DATE_TRUNC('minute', "EventTime"). A date time parts encoded array could avoid decoding the sub seconds and could maybe compute on the compressed seconds?

@danking
Copy link
Member

danking commented Jan 2, 2025

Vortex doesn't have the concept of tables, just arrays that can but don't need to be of a struct dtype.

This isn't quite true of vortex-file. It only supports struct arrays and even then drops their validity (#710). It's maybe less effort to change the writer once to support all three of: (1) nested column-projection, (2) struct validity, and (3) non-struct arrays.

@danking
Copy link
Member

danking commented Jan 2, 2025

Here's a worklist proposal:

  • (1) Write examples of alternative expression definitions.
    a. Node { expr_id: enum, children } @danking
    b. Node { expr_id: Box, children } @danking
    c. NilaryNode, UnaryNode, BinaryNode, NAryNode, ..
    d. DataFusion-style
  • (2) result_dtype(array_dtype: DType) -> DType @danking feat: teach VortexExpr to dtype #1811
  • (3) On VortexExpr we need a traverse(self) -> Self. @joseph-isaacs
  • (4) Add GetItem & update Select. Add tests that write nested structs and verify we can (logically, not physically) project a column from a nested structure (e.g. grab a.x and a.z from the above example). @joseph-isaacs
  • (5) Add Pack. @danking feat: implement Pack expression #1810
  • (6) Remove Column.
  • (7) Modify the builder to accept a projection expression rather than a list of columns. Add some tests. Fix any breakage in the layouts (I think it should all just work?).
  • (8) Add flattened-columnar layout which mostly delegates to columnar layout but uses Pack to recreate the expected nesting levels.
  • (9) Teach writer & reader to preserve struct validity. I think we just need to read/write the validity array and the validity metadata.
  • (10) Teach the writer & reader to write non-struct arrays. We probably want write_array which dispatches on column vs non-struct. This is included in the new layout crate, but not yet available in vortex-file: Initial Vortex Layouts #1805
  • (11) Do some thinking about how configurable the writer should be. Should it always respect the array structure? Should it support changing the chunk granularity on write? What about row groups? We do not explicitly preserve the row groups in the layouts (if we had a chunked of struct, the layout would be struct of chunked where each chunked layout has the same row count boundaries). I believe this is superseded by Initial Vortex Layouts #1805 .

@gatesn
Copy link
Contributor Author

gatesn commented Jan 3, 2025

Just before we dive in, we should also think about how the behavior of this works as part of scanning in a vortex-layout world. I'll try to push up some scaffolding so we have something concrete to discuss on that front.

@joseph-isaacs
Copy link
Member

I think there are a few things missing from the work list.

a. Provide a way to find the result type of an expression. This could we a default method creating a empty array and push it through the system, or it could be a new method fn result_type(children: dtype) -> dtype
b. add some expression patterns, such as tree traversal

Also we might be a time to review the vortex expression definition, for example we might want to have a enum definition such as the logical-datafusion expr.

@joseph-isaacs
Copy link
Member

I think that we will be able to push down casts into the decode, these happen in many places, e.g. cast before avg avg(hits.UserID)

gatesn added a commit that referenced this issue Jan 3, 2025
Initial implementation of the new structure of vortex layouts per #1676 

* Only flat layout works.
* I'm not 100% sure on the trait APIs, these will evolve as we pad out
the implementation.
* StructLayout will be worked on as part of #1782 so will probably come
last.

Up next:
* Implementation of ChunkedLayout

Open Questions:
* What is the API that e.g. Python users have to precisely configure
layout strategies? Can I override a layout writer for a specific field?
* Similarly, how can we configure the layout scanners? Can I configure a
level 0 chunked layout differently from level 2 in a
chunk-of-struct-of-chunk world?
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants