zql [1/n] - v1 of end-to-end implementation of zql #29
Conversation
- select
- where
- limit
- orderBy
- count

up later:
- join
- groupBy
- subSelect

but first we'll generate the AST for these operators and get them working e2e in Replicache.
This'll serve as the basis for MemorySource and then integration with Replicache
A connection to Replicache is provided by a `Context` param, injected into queries. Rails will create query instances and will be responsible for injecting this param. The context provides:
- a way to look up sources based on table/collection name
- a materialite instance for the given Replicache instance
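A rough sketch of what that contract could look like (names are illustrative, not the actual interface):

```ts
// Illustrative sketch only -- names are hypothetical, not the actual interface.
interface Context {
  // Look up a dataflow source by table/collection name.
  getSource(name: string): unknown;
  // The materialite instance associated with this Replicache instance.
  materialite: unknown;
}
```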
- orderBy fields are added as a hidden field on the mapped object
- sources are created in the desired order
I only looked at the first few commits and my comments are not substantive.
My main question that I want to understand is where, if at all, copies of entity data are happening from the source since it's so important to minimize them for performance. I hope that we can make this work in a zero-copy way, except for creating the final array/map to return to caller.
The components, in code, that make up the dataflow graph are:

1. [ISource](./ivm/source/ISource.ts)
The most nit, we don't usually label our interfaces with a prefix `I`.
src/zql/schema/EntitySchema.ts

export type Edge<TSrc extends EntitySchema, TDst extends EntitySchema> = {
Now that I've read the design doc, I understand what this type does, but I think some comments would be useful.
(Basically it's the same thing again about the "Edge" and "Node" terminology being unfamiliar. I wasn't sure if this was a node/edge in the query graph, or the data flow graph or ...).
changed it to Relationship
src/zql/query/EntityQueryType.ts

/* eslint-disable @typescript-eslint/ban-types */
Gonna take your word for it on this file. 😂
src/zql/query/EntityQueryType.ts

readonly prepare: () => IStatement<TReturn>;
}

export interface IStatement<TReturn> {
Nittiest nit, but can we just say `Statement`? `IStatement` is very dot-net.
Views can be subscribed to if a user wishes to be notified whenever the view is updated.

Aaron was pretty adamant about using native JS collections, hence the `value` property on `TreeView` returns a JS array. This is fine for cases where the view has a `limit` but for cases where you want a view of thousands+ of items I'd recommend the `PersistentTreapView` (not available here, but in Materialite).
Hey, here I am. I go back and forth about this – it's possible I'm wrong. My main concern is the dx – the dx of true arrays is so nice because JS has so many language helpers and so on for them. When @arv gets in here he'll have opinions I'm sure and may overrule.
I think that in practice we do not want people querying thousands of items, they should instead page them in < 1k at a time. But I can think of use cases for querying thousands of items at once.
This is just absurdly, mind-numbingly exciting btw. Absolutely on the edge of my seat to start playing with this.
Any advice on reviewing this? Is it better to review commit by commit or just review the final results of the PR?
There is a design doc that covers the concepts you should start with if you haven't already: https://www.notion.so/replicache/WIP-ZQL-add4072bba85476ea7d34800176bcf8d?pvs=4
It's up to you guys. I can split this up into a PR per commit for review purposes and stack all of those.
Would create 1k pipelines since we don't collapse over slots in stage 2.

## Stage 3: Pipeline per Unique Unbound ZQL Query
I'm on the edge of my seat for Stages 3+ :)
src/zql/query/EntityQuery.ts

where<K extends keyof S['fields']>(
I had been thinking you would have to select a field in order to filter on it in where. Of course that is not how SQL works, and I think what you have here makes more sense. It will complicate slightly the logic for determining which columns to sync to the client, but I don't think by much.
We will also need logic for determining the 'required columns' for a query on the client, so that we can filter the source to just entities that have the 'required columns'. I'm imagining this would be the very first step in the pipeline.
src/zql/query/IEntityQuery.ts

readonly count: () => IEntityQuery<TSchema, number>;
readonly where: <K extends keyof TSchema['fields']>(
  f: K,
  op: Operator,
Which operators are valid is dependent on the type of the field, and then the type for `value` is dependent on the type of the operator.

I don't have the full typing worked out, but I think instead of just `Operator`, we need `NumberOperator`, `StringOperator`, `BooleanOperator`, and eventually `SubqueryOperator`.
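A sketch of one way the typing could work -- the operator sets and the `where` shape below are assumptions, not the final design:

```ts
// Sketch only: tie the allowed operators to the field's type.
type NumberOperator = '=' | '!=' | '<' | '<=' | '>' | '>=';
type StringOperator = '=' | '!=' | 'like';
type BooleanOperator = '=' | '!=';

type OperatorFor<T> = T extends number
  ? NumberOperator
  : T extends string
  ? StringOperator
  : T extends boolean
  ? BooleanOperator
  : never;

// Example use in a where-like signature (hypothetical fields):
type Fields = {title: string; priority: number; open: boolean};

declare function where<K extends keyof Fields>(
  field: K,
  op: OperatorFor<Fields[K]>,
  value: Fields[K],
): void;

where('priority', '>', 2);     // ok: '>' is a NumberOperator
// where('title', '>', 'a');   // error: '>' is not a StringOperator in this sketch
```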
> Which operators are valid is dependent on the type of the field, and then the type for value is dependent on the type of the operator.

Yeah. Let me see what I can work out.
I will do another high level pass tomorrow.
- phase1: enqueue values
- phase2: run operators
- post-commit: notify observers
As you saw, it does create a new object in order to apply the selection set. We could omit this step. Where I hook it into Replicache it uses … Maybe this can be omitted too? Since, presumably, Replicache already has all of this stuff in-memory? I'll need to dive into the Replicache internals to find out or get some pointers from @arv / @grgbkr
The ultimate arbiter is going to be framerate in Repliear. Before, you managed to beat our carefully built code using abstractions that I felt would be expensive. So maybe @arv is right and we build it the reasonable way and see how it performs. But I'm warning you now that I'm likely to keep pestering about taking copies off the read path since I know it will make us faster and it's like free performance.
Yes, Replicache maintains a memory cache and is very careful to take the object returned by IDB, stick it in the cache, and pass it to the user with no copies. This is the way we have found to be fastest historically.
But since declaration merging is also banned, we move `Statement` -> `StatementImpl`.
@arv - lmk if this is close enough to merge. The big picture should all come together when …

map & filter aren't too interesting / this is a lot more than needed for just map & filter since it lays the groundwork for future iterations. You can make a pipeline that is only map & filter incremental in a few lines of code. You do start to need all the complications of a graph once you want to:
Other big picture references:
- Linear with 1 million items, updating in realtime -- source: https://github.com/vlcn-io/materialite/tree/main/demos/linearite (linear-mil.mov)
- Walkthrough of React bindings: …
- Benchmarks: …
Let's land this. It will be easier to make changes to main in the future.
Boo. Ya.
> Merged #29 into main.
For a high level overview before diving into the code, read below
ZQL
./query/EntityQuery.ts is the main entrypoint for everything query related:
Creating an EntityQuery
First, build your schema for rails as you normally would:
Then you can create a well-typed query builder:
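The original snippets aren't reproduced here; the following is a rough sketch of the two steps with made-up entity and field names (the constructor shape is an assumption, not the actual API):

```ts
// Hypothetical sketch -- the `Issue` entity and its fields are illustrative only.
type Issue = {
  id: string;
  title: string;
  priority: number;
};

// An EntitySchema-style description of the collection.
type IssueSchema = {fields: Issue};

// A query builder is constructed from the Context (the Replicache integration
// point) and the same `prefix` that is passed to rails' `generate`.
declare const context: unknown;
declare class EntityQuery<S extends {fields: Record<string, unknown>}> {
  constructor(context: unknown, prefix: string);
}

const issueQuery = new EntityQuery<IssueSchema>(context, 'issue');
```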
The `Context` passed to `EntityQuery` is the integration point between the query builder and Replicache. It provides the query builder with a way to gain access to the current Replicache instance and collections. See `makeTestContext` for an example.

The `prefix` passed to `EntityQuery` is the same `prefix` parameter that is passed to rails `generate`. It is used to identify the collection being queried.

EntityQuery
./query/EntityQuery.ts

`EntityQuery` holds all the various query methods and is responsible for building the AST (`./ast/ZqlAst.ts`) to represent the query. Example:
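The original example isn't reproduced here; the following is a rough sketch of what a chained query could look like (entity and field names are made up; the method names are the ones this PR lists):

```ts
// Hypothetical sketch of the builder surface. Method names (select / where /
// limit / orderBy / count / prepare) come from this PR; everything else is made up.
interface IssueQuery {
  select(...fields: string[]): IssueQuery;
  where(field: string, op: string, value: unknown): IssueQuery;
  limit(n: number): IssueQuery;
}
declare const issueQuery: IssueQuery;

// Each call copies the internal AST and returns a new query;
// `issueQuery` itself is never mutated.
const highPriority = issueQuery
  .select('id', 'title')
  .where('priority', '>', 2)
  .limit(10);
```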
Under the hood, `where`, `join`, `select`, etc. are all making a copy of and updating the internal `AST`.

Key points:
- `EntityQuery` is immutable. Each method invoked on it returns a new query. This prevents queries that have been passed around from being modified out from under their users. This also makes it easy to fork queries that start from a common base.
- `EntityQuery` is a 100% type safe interface to the user. Layers below `EntityQuery` which are internal to the framework do need to ditch type safety in a number of places but, since the interface is typed, we know the types coming in are correct.
- The order in which methods of `EntityQuery` that return `this` are invoked does not and will not ever matter. All permutations will return the same AST and result in the same query.

Once a user has built a query they can turn it into a prepared statement.
Prepared Statements
./query/Statement.ts
A prepared statement is used to:
- Lifetime - A statement will subscribe to its input sources when it is subscribed or when it is materialized into a view. For this reason, statements must be cleaned up by calling `destroy`.
- Bindings - not yet implemented. See the ZQL design doc.
- Query de-duplication - not yet implemented. See the ZQL design doc.
- Materialization - the process of running the query and, optionally, keeping that query's results up to date. Materialization can be 1-shot or continually maintained.
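A rough sketch of the lifecycle these points describe; only `prepare` and `destroy` are named in this PR, the other method names are invented:

```ts
// Sketch of the intended lifecycle; `materialize` / `subscribe` are
// illustrative names -- only `prepare` and `destroy` appear in this PR.
declare const query: {prepare(): Statement};
interface Statement {
  materialize(): {value: readonly unknown[]};             // run the pipeline and keep it maintained
  subscribe(cb: (value: readonly unknown[]) => void): () => void;
  destroy(): void;                                        // detach from input sources
}

const stmt = query.prepare();    // builds the dataflow pipeline from the AST
const view = stmt.materialize(); // attaches to sources and runs the query
console.log(view.value);         // current results as a plain JS array
stmt.destroy();                  // statements must be cleaned up explicitly
```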
Prepared Statement Creation
./ast-to-ivm/pipelineBuilder.ts
When the user calls `query.prepare()` the `AST` held by the query is converted into a differential dataflow graph. The resulting graph/pipeline is held by the prepared statement. The `pipelineBuilder` is responsible for performing this conversion.

high level notes on dataflow
The pipeline builder walks the AST --
- As tables are encountered (`FROM` and `JOIN`) they are added as sources to the graph.
- For each `JOIN`, a `JOIN` operator is added to join the two mentioned sources.
- For `WHERE` conditions, those are added as filters against the sources.
- For `SELECT` statements, those are added as `map` operations to re-shape the results.
- `ORDER BY` and `LIMIT` are retained to either be passed to the source provider, view or both.

Dataflow Internals: Source, DifferenceStream, Operator, View
Also see: ./ivm/README.md
The components, in code, that make up the dataflow graph are:

1. [ISource](./ivm/source/ISource.ts)
2. DifferenceStream
3. Operators:
   - join (`JOIN`)
   - map (`SELECT` as well as `function` application)
   - reduce (`GroupBy`)
   - filter (`WHERE` and `ON` statements, or `HAVING` when applied after a reduction)
   - count (`COUNT`)
4. Views

Conspicuously absent are `LIMIT` and `ORDER BY`. These are handled either by the sources or views. A future section is devoted to these two.

The above components would be composed into a graph like so:
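The original illustration isn't reproduced here; as a rough sketch of how the pieces compose for something like `SELECT title FROM issue WHERE priority > 2` (all names below are illustrative, not the actual classes):

```ts
// Illustrative only -- the shape of a source -> filter -> map -> view pipeline.
declare const issueSource: {
  stream(): DifferenceStream<{id: string; title: string; priority: number}>;
};
interface DifferenceStream<T> {
  filter(p: (row: T) => boolean): DifferenceStream<T>;
  map<U>(f: (row: T) => U): DifferenceStream<U>;
  materialize(): {value: readonly T[]};
}

const view = issueSource
  .stream()                          // source (FROM issue)
  .filter(row => row.priority > 2)   // filter operator (WHERE)
  .map(row => ({title: row.title}))  // map operator (SELECT)
  .materialize();                    // view holds the maintained result
```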
Query Execution
Query execution, from scratch, and incremental maintenance are nearly identical processes.
The dataflow graph represents the execution plan for a query. Executing a query is a simple matter of sending all rows from all sources through the graph.
Execution can be optimized in the case where a `limit` and `cursor` are provided (not yet implemented here). In other words, if:
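For example (a hedged sketch; the `after` cursor method and the field names are invented for illustration):

```ts
// Hypothetical example -- cursor support is not implemented in this PR,
// and `after` is an invented name for the cursor parameter.
interface PageQuery {
  orderBy(field: string): PageQuery;
  limit(n: number): PageQuery;
  after(cursor: unknown): PageQuery;
}
declare const issueQuery: PageQuery;
declare const lastSeen: unknown;

// "The next 50 issues, ordered by `modified`, starting after the last row we saw."
const nextPage = issueQuery.orderBy('modified').limit(50).after(lastSeen);
```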
We can jump to that set of rows rather than feeding all rows. If a limit is specified we can stop reading rows once we hit the limit.
The limit functionality is implemented by making `Multiset` lazy. See `Multiset.ts` for how this is currently implemented. The limited view that is pulling values from a multiset can stop without all values being visited.

Not yet implemented would be index selection. E.g., queries of the form:
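A sketch only -- the identifiers below are made up:

```ts
// Hypothetical shape of such a query: an equality filter on the primary key.
declare const issueQuery: {where(field: string, op: '=', value: string): unknown};
const byId = issueQuery.where('id', '=', 'issue-123');
```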
Such queries should just be lookups against the primary key rather than a full scan.
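To make the laziness mentioned above concrete, here is a minimal sketch (not the actual `Multiset.ts`) of a generator-backed multiset where a limited consumer can stop early:

```ts
// Minimal sketch: entries are (value, multiplicity) pairs produced lazily,
// so a consumer applying a limit never forces the full set.
type Entry<T> = readonly [T, number];

class LazyMultiset<T> {
  constructor(private readonly gen: () => Generator<Entry<T>>) {}

  *[Symbol.iterator](): Generator<Entry<T>> {
    yield* this.gen();
  }

  // Take at most `n` entries; the generator is only advanced `n` times.
  take(n: number): Entry<T>[] {
    const out: Entry<T>[] = [];
    if (n <= 0) return out;
    for (const entry of this) {
      out.push(entry);
      if (out.length >= n) break;
    }
    return out;
  }
}

// Usage: only the first 3 entries of an unbounded stream are ever generated.
const source = new LazyMultiset(function* () {
  for (let i = 0; ; i++) {
    yield [i, 1] as const;
  }
});
console.log(source.take(3)); // [[0, 1], [1, 1], [2, 1]]
```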
If a view's order does not match a source's order, we will (not yet implemented here) create a new version of the source that is in the view's order. This source will be maintained in concert with the original source and used for any queries that need the given ordering.
Incremental Maintenance
Incremental maintenance is simply a matter of feeding each write through the graph as it happens. The graph will produce the correct aggregate result at the end.
The details of how that works are specific to individual operators. For operators that only use the current row, like map & filter, it is trivial. They just emit their result. For join and reduce (not implemented yet) it is more complex.
What has been implemented here lays the groundwork for join & reduce, which is why it isn't as simple as a system that only needs to support map & filter.
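As an illustration of why map & filter are trivially incremental (a sketch, not this PR's operator classes):

```ts
// Sketch: a difference is a (row, multiplicity) pair; +1 = added, -1 = removed.
type Difference<T> = readonly [row: T, multiplicity: number];

// A filter operator is trivially incremental: each difference is passed
// through (or dropped) without consulting any other state.
function filterOperator<T>(
  predicate: (row: T) => boolean,
  output: (d: Difference<T>) => void,
): (d: Difference<T>) => void {
  return d => {
    if (predicate(d[0])) {
      output(d);
    }
  };
}

// Feeding a write through the "graph": an edit is a remove (-1) of the old
// row followed by an add (+1) of the new row.
const emit = filterOperator<{id: string; priority: number}>(
  row => row.priority > 2,
  d => console.log(d),
);
emit([{id: 'a', priority: 5}, 1]);  // passes the filter, flows downstream
emit([{id: 'a', priority: 5}, -1]); // the old version is retracted
emit([{id: 'a', priority: 1}, 1]);  // new version fails the filter, dropped
```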
Sources: Stateful vs Stateless
Sources model tables. A source can come in stateful or stateless variants.
A stateless source cannot return historical data to queries that are subscribed after the source was created.
A stateful source knows its contents. When a data flow graph is attached to it, the full contents of the source are sent through the graph, effectively running the query against historical data.
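A minimal sketch of that behaviour (illustrative, not the actual `ISource` implementation):

```ts
// Sketch only: a stateful source remembers its rows so that a pipeline
// attached later can still be caught up on historical data.
type Listener<T> = (row: T, multiplicity: number) => void;

class StatefulSetSource<T> {
  readonly #rows = new Set<T>();
  readonly #listeners = new Set<Listener<T>>();

  add(row: T): void {
    this.#rows.add(row);
    for (const l of this.#listeners) l(row, 1); // incremental update
  }

  // Attaching a graph replays the full contents, effectively running the
  // query against historical data before switching to incremental updates.
  attach(listener: Listener<T>): void {
    this.#listeners.add(listener);
    for (const row of this.#rows) listener(row, 1);
  }
}
```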
Views: ValueView, TreeView
There are currently two kinds of views: `ValueView` and `TreeView`. `ValueView` maintains the result of a `count` query. `TreeView` maintains `select` queries.

`TreeView` holds a comparator which uses the columns provided to `Order By` to sort its contents. If no `Order By` is specified then the items are ordered by `id`. Any time an `Order By` is specified that does not include the `id`, the `id` is appended as the last item to order by. This is so we get a stable sort order when users sort on columns that are not unique.
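A sketch of the stable-ordering rule described above (illustrative only, not the actual `TreeView` comparator):

```ts
// Sketch: append `id` as the final sort key so ordering is stable even when
// the user sorts on a non-unique column.
type Row = {id: string} & Record<string, string | number>;

function compareValues(x: string | number, y: string | number): number {
  if (typeof x === 'number' && typeof y === 'number') return x - y;
  return String(x) < String(y) ? -1 : String(x) > String(y) ? 1 : 0;
}

function makeComparator(orderBy: string[]): (a: Row, b: Row) => number {
  const fields = orderBy.includes('id') ? orderBy : [...orderBy, 'id'];
  return (a, b) => {
    for (const f of fields) {
      const c = compareValues(a[f], b[f]);
      if (c !== 0) return c;
    }
    return 0;
  };
}

// Two rows with the same `modified` value still sort deterministically by id.
const rows: Row[] = [
  {id: 'b', modified: 100},
  {id: 'a', modified: 100},
];
rows.sort(makeComparator(['modified'])); // 'a' before 'b'
```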
Views can be subscribed to if a user wishes to be notified whenever the view is updated.

Aaron was pretty adamant about using native JS collections, hence the `value` property on `TreeView` returns a JS array. This is fine for cases where the view has a `limit` but for cases where you want a view of thousands+ of items I'd recommend the `PersistentTreapView` (not available here, but in Materialite).

OR, Parenthesis & Breadth First vs Depth First Computation
This PR does not support `OR` or nested conditions but does lay the groundwork for it by executing the dataflow graph breadth first rather than depth first.

This is the reason for the split of dataflow events between `enqueue` and `notify` or `run` and `notify` for operators.

See the commentary on `IOperator` in `Operator.ts`.
Transactions
The IVM system here has a concept of a transaction. It enables:
See commentary in `ISourceInternal` in `ISource.ts` as well as on `IOperator` in `Operator.ts`.
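A rough sketch of the phases described in the commit notes earlier in this PR (enqueue values, run operators, notify observers post-commit); the names below are illustrative, not the actual `ISourceInternal`/`IOperator` API:

```ts
// Sketch only: the three transaction phases.
interface OperatorLike {
  run(): void;     // phase 2: process whatever was enqueued
  notify(): void;  // post-commit: tell downstream observers
}

class Tx {
  readonly #pending: Array<() => void> = [];
  readonly #operators: OperatorLike[] = [];

  register(op: OperatorLike): void {
    this.#operators.push(op);
  }

  // Phase 1: writes only enqueue values; nothing runs yet.
  enqueue(write: () => void): void {
    this.#pending.push(write);
  }

  commit(): void {
    for (const write of this.#pending) write();     // phase 1: flush enqueued values
    for (const op of this.#operators) op.run();     // phase 2: run operators breadth first
    for (const op of this.#operators) op.notify();  // post-commit: notify observers
  }
}
```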
zkl-compressed.mov