Implement Window
capability in Dozer SQL
#893
Closed
mediuminvader
started this conversation in
Feature Requests
Replies: 2 comments 8 replies
-
I'm wondering if these functions exist in other Sql dialects or we just invented them. Questions:
|
Beta Was this translation helpful? Give feedback.
8 replies
-
Should we close this discussion? |
Beta Was this translation helpful? Give feedback.
0 replies
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
WINDOWING
A window is a viewport onto a buffer, it gives you a snapshot of a stream within a given timeframe and can be set as hopping or tumbling.
In Dozer a WINDOW is a function that bounds a
Record
inside a timeframe, generates the columns necessary to windowing and forwards the enrichedRecord
to the Caching layer; The Caching layer will use these columns to apply Windowing operations and eviction policies to theRecord
.DozerSQL supports tumbling window with the
TUMBLE
function and hopping window with theHOP
function.TUMBLE()
The window is referred to as a tumbling window when each window is processed in a non-overlapping manner. In this case, each record on a stream belongs to a specific window and it is processed only once.
Tumbling processor is defined in the FROM clause like the following:
table_name: can be a source or a virtual table as a result of an INTO operation.
column_name: is the column to evaluate to define the window, can be in either timestamp or a datetime.
window_size: the size of the timeframe to consider in the format 'N unit_of_time' where unit_of_time can be 'SECOND', 'MINUTE', 'HOUR', 'DAY', 'MONTH', 'YEAR' and N specifies the size of the window.
Take as an example the table "taxi_trips" that consists of these columns: taxi_id, completed_at and distance:
Here is a query that uses the tumble window function.
The result looks like this:
HOP
Hopping windows model scheduled windows that can overlap, this is specified using a the hop_size parameter that specifies how much each window moves forward relative to the previous one.
This is the syntaxt to use the hop window function:
table_name: can be a source or a virtual table as a result of an INTO operation.
column_name: is the column to evaluate to define the window, can be in either timestamp or a datetime.
hop_size: the size of the hop in the format 'N unit_of_time'
window_size: the size of the timeframe to consider in the format 'N unit_of_time'
Using this query on the same dataset above:
this will be the result:
Note that the rows in the result are duplicated based on how many windows are overlapping in a time frame.
IMPLEMENTATION
A Windowing Processor is initialised with the parameters coming from the
FROM
clause parsing,the
RecordFilter
trait is implemented for both Tumble and Hop.The Window processing will happen after the
Source
node and before theProductProcessor
, this means such operation could happen either in a separated processor or inside theProductProcessor
.edit: from comment
Is going to be introduced a WindowProcessor, the correct mapping of the connections from Source to WindowProcessor to ProductProcessor is done during the composition of the DAG after the SQL Parsing.
The Record Store required from the ProductProcessor is going to move from Source to the WindowProcessor, in case Windowing is happening.
The Tumble processing start after receiving the input record, evaluating the WINDOW column and generating the
window_start
andwindow_end
columns accoringly with thewindow_size
parameter.In the case of a Hopping multiple output
Record
s will be generated based onwindow_size
andhop_size
parameters.@getdozer/dozer-dev
Beta Was this translation helpful? Give feedback.
All reactions