Implement `Window` capability in Dozer SQL #893

mediuminvader · 2023-02-14T05:57:45Z

WINDOWING

A window is a viewport onto a buffer: it gives you a snapshot of a stream within a given timeframe, and it can be either hopping or tumbling.

In Dozer, a WINDOW is a function that bounds a Record inside a timeframe, generates the columns needed for windowing, and forwards the enriched Record to the Caching layer. The Caching layer uses these columns to apply windowing operations and eviction policies to the Record.
Dozer SQL supports tumbling windows with the TUMBLE function and hopping windows with the HOP function.

Note: This document considers only TIME windows.

TUMBLE()

A window is called a tumbling window when consecutive windows do not overlap. In this case, each record on a stream belongs to exactly one window and is processed only once.

The tumbling processor is defined in the FROM clause as follows:

SELECT ...
FROM TUMBLE(table_name, column_name, window_size);

table_name: a source, or a virtual table produced by an INTO operation.
column_name: the column evaluated to define the window; it can be either a timestamp or a datetime.
window_size: the size of the timeframe, in the format 'N unit_of_time', where N is the number of units and unit_of_time is one of 'SECOND', 'MINUTE', 'HOUR', 'DAY', 'MONTH', 'YEAR'.
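To make the 'N unit_of_time' format concrete, here is a minimal Rust sketch of how such a string could be turned into a duration. The function name `parse_window_size` is hypothetical and not part of Dozer. Two assumptions are baked in: plural units such as 'MINUTES' are accepted because the example queries in this post use them, and 'MONTH' and 'YEAR' are rejected because they have no fixed length in seconds and would need calendar-aware handling.

```rust
use std::time::Duration;

/// Parses a window size such as "2 MINUTES" into a Duration.
/// Hypothetical helper: only fixed-length units are supported here.
fn parse_window_size(text: &str) -> Result<Duration, String> {
    let mut parts = text.split_whitespace();

    // First token: the count N.
    let count: u64 = parts
        .next()
        .ok_or_else(|| "missing count".to_string())?
        .parse()
        .map_err(|e| format!("invalid count: {e}"))?;
    if count == 0 {
        return Err("window size must be greater than zero".to_string());
    }

    // Second token: the unit, case-insensitive, with an optional trailing S.
    let raw_unit = parts
        .next()
        .ok_or_else(|| "missing unit".to_string())?
        .to_ascii_uppercase();
    let unit = raw_unit.strip_suffix('S').unwrap_or(raw_unit.as_str());

    let seconds_per_unit: u64 = match unit {
        "SECOND" => 1,
        "MINUTE" => 60,
        "HOUR" => 3_600,
        "DAY" => 86_400,
        other => return Err(format!("unsupported unit: {other}")),
    };

    if parts.next().is_some() {
        return Err("unexpected input after the unit".to_string());
    }

    // checked_mul guards against overflow for very large counts.
    count
        .checked_mul(seconds_per_unit)
        .map(Duration::from_secs)
        .ok_or_else(|| "window size is too large".to_string())
}
```

With this sketch, `parse_window_size("2 MINUTES")` yields a 120-second duration, while `parse_window_size("1 MONTH")` returns an error instead of guessing a month length.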

Take as an example the table "taxi_trips", which has the columns taxi_id, completed_at and distance:

taxi_id	completed_at	distance
1001	2023-02-01 20:00:00	4
1002	2023-02-01 20:01:00	6
1003	2023-02-01 20:02:10	3
1004	2023-02-01 20:03:00	7
1005	2023-02-01 20:05:00	2
1006	2023-02-01 20:06:00	8

Here is a query that uses the tumble window function:

SELECT taxi_id, completed_at, window_start, window_end
FROM TUMBLE('taxi_trips', 'completed_at', '2 MINUTES');

The result looks like this:

taxi_id	completed_at	window_start	window_end
1001	2023-02-01 20:00:00	2023-02-01 20:00:00	2023-02-01 20:02:00
1002	2023-02-01 20:01:00	2023-02-01 20:00:00	2023-02-01 20:02:00
1003	2023-02-01 20:02:10	2023-02-01 20:02:00	2023-02-01 20:04:00
1004	2023-02-01 20:03:00	2023-02-01 20:02:00	2023-02-01 20:04:00
1005	2023-02-01 20:05:00	2023-02-01 20:04:00	2023-02-01 20:06:00
1006	2023-02-01 20:06:00	2023-02-01 20:06:00	2023-02-01 20:08:00
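The window bounds in the table above can be reproduced with simple integer arithmetic. The Rust sketch below is an illustration, not Dozer's implementation: the function name `tumble_window` is hypothetical, timestamps and sizes are plain seconds, and windows are assumed to be aligned to multiples of the window size counted from zero, which matches the even-minute boundaries in the result.

```rust
/// Returns the half-open tumbling window [start, end) that contains `ts`.
/// `ts` and `size` are in seconds; `size` must be positive.
fn tumble_window(ts: i64, size: i64) -> (i64, i64) {
    assert!(size > 0, "window size must be positive");
    // rem_euclid keeps the remainder non-negative, so a negative timestamp
    // still lands in the window that starts at or before it.
    let start = ts - ts.rem_euclid(size);
    (start, start + size)
}
```

Counting seconds from midnight, 20:02:10 is 72130 and a 2 minute window is 120 seconds, so `tumble_window(72130, 120)` gives (72120, 72240), that is 20:02:00 to 20:04:00, as in the row for taxi 1003. A record that falls exactly on a boundary, such as taxi 1006 at 20:06:00, opens a new window because the window end is exclusive.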

HOP

Hopping windows model scheduled windows that can overlap. The hop_size parameter specifies how far each window moves forward relative to the previous one.

This is the syntax of the hop window function:

SELECT ...
FROM HOP(table_name, column_name, hop_size, window_size);

table_name: a source, or a virtual table produced by an INTO operation.
column_name: the column evaluated to define the window; it can be either a timestamp or a datetime.
hop_size: the size of the hop, in the format 'N unit_of_time'.
window_size: the size of the timeframe, in the format 'N unit_of_time'.

Using this query on the same dataset as above:

SELECT taxi_id, completed_at, window_start, window_end
FROM HOP('taxi_trips', 'completed_at', '1 MINUTE', '2 MINUTES');

This is the result:

taxi_id	completed_at	window_start	window_end
1001	2023-02-01 20:00:00	2023-02-01 19:59:00	2023-02-01 20:01:00
1002	2023-02-01 20:01:00	2023-02-01 20:00:00	2023-02-01 20:02:00
1001	2023-02-01 20:00:00	2023-02-01 20:00:00	2023-02-01 20:02:00
1003	2023-02-01 20:02:10	2023-02-01 20:01:00	2023-02-01 20:03:00
1002	2023-02-01 20:01:00	2023-02-01 20:01:00	2023-02-01 20:03:00
1004	2023-02-01 20:03:00	2023-02-01 20:02:00	2023-02-01 20:04:00
1003	2023-02-01 20:02:10	2023-02-01 20:02:00	2023-02-01 20:04:00
1004	2023-02-01 20:03:00	2023-02-01 20:03:00	2023-02-01 20:05:00
1005	2023-02-01 20:05:00	2023-02-01 20:04:00	2023-02-01 20:06:00
1006	2023-02-01 20:06:00	2023-02-01 20:05:00	2023-02-01 20:07:00
1005	2023-02-01 20:05:00	2023-02-01 20:05:00	2023-02-01 20:07:00
1006	2023-02-01 20:06:00	2023-02-01 20:06:00	2023-02-01 20:08:00

Note that each row appears in the result once for every overlapping window that contains its timestamp.
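The duplication can be sketched in Rust as a function that lists every window containing a given timestamp. This is an illustration under assumptions, not Dozer's implementation: the name `hop_windows` is hypothetical, all values are plain seconds, and window starts are assumed to fall on multiples of the hop size counted from zero, which matches the whole-minute starts in the result.

```rust
/// Returns every half-open hopping window [start, end) that contains `ts`,
/// ordered by ascending start. All values are in seconds.
fn hop_windows(ts: i64, hop: i64, size: i64) -> Vec<(i64, i64)> {
    assert!(hop > 0 && size > 0, "hop and size must be positive");
    // Latest window start that is not after `ts`.
    let mut start = ts - ts.rem_euclid(hop);
    let mut windows = Vec::new();
    // Walk backwards one hop at a time while the window still covers `ts`.
    while start + size > ts {
        windows.push((start, start + size));
        start -= hop;
    }
    windows.reverse();
    windows
}
```

Counting seconds from midnight, taxi 1003 at 20:02:10 is 72130, and `hop_windows(72130, 60, 120)` returns the two windows (72060, 72180) and (72120, 72240), that is 20:01:00 to 20:03:00 and 20:02:00 to 20:04:00. When the hop equals the window size, the function returns exactly one window per timestamp.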

IMPLEMENTATION

A Windowing Processor is initialised with the parameters parsed from the FROM clause.
The WindowProcessor trait is implemented for both Tumble and Hop.

struct TumbleWindow {
    column: u16,
    interval: Interval,
}

struct HopWindow {
    column: u16,
    hop_size: Interval,
    interval: Interval,
}

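trait WindowProcessor {
    // Returns the input record enriched with the window columns;
    // a hopping window can yield several records for one input.
    fn apply(&self, record: &Record) -> Result<Vec<Record>, PipelineError>;
    // Returns the input schema extended with the window columns.
    fn get_output_schema(&self, schema: &Schema) -> Result<Schema, PipelineError>;
}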

Note: The same trait could be useful for other Record operators, in which case it could be renamed to something like RecordFilter.

Window processing happens after the Source node and before the ProductProcessor, so it could be done either in a separate processor or inside the ProductProcessor.

edit: from comment
A WindowProcessor is going to be introduced. The connections from Source to WindowProcessor to ProductProcessor are mapped during the composition of the DAG, after SQL parsing.
When windowing is used, the Record Store required by the ProductProcessor moves from the Source to the WindowProcessor.

Note: The SQL parser might need to be extended to support the WINDOW function syntax.

Tumble processing starts when an input record is received: it evaluates the WINDOW column and generates the window_start and window_end columns according to the window_size parameter.
For a hopping window, multiple output Records are generated based on the window_size and hop_size parameters.
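As a rough end-to-end illustration of the processing described above, here is a self-contained Rust sketch of the two window structs behind the trait. It is not Dozer code. `Record`, `Schema` and `PipelineError` are simplified stand-ins for the real types, every field is stored as whole seconds in an `i64`, intervals are plain seconds rather than an `Interval`, the helper `enrich` is hypothetical, and windows are assumed to be aligned to multiples of the hop size counted from zero. Implementing the tumbling case as a hop equal to the window size is a choice made for this sketch only.

```rust
// Simplified stand-ins for Dozer's Record, Schema and PipelineError types.
#[derive(Debug, Clone, PartialEq)]
struct Record {
    values: Vec<i64>, // every field, timestamps included, as whole seconds
}

#[derive(Debug, Clone, PartialEq)]
struct Schema {
    fields: Vec<String>,
}

#[derive(Debug)]
struct PipelineError(String);

struct TumbleWindow {
    column: u16,
    interval: i64,
}

struct HopWindow {
    column: u16,
    hop_size: i64,
    interval: i64,
}

trait WindowProcessor {
    fn apply(&self, record: &Record) -> Result<Vec<Record>, PipelineError>;

    // Both window kinds append the same two columns, so a default body suffices.
    fn get_output_schema(&self, schema: &Schema) -> Result<Schema, PipelineError> {
        let mut fields = schema.fields.clone();
        fields.push("window_start".to_string());
        fields.push("window_end".to_string());
        Ok(Schema { fields })
    }
}

// Reads the window column and emits one enriched copy of the record per
// window of length `size`, with window starts spaced `hop` seconds apart.
fn enrich(record: &Record, column: u16, hop: i64, size: i64) -> Result<Vec<Record>, PipelineError> {
    if hop <= 0 || size <= 0 {
        return Err(PipelineError("hop and window size must be positive".to_string()));
    }
    let ts = *record
        .values
        .get(usize::from(column))
        .ok_or_else(|| PipelineError(format!("no column at index {column}")))?;
    let mut start = ts - ts.rem_euclid(hop); // latest window start <= ts
    let mut out = Vec::new();
    while start + size > ts {
        let mut values = record.values.clone();
        values.extend([start, start + size]); // window_start, window_end
        out.push(Record { values });
        start -= hop;
    }
    out.reverse(); // earliest window first
    Ok(out)
}

impl WindowProcessor for TumbleWindow {
    fn apply(&self, record: &Record) -> Result<Vec<Record>, PipelineError> {
        // In this sketch a tumbling window hops by exactly its own size.
        enrich(record, self.column, self.interval, self.interval)
    }
}

impl WindowProcessor for HopWindow {
    fn apply(&self, record: &Record) -> Result<Vec<Record>, PipelineError> {
        enrich(record, self.column, self.hop_size, self.interval)
    }
}
```

For a record with values [1003, 72130] (taxi 1003, completed 72130 seconds after midnight), a `TumbleWindow { column: 1, interval: 120 }` returns one record ending in 72120, 72240, while a `HopWindow { column: 1, hop_size: 60, interval: 120 }` returns two records, one per overlapping window.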

@getdozer/dozer-dev

chubei · 2023-02-14T06:13:26Z

I'm wondering if these functions exist in other SQL dialects or if we just invented them.

Questions:

How does hopping window affect the primary key?
Is tumbling window a special case of hopping window?
Why can't we use the record readers from the window processor, if the window processor is implemented separately?

8 replies

mediuminvader Feb 14, 2023
Author

One problem in this case could be data replication: two queries using windowing will duplicate data. Currently the Record history is kept mostly in the source to avoid data duplication.

chubei Feb 14, 2023

If two queries both use windowing with the same input, can they share the same window processor so there's no data duplication?

mediuminvader Feb 14, 2023
Author

I guess both queries would have to use the same windowing parameters.
Or we could implement back propagation of the Record history request, but that would require re-executing the windowing every time.

chubei Feb 14, 2023

Interesting. I guess we'll encounter a case where the product processor's inputs are other processors' outputs sooner or later?

mediuminvader Feb 14, 2023
Author

That’s what I’m afraid of, too.
I wonder if it makes more sense to have stateful input ports, rather than outputs.

chubei · 2023-04-26T09:39:21Z

Should we close this discussion?

0 replies
