diff --git a/README.md b/README.md
index fceba015..42b771b0 100644
--- a/README.md
+++ b/README.md
@@ -40,6 +40,7 @@ Refer to the [Installation section](https://getdozer.io/docs/installation) for i
 | | [Using Sub queries](./sql/sub-queries) | How to use sub queries in Dozer |
 | | [Using UNIONs](./sql/union) | How to combine data using `UNION` in Dozer |
 | | [Using Window Functions](./sql/window-functions) | Use `Hop` and `Tumble` Windows |
+| | [Using TTL](./sql/ttl) | Use `TTL` to manage memory usage |
 | | | |
 | Use Cases | [Flight Microservices](./usecases/pg-flights) | Build APIs over multiple microservices. |
 | | [Scaling Ecommerce](./usecases/scaling-ecommerce) | Profile and benchmark Dozer using an ecommerce data set |
diff --git a/sql/README.md b/sql/README.md
index abb2bd40..8a8b449f 100644
--- a/sql/README.md
+++ b/sql/README.md
@@ -2,7 +2,7 @@
 
 This is a comprehensive guide showcasing different types of queries possible with Dozer SQL.
 
-## Dataset 
+## Dataset
 
 We will be using two tables throughout this guide. These tables are from [NYC - TLC Trip Record Data](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page). To download these run the command,
 
@@ -16,8 +16,7 @@ This table is contained in a parquet file under `data/trips/fhvhv_tripdata_2022-
 
 ![table_1_image](/sql/images/table_1.png)
 
-
-### Table 2: taxi_zone_lookup 
+### Table 2: taxi_zone_lookup
 
 This table is contained in a csv file under `data/zones/taxi_zone_lookup.csv`.
 
@@ -36,20 +35,22 @@ Hence, the basic statement structure is,
 ```sql
 SELECT A INTO C FROM B;
 ```
+
 The datatypes and casting compatible with Dozer SQL are described in the [documentation for datatypes and casting](https://getdozer.io/docs/transforming-data/data-types).
 
 Dozer SQL also supports primitive scalar function described in [documentation for scalar functions](https://getdozer.io/docs/transforming-data/scalar-functions).
 
 ## Table of contents
 
-Let us start with basic Dozer SQL queries and move towards more complex queries.
-
-| Sr.no | Query type | Description |
-| ----- | ---------- | -------------------------------------------------------------------- |
-| 1 | [Filtering](./filtering/README.md) | A simple select operation with a `WHERE` clause |
-| 2 | [Aggregation](./aggregation/README.md) | Multiple queries each describing a specifc aggregation on the data |
-| 3 | [JOIN](./join/README.md) | Query to JOIN the tables based on `LocationID` |
-| 4 | [CTEs](./cte/README.md) | Query with two CTE tables JOINed after filtering |
-| 5 | [Sub queries](./sub-queries/README.md) | Multiple queries describing nested `SELECT` statements |
-| 6 | [UNION](./union/README.md) | A `UNION` peformed inside a CTE, followed by a `JOIN` |
-| 7 | [Window functions](./window-functions/README.md) | Queries describing the use of `TUMBLE` and `HOP` |
\ No newline at end of file
+Let us start with basic Dozer SQL queries and move towards more complex queries.
+
+| Sr.no | Query type | Description |
+| ----- | ------------------------------------------------ | ------------------------------------------------------------------ |
+| 1 | [Filtering](./filtering/README.md) | A simple select operation with a `WHERE` clause |
+| 2 | [Aggregation](./aggregation/README.md) | Multiple queries each describing a specific aggregation on the data |
+| 3 | [JOIN](./join/README.md) | Query to JOIN the tables based on `LocationID` |
+| 4 | [CTEs](./cte/README.md) | Query with two CTE tables JOINed after filtering |
+| 5 | [Sub queries](./sub-queries/README.md) | Multiple queries describing nested `SELECT` statements |
+| 6 | [UNION](./union/README.md) | A `UNION` performed inside a CTE, followed by a `JOIN` |
+| 7 | [Window functions](./window-functions/README.md) | Queries describing the use of `TUMBLE` and `HOP` |
+| 8 | [TTL](./ttl/README.md) | Queries describing the use of `TTL` |
diff --git a/sql/images/ttl_graph.png b/sql/images/ttl_graph.png
new file mode 100644
index 00000000..1272c86d
Binary files /dev/null and b/sql/images/ttl_graph.png differ
diff --git a/sql/ttl/README.md b/sql/ttl/README.md
new file mode 100644
index 00000000..d15c0ba2
--- /dev/null
+++ b/sql/ttl/README.md
@@ -0,0 +1,107 @@
+# TTL function example
+
+This example shows how to use the Time To Live (TTL) function in Dozer SQL.
+
+The TTL function provides a way to manage memory usage in Dozer, particularly when dealing with large streams of data. Setting a TTL ensures that only relevant (or recent) data is held in memory, balancing data retention against memory efficiency. TTL is based on the record's timestamp, so that data eviction is contextually relevant.
+
+To read more about TTL, see the [documentation](https://getdozer.io/docs/transforming-data/windowing#ttl).
+
+Here we describe two queries that use only fresh data obtained over a 5-minute window:
+
+- Query to calculate the sum of tips obtained for a particular pickup location over a 2-minute window.
+
+- Query to calculate the sum of tips obtained for a particular pickup location over a 3-minute window, where the windows overlap by 1 minute.
+  i.e. the 3 minutes are divided into,
+  - 1 minute overlapping with the past window
+  - 1 minute non-overlapping
+  - 1 minute overlapping with the next window
+
+## SQL Query and Structure
+
+### Query 1
+
+```sql
+  SELECT t.PULocationID AS location, SUM(t.tips) AS total_tips, t.window_start AS start, t.window_end AS end
+  INTO table1
+  FROM TTL(TUMBLE(trips, pickup_datetime, '2 MINUTES'), pickup_datetime, '5 MINUTES') t
+  GROUP BY t.PULocationID, t.window_start, t.window_end;
+```
+
+### Query 2
+
+```sql
+  SELECT t.PULocationID AS location, SUM(t.tips) AS total_tips, t.window_start AS start, t.window_end AS end
+  INTO table2
+  FROM TTL(HOP(trips, pickup_datetime, '1 MINUTE', '3 MINUTES'), pickup_datetime, '5 MINUTES') t
+  GROUP BY t.PULocationID, t.window_start, t.window_end;
+```
+
+![ttl_graph](../images/ttl_graph.png)
+
+## Running
+
+### Dozer
+
+To run Dozer, navigate to the `/sql/ttl` folder and use the following command:
+
+```bash
+dozer run
+```
+
+To remove the cache directory, use
+
+```bash
+dozer clean
+```
+
+### Dozer Live
+
+To run with Dozer live, replace `run` with `live`
+
+```bash
+dozer live
+```
+
+Dozer live automatically deletes the cache upon stopping the program.
+
+## Querying Dozer
+
+Dozer API lets us use `filter`, `limit`, `order_by` and `skip` at the endpoints. For this example, let's order the data in descending order of the sum of `tips`.
+
+Execute the following commands in bash to get the results from the `REST` and `gRPC` APIs.
+
+### Query 1
+
+**`REST`**
+
+```bash
+curl -X POST http://localhost:8080/tumble_ttl/query \
+--header 'Content-Type: application/json' \
+--data-raw '{"$order_by": {"total_tips": "desc"}}'
+```
+
+**`gRPC`**
+
+```bash
+grpcurl -d '{"endpoint": "tumble_ttl", "query": "{\"$order_by\": {\"total_tips\": \"desc\"}}"}' \
+-plaintext localhost:50051 \
+dozer.common.CommonGrpcService/query
+```
+
+### Query 2
+
+**`REST`**
+
+```bash
+curl -X POST http://localhost:8080/hop_ttl/query \
+--header 'Content-Type: application/json' \
+--data-raw '{"$order_by": {"total_tips": "desc"}}'
+```
+
+**`gRPC`**
+
+```bash
+grpcurl -d '{"endpoint": "hop_ttl", "query": "{\"$order_by\": {\"total_tips\": \"desc\"}}"}' \
+-plaintext localhost:50051 \
+dozer.common.CommonGrpcService/query
+```
diff --git a/sql/ttl/dozer-config.yaml b/sql/ttl/dozer-config.yaml
new file mode 100644
index 00000000..ed4bcca4
--- /dev/null
+++ b/sql/ttl/dozer-config.yaml
@@ -0,0 +1,46 @@
+app_name: ttl-sample
+version: 1
+
+connections:
+  - config: !LocalStorage
+      details:
+        path: ../data
+      tables:
+        - !Table
+          name: trips
+          config: !Parquet
+            path: trips
+            extension: .parquet
+    name: ny_taxi
+
+sources:
+  - name: trips
+    table_name: trips
+    connection: ny_taxi
+
+sql: |
+
+  -- get the total tips for each location in a 2 minute window
+
+  SELECT t.PULocationID AS location, SUM(t.tips) AS total_tips, t.window_start AS start, t.window_end AS end
+  INTO table1
+  FROM TTL(TUMBLE(trips, pickup_datetime, '2 MINUTES'), pickup_datetime, '5 MINUTES') t
+  GROUP BY t.PULocationID, t.window_start, t.window_end;
+
+  -- get the total tips for each location where every window of 3 minutes overlaps by 1 minute
+
+  SELECT t.PULocationID AS location, SUM(t.tips) AS total_tips, t.window_start AS start, t.window_end AS end
+  INTO table2
+  FROM TTL(HOP(trips, pickup_datetime, '1 MINUTE', '3 MINUTES'), pickup_datetime, '5 MINUTES') t
+  GROUP BY t.PULocationID, t.window_start, t.window_end;
+
+endpoints:
+  - name: tumble_ttl
+    path: /tumble_ttl
+    table_name: table1
+
+  - name: hop_ttl
+    path: /hop_ttl
+    table_name: table2
+
+cache_max_map_size: 2147483648
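Note on the REST examples in `sql/ttl/README.md`: the same `$order_by` query can be issued from a script instead of `curl`. The sketch below is a minimal illustration, not part of this patch. It assumes the `tumble_ttl` and `hop_ttl` endpoints from `dozer-config.yaml` are served on `localhost:8080`, and that the `limit` option mentioned in the README is spelled `$limit` in the request body, by analogy with `$order_by`. The helper names `build_query_request` and `run_query` are hypothetical.

```python
import json
import urllib.request


def build_query_request(endpoint, order_by, limit=None,
                        base_url="http://localhost:8080"):
    """Build (but do not send) a POST request for a Dozer REST query endpoint."""
    query = {"$order_by": order_by}
    if limit is not None:
        # Assumption: `$limit` uses the same `$` prefix as `$order_by`.
        query["$limit"] = limit
    return urllib.request.Request(
        url=f"{base_url}/{endpoint}/query",
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(request):
    """Send the request and decode the JSON response (needs a running Dozer app)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    req = build_query_request("tumble_ttl", {"total_tips": "desc"}, limit=5)
    print(req.full_url)
    print(req.data.decode("utf-8"))
    # Uncomment while `dozer run` is serving on localhost:8080:
    # print(run_query(req))
```

Building the request separately from sending it keeps the payload easy to inspect before the app is running; swapping `"tumble_ttl"` for `"hop_ttl"` targets the second query.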