# Scaling E-Commerce

In this example we demonstrate Dozer's ability to process large volumes of data.

## Table of Contents

- Data Schema and Volume
- Instance Type
- Experiment 1
- Experiment 2
- API Performance
Running instructions can be found here.
## Data Schema and Volume

Let's consider the following schema. The data source has 4 tables: `orders`, `order_items`, `products`, and `customers`.
Data has been generated using [dbldatagen](https://github.com/databrickslabs/dbldatagen). We generate 11 million rows for `customers`, 11 million rows for `orders`, 16 million rows for `products`, and 100 million rows for `order_items`. These parameters can be adjusted in `generate.py`.
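For illustration, a dbldatagen table spec looks roughly like the sketch below. The column templates, partition count, and output path are illustrative placeholders, not the actual values from `generate.py`.

```python
import dbldatagen as dg
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("generate").getOrCreate()

# Spec for the customers table: 11 million rows, as in the volumes below.
# Column templates, partition count, and output path are illustrative only.
customers_spec = (
    dg.DataGenerator(spark, name="customers", rows=11_000_000, partitions=32)
    .withColumn("customer_id", "long", uniqueValues=11_000_000)
    .withColumn("name", "string", template=r"\w \w")
    .withColumn("email", "string", template=r"\w.\w@\w.com")
)

# Build the DataFrame and write it out for Dozer to ingest.
customers_df = customers_spec.build()
customers_df.write.mode("overwrite").option("header", True).csv("data/customers")
```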
| Table | No. of Rows |
|---|---|
| customers | 11,000,000 |
| orders | 11,000,000 |
| order_items | 100,000,000 |
| products | 16,000,000 |
## Instance Type

The following tests were run on an AWS Graviton `m7g.8xlarge` instance. Storage-optimized instances such as `im4gn.large` could be used instead if a high volume of disk reads is expected, as in the case of external storage.

| Instance Type | vCPUs | Memory (GiB) |
|---|---|---|
| m7g.8xlarge | 32 | 128 |
## Experiment 1

Running Dozer directly from source to cache (`direct-config.yaml`), with no transformations.

```bash
dozer clean -c direct-config.yaml
dozer build -c direct-config.yaml
dozer run app -c direct-config.yaml
```
- Took roughly `4 mins` to process all the records.
- Note that processing of `customers`, `orders`, and `order_items` finished in about `2 mins`, while `products` took most of the remaining time.
- Pipeline latency is very low (`~0.04s`) as there is no transformation involved.
| Start Time | End Time | Elapsed |
|---|---|---|
| 3:00:50 PM | 3:04:38 PM | ~ 4 mins |
## Experiment 2

Running Dozer with aggregations and joins. We run 3 cascading JOINs and a COUNT aggregation on the data source. The SQL can be found in `aggregate-config.yaml`.
```sql
SELECT c.customer_id, c.name, c.email, o.order_id, o.order_date, o.total_amount, COUNT(*)
INTO customer_orders
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
JOIN order_items i ON o.order_id = i.order_id
JOIN products p ON i.product_id = p.product_id
GROUP BY c.customer_id, c.name, c.email, o.order_id, o.order_date, o.total_amount
```
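To make the output shape concrete, here is a rough pandas equivalent of the same query on toy data. The values are made up; only the join and group-by structure mirrors the SQL above. Each output row is one order, and the count is the number of its line items.

```python
import pandas as pd

# Toy stand-ins for the four source tables (values are made up).
customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Ann", "Bob"],
                          "email": ["ann@x.com", "bob@x.com"]})
orders = pd.DataFrame({"order_id": [10, 11], "customer_id": [1, 2],
                       "order_date": ["2023-01-01", "2023-01-02"],
                       "total_amount": [30.0, 15.0]})
order_items = pd.DataFrame({"order_id": [10, 10, 11], "product_id": [100, 101, 100]})
products = pd.DataFrame({"product_id": [100, 101], "price": [10.0, 20.0]})

# Cascade of the three joins from the SQL above.
joined = (customers.merge(orders, on="customer_id")
                   .merge(order_items, on="order_id")
                   .merge(products, on="product_id"))

# COUNT(*) per group: each output row counts the matching order_items rows.
customer_orders = (joined
    .groupby(["customer_id", "name", "email", "order_id", "order_date", "total_amount"])
    .size().reset_index(name="count"))
print(customer_orders)
```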
```bash
dozer clean -c aggregate-config.yaml
dozer build -c aggregate-config.yaml
dozer run app -c aggregate-config.yaml
```
- Took roughly `12 mins` to process all the records.
- Note that here the total number of `order_items` records processed increases in conjunction with `products`. This is due to the join dependency: `order_items` rows can only be emitted once the matching `products` rows have been ingested.
- Pipeline latency stays under `1s` even with 3 joins and an aggregation.
| Start Time | End Time | Elapsed |
|---|---|---|
| 2:32:48 PM | 2:44:51 PM | ~ 12 mins |
## API Performance

Dozer really shines when it comes to API performance, as views are pre-materialized. Dozer automatically generates gRPC and REST APIs.
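As a quick illustration, the REST side can be queried with any HTTP client. In the sketch below, the port (`8080`), the `/customers/query` route, and the `$limit` filter key are assumptions for the sketch, not values confirmed by this example's config; check the generated API for the exact routes it exposes.

```python
import requests

# Assumptions for this sketch: REST server on port 8080, a /customers/query
# route, and a JSON filter body with a "$limit" key. Verify these against
# your config and the generated API before relying on them.
resp = requests.post("http://localhost:8080/customers/query", json={"$limit": 10})
resp.raise_for_status()
print(resp.json())
```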
Let's use [ghz](https://ghz.sh) to run a load test against the gRPC server. You can find the script here.
```bash
HOST=localhost:50051
TOTAL=100000
CONCURRENCY=50

echo "Querying count of customers: $TOTAL requests and $CONCURRENCY concurrency"

ghz --insecure --proto .dozer/api/customers/v0001/common.proto \
    --call dozer.common.CommonGrpcService.query \
    --total $TOTAL --concurrency $CONCURRENCY \
    --data '{"endpoint":"customers"}' $HOST
```
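The same query can also be issued from code. The sketch below assumes Python stubs generated from the `common.proto` above; the request message name `QueryRequest` is an assumption inferred from the ghz payload, not confirmed from the proto.

```python
# Sketch of a programmatic gRPC client. Assumes stubs generated from the
# proto that `dozer build` emits, e.g.:
#   python -m grpc_tools.protoc -I .dozer/api/customers/v0001 \
#       --python_out=. --grpc_python_out=. common.proto
# The QueryRequest message name is an assumption; the `endpoint` field is
# taken from the ghz payload above.
import grpc
import common_pb2
import common_pb2_grpc

with grpc.insecure_channel("localhost:50051") as channel:
    stub = common_pb2_grpc.CommonGrpcServiceStub(channel)
    response = stub.query(common_pb2.QueryRequest(endpoint="customers"))
    print(response)
```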
Dozer maintains an average latency of `4.92 ms` at high throughput: `100,000` total requests at a concurrency of `50` complete in `10.83 s`, i.e. roughly `9,234` requests/sec.
```
Summary:
  Count:        100000
  Total:        10.83 s
  Slowest:      30.56 ms
  Fastest:      1.27 ms
  Average:      4.92 ms
  Requests/sec: 9234.20

Response time histogram:
  1.268  [1]     |
  4.197  [48567] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  7.126  [36274] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  10.056 [10248] |∎∎∎∎∎∎∎∎
  12.985 [3135]  |∎∎∎
  15.915 [1152]  |∎
  18.844 [414]   |
  21.773 [104]   |
  24.703 [73]    |
  27.632 [26]    |
  30.561 [6]     |

Latency distribution:
  10 % in 2.44 ms
  25 % in 3.18 ms
  50 % in 4.27 ms
  75 % in 5.92 ms
  90 % in 8.11 ms
  95 % in 10.00 ms
  99 % in 14.63 ms

Status code distribution:
  [OK] 100000 responses
```