# Scaling E-Commerce

In this example we demonstrate Dozer's ability to process large volumes of data.

## Table of Contents

- Data Schema and Volume
- Instance Type
- Experiment 1
- Experiment 2
- API Performance
Running instructions can be found here.
## Data Schema and Volume

Let's consider the following schema. The data source has 4 tables: `orders`, `order_items`, `products`, and `customers`.
Data has been generated using [dbldatagen](https://github.com/databrickslabs/dbldatagen). We generate 11 million rows for `customers`, 11 million rows for `orders`, 16 million rows for `products`, and 100 million rows for `order_items`. These parameters can be adjusted in `generate.py`.
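For illustration, a dbldatagen table spec looks roughly like the sketch below. The column templates, partition count, and output path are illustrative placeholders, not the actual values from `generate.py`.

```python
import dbldatagen as dg
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("generate").getOrCreate()

# Spec for the customers table: 11 million rows, as in the volumes below.
# Column templates, partition count, and output path are illustrative only.
customers_spec = (
    dg.DataGenerator(spark, name="customers", rows=11_000_000, partitions=32)
    .withColumn("customer_id", "long", uniqueValues=11_000_000)
    .withColumn("name", "string", template=r"\w \w")
    .withColumn("email", "string", template=r"\w.\w@\w.com")
)

# Build the DataFrame and write it out for Dozer to ingest.
customers_df = customers_spec.build()
customers_df.write.mode("overwrite").option("header", True).csv("data/customers")
```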
| Table | No. of Rows |
|---|---|
| customers | 11,000,000 |
| orders | 11,000,000 |
| order_items | 100,000,000 |
| products | 16,000,000 |
## Instance Type

The following tests were run on an AWS Graviton `m7g.8xlarge` instance. Storage-optimized instances such as `im4gn.large` could be used instead if a high volume of disk reads is expected, as in the case of external storage.

| Instance Type | vCPUs | Memory (GiB) |
|---|---|---|
| m7g.8xlarge | 32 | 128 |
## Experiment 1

Running Dozer directly from source to cache (`direct-config.yaml`), with no transformations.

```bash
dozer clean -c direct-config.yaml
dozer build -c direct-config.yaml
dozer run app -c direct-config.yaml
```
- Took roughly `4 mins` to process all the records.
- Note that processing of `customers`, `orders`, and `order_items` finished in about `2 mins`, while `products` took most of the remaining time.
- Pipeline latency is very low (`~0.04s`) as there is no transformation involved.
| Start Time | End Time | Elapsed |
|---|---|---|
| 3:00:50 PM | 3:04:38 PM | ~ 4 mins |
## Experiment 2

Running Dozer with aggregations and joins. We run 3 cascading JOINs and a COUNT aggregation on the data source. The SQL can be found in `aggregate-config.yaml`.
```sql
SELECT c.customer_id, c.name, c.email, o.order_id, o.order_date, o.total_amount, COUNT(*)
INTO customer_orders
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
JOIN order_items i ON o.order_id = i.order_id
JOIN products p ON i.product_id = p.product_id
GROUP BY c.customer_id, c.name, c.email, o.order_id, o.order_date, o.total_amount
```
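To make the output shape concrete, here is a rough pandas equivalent of the same query on toy data. The values are made up; only the join and group-by structure mirrors the SQL above. Each output row is one order, and the count is the number of its line items.

```python
import pandas as pd

# Toy stand-ins for the four source tables (values are made up).
customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Ann", "Bob"],
                          "email": ["ann@x.com", "bob@x.com"]})
orders = pd.DataFrame({"order_id": [10, 11], "customer_id": [1, 2],
                       "order_date": ["2023-01-01", "2023-01-02"],
                       "total_amount": [30.0, 15.0]})
order_items = pd.DataFrame({"order_id": [10, 10, 11], "product_id": [100, 101, 100]})
products = pd.DataFrame({"product_id": [100, 101], "price": [10.0, 20.0]})

# Cascade of the three joins from the SQL above.
joined = (customers.merge(orders, on="customer_id")
                   .merge(order_items, on="order_id")
                   .merge(products, on="product_id"))

# COUNT(*) per group: each output row counts the matching order_items rows.
customer_orders = (joined
    .groupby(["customer_id", "name", "email", "order_id", "order_date", "total_amount"])
    .size().reset_index(name="count"))
print(customer_orders)
```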
```bash
dozer clean -c aggregate-config.yaml
dozer build -c aggregate-config.yaml
dozer run app -c aggregate-config.yaml
```
- Took roughly `12 mins` to process all the records.
- Note that here the total number of `order_items` records processed increases in conjunction with `products`. This is due to the join dependency: `order_items` rows can only be emitted once the matching `products` rows have been ingested.
- Pipeline latency stays under `1s` even with 3 joins and an aggregation.
| Start Time | End Time | Elapsed |
|---|---|---|
| 2:32:48 PM | 2:44:51 PM | ~ 12 mins |
## API Performance

Dozer really shines when it comes to API performance, as views are pre-materialized. Dozer automatically generates gRPC and REST APIs.
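As a quick illustration, the REST side can be queried with any HTTP client. In the sketch below, the port (`8080`), the `/customers/query` route, and the `$limit` filter key are assumptions for the sketch, not values confirmed by this example's config; check the generated API for the exact routes it exposes.

```python
import requests

# Assumptions for this sketch: REST server on port 8080, a /customers/query
# route, and a JSON filter body with a "$limit" key. Verify these against
# your config and the generated API before relying on them.
resp = requests.post("http://localhost:8080/customers/query", json={"$limit": 10})
resp.raise_for_status()
print(resp.json())
```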
Let's use [ghz](https://ghz.sh) to run a load test against the gRPC server. You can find the script here.
```bash
HOST=localhost:50051
TOTAL=100000
CONCURRENCY=50

echo "Querying count of customers: $TOTAL requests and $CONCURRENCY concurrency"

ghz --insecure --proto .dozer/api/customers/v0001/common.proto \
    --call dozer.common.CommonGrpcService.query \
    --total $TOTAL --concurrency $CONCURRENCY \
    --data '{"endpoint":"customers"}' $HOST
```
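The same query can also be issued from code. The sketch below assumes Python stubs generated from the `common.proto` above; the request message name `QueryRequest` is an assumption inferred from the ghz payload, not confirmed from the proto.

```python
# Sketch of a programmatic gRPC client. Assumes stubs generated from the
# proto that `dozer build` emits, e.g.:
#   python -m grpc_tools.protoc -I .dozer/api/customers/v0001 \
#       --python_out=. --grpc_python_out=. common.proto
# The QueryRequest message name is an assumption; the `endpoint` field is
# taken from the ghz payload above.
import grpc
import common_pb2
import common_pb2_grpc

with grpc.insecure_channel("localhost:50051") as channel:
    stub = common_pb2_grpc.CommonGrpcServiceStub(channel)
    response = stub.query(common_pb2.QueryRequest(endpoint="customers"))
    print(response)
```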
Dozer maintains an average latency of `4.92 ms` at high throughput: `100,000` total requests at a concurrency of `50` complete in `10.83 s`, i.e. roughly `9,234` requests/sec.
```
Summary:
  Count:        100000
  Total:        10.83 s
  Slowest:      30.56 ms
  Fastest:      1.27 ms
  Average:      4.92 ms
  Requests/sec: 9234.20

Response time histogram:
  1.268  [1]     |
  4.197  [48567] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  7.126  [36274] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  10.056 [10248] |∎∎∎∎∎∎∎∎
  12.985 [3135]  |∎∎∎
  15.915 [1152]  |∎
  18.844 [414]   |
  21.773 [104]   |
  24.703 [73]    |
  27.632 [26]    |
  30.561 [6]     |

Latency distribution:
  10 % in 2.44 ms
  25 % in 3.18 ms
  50 % in 4.27 ms
  75 % in 5.92 ms
  90 % in 8.11 ms
  95 % in 10.00 ms
  99 % in 14.63 ms

Status code distribution:
  [OK] 100000 responses
```