We are looking for a high quality data engineer which can deliver comprehensive solutions for our continuity and business growth.
The Analytics team drives the data culture, we want to change how we produce data from large batches to micro batching, from daily to near real-time/streaming processing, from tabular reports to insightful dashboards.
You can be part of an amazing team which deals with data all the time using different process, tools and technologies.
Following is a little treasure and challenge for those keen on joining this amazing company and team.
For a Junior/Mid role we are expecting at least 2-3 tables to be loaded and an aggregated report done.
We are expecting the most from you.
Build a small ETL process to digest a few set of files into a data warehouse like project.
We are expecting an end-to-end ETL solution to deliver a simple star schema which an end user can easily slice and dice the data through a report or using basic ad-hoc query.
We are a Python and SQL workshop, we would like to see this project using just those tools.
However, we are open to other tools and technologies if we are able to easily replicate on our side.
For the database, use a simple and light optimizer for your database, choose the one which can run a browser, but don't be limited to it.
Please, avoid licensed products, we may not be able to proceed with this restriction on our own, if this is the case you may need to book a meeting to bring your tool and demo to us.
Fork this repo, build your ETL process and commit the code with your answers. Open a Pull Request and send us a message highlighting the test is completed.
- it must come with step by step instructions to run the code.
- please, be mindful that your code might be moved or deleted after we analyse the PR.
- use the best practices
- be able to explain from the ground up the whole process on face to face interview
- The data for this exercise can be found on the
data.zip
file. Can you describe the file format?
Super Bonus: generate your own data through the instructions on the encoded file bonus_etl_data_gen.txt
.
To get the bonus points, please encoded the file with the instructions were used to generate the files.
-
Code you scripts to load the data into a database.
-
Design a star schema model which the data should flow.
-
Build your process to load the data into the star schema
Bonus point:
- add a fields to classify the customer account balance in 3 groups
- add revenue per line item
- convert the dates to be distributed over the last 2 years
- How to schedule this process to run multiple times per day?
Bonus: What to do if the data arrives in random order and times via streaming?
- How to deploy this code?
Bonus: Can you make it to run on a container like process (Docker)?
One of the most important aspects to build a DWH is to deliver insights to end-users.
Can you using the designed star schema (or if you prefer the raw data), generate SQL statements to answer the following questions:
-
What are the top 5 nations in terms of revenue?
-
From the top 5 nations, what is the most common shipping mode?
-
What are the top selling months?
-
Who are the top customer in terms of revenue and/or quantity?
-
Compare the sales revenue of on current period against previous period?
Data profiling are bonus.
What tools or techniques you would use to profile the data?
What results of the data profiling can impact on your analysis and design?
If this pipeline is to be build for a real live environment. What would be your recommendations in terms of tools and process?
Would be a problem if the data from the source system is growing at 6.1-12.7% rate a month?
Author: adilsonmendonca