A Slalom DataOps Lab
Table of Contents:
- One-time setup:
  - Installed software via:
    - Core DevOps Tools: http://docs.dataops.tk/setup
    - AWS CLI: `choco install awscli` (Windows) or `brew install awscli` (Mac)
    - Tapdance extract tool: `pip3 install tapdance`
  - Environment setup
- Open browser tabs:
  - The lab checklist (this page)
  - Linux Academy
  - slalom-ggp/dataops-project-template
Option A: Start from Previous Lab (Recommended):
Use this option if you've already completed the previous lab and have successfully run `terraform apply`.
If you've completed the previous lab, and if you used the recommended Linux Academy Playground to do so, your 4-hour-limited AWS environment has likely been reset. Follow these instructions to get a new environment and reset your git repo to use this new account.
- Create a new AWS Sandbox environment at playground.linuxacademy.com.
- Update your credentials file (`.secrets/aws-credentials`) with the newly provided Access Key ID and Secret Access Key.
- Navigate to your `infra` folder and rename the `terraform.tfstate` file to `terraform.tfstate.old`.
  - IMPORTANT: In a real-world environment, you never want to delete or corrupt your "tfstate" file, since the state file is Terraform's way of tracking the resources it is responsible for. In this unique case, however, our environment has already been purged by the Linux Academy 4-hour time limit, and renaming this file allows us to start fresh in a new account.
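If you prefer to do the rename from the terminal, here is a minimal sketch (run from the repo root, assuming the folder layout described above):

```bash
cd infra
# Keep the old state around for reference rather than deleting it outright.
# This is only safe because the sandbox resources were already purged -
# never discard state in a real environment.
mv terraform.tfstate terraform.tfstate.old
```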
That's it! You should now be able to run `terraform apply` again, which will recreate the same baseline environment you created in the previous "data-lake" lab.
Option B: Starting from Scratch:
If you've not yet completed the data lake lab, go back and do so now. (Skip all exercises labeled "Extra Credit".)
For this lab, you will be extracting data using the Covid-19 tap `tap-covid-19`, as documented here. Before you start, take a moment to become familiar with the provided documentation.
As documented on the Covid-19 Quick Start link here, the `covid-19` tap requires at minimum the following settings:

- `api_token` - A GitHub token you will create, which allows the tap to authenticate on your behalf and extract data from the Johns Hopkins Covid-19 repo.
- `user_agent` - Simply a name and email address (e.g. `tap-covid-19 <api_user_email@your_company.com>`), which is again used to identify the tap as it uses the GitHub API to extract source data from the upstream source.
- `start_date` - A datetime stamp (e.g. `2019-01-01T00:00:00Z`) specifying the earliest date from which to extract data from the source.
In this section, you will create a new GitHub token. This token allows the tap to authenticate as you whenever it reads data from the Covid-19 dataset. Since it only needs to read data from a public repo, we can give the token restricted permissions so that it can only perform read-only actions.
- Navigate to your GitHub Tokens screen in GitHub.
- Generate a new token and grant it `Read` access on `Public Repos`. In the 'Note' space, you can provide any text, for example `Covid-19 Extracts` or `Cloud DE lab`.
- Once created, note the new key shown on the screen - you will need this key to complete the next step.
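Optionally, you can sanity-check the new token from a terminal before continuing. This assumes `curl` is available; substitute your actual token for the placeholder:

```bash
# A JSON response (rather than a 401 error) confirms GitHub accepts the token.
curl -H "Authorization: token YOUR_GITHUB_API_TOKEN" https://api.github.com/rate_limit
```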
- Navigate to the folder `data/taps/.secrets` and create a file called `tap-covid-19-config.json` with the following text:

  ```json
  {
    "api_token": "YOUR_GITHUB_API_TOKEN",
    "start_date": "2019-01-01T00:00:00Z",
    "user_agent": "tap-covid-19 <api_user_email@your_company.com>"
  }
  ```

- Paste your new GitHub token into the `api_token` field.
- Override the second part of the `user_agent` string, replacing the sample email with your own email (for purposes of identification during data extraction).
- Important: Don't forget to save the new file (`Ctrl+S` or `File` > `Save`).
Note: It is important that this file - and any other file containing secret keys - never be committed to source control. In VS Code, you can confirm this: your new file should appear grey in the file explorer, and it should not appear in the Git changes panel. This exclusion from git works because the contents of the `.secrets` folder are explicitly and automatically excluded from Git using the `.gitignore` file, which is stored in the root of your repo.
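For reference, the relevant rules in the root `.gitignore` look something like the sketch below; the exact patterns in your copy of the template may differ:

```
# Illustrative patterns only - check your repo's actual .gitignore:
**/.secrets/*
*.tfstate
*.tfstate.*
```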
The `tapdance` extract tool is a wrapper for the open source Singer.io taps and targets framework, and it requires a rules file to specify which tables and fields should be captured. In this section, you will create a very simple rules file and then test the rules by creating a plan and confirming the details of that plan file.
- Create a simple rules file at `data/taps/covid-19.rules.txt` and paste in the following single line:

  ```
  *.*
  ```

  - This tells the `tapdance` tool we want to pull all tables and all columns from the `covid-19` tap.
- Open a terminal in the folder `data/taps` by right-clicking the folder and selecting "Open in Integrated Terminal".
- Run `tapdance plan covid-19` in the new terminal to update the extract plan file.
  - Important: If Docker is not working on your machine, you do not have to complete this step. You can safely continue on to the next step.
  - If you do not have Docker set up, or if Docker is not able to access your local filesystem: first confirm Docker is installed (`choco install docker-desktop` or `brew install docker`), and then check the Troubleshooting guide if you still receive an error.
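A quick, optional way to confirm Docker is installed and its daemon is reachable:

```bash
docker --version   # confirms the Docker CLI is installed
docker info        # an error here usually means the Docker daemon isn't running
```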
- Open the new file `tap-covid-19-plan.yml` and review the contents. You should see a list of to-be-extracted tables and columns.
  - Since we specified `*.*` in our rules file, the resulting plan will include all tables and all columns.
- Note that all of the tables contain a column called `git_owner` and a column called `git_html_url` - neither of which is needed for our analysis.
- Re-open the rules file `data/taps/covid-19.rules.txt` and add the following three lines to the bottom of the file:

  ```
  # Let's exclude the extra columns we don't need:
  !*.git_owner
  !*.git_html_url
  ```

  - The `!` symbol at the beginning of each line tells tapdance that we want to exclude any columns that match these rules.
  - The `*` wildcard in the table portion of each rule specifies that we want this exclusion to be performed for any and all tables from the `covid-19` source that might have these columns.
  - The `#` symbol indicates a comment - which makes your file easier to understand, but otherwise does not change any functionality.
- Finally, re-run the command `tapdance plan covid-19` and confirm the column exclusions in the extract plan file.
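For reference, after both edits your complete `covid-19.rules.txt` file should contain:

```
*.*
# Let's exclude the extra columns we don't need:
!*.git_owner
!*.git_html_url
```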
In this section, we will use a module from the Slalom Infrastructure Catalog which can orchestrate Singer taps on AWS using ECS. We will need to provide the Terraform module with details on (1) where our tap configuration is stored (the `data/taps` folder), (2) what credentials or secrets should be used during execution, (3) on what frequency the extracts should be run, and (4) which target should be used to land the data.
- Make sure your AWS credentials in `.secrets/aws-credentials` are correctly configured for Terraform.
  - If using the Linux Academy AWS Sandbox:
    - You may need to refresh your Sandbox environment and paste new credentials into the `aws-credentials` file, which is inside the `.secrets` folder at the root of the repo. For a refresher, see the instructions in the previous lab.
    - If you've already run `terraform apply` and your sandbox has expired, you will need to delete the state file to avoid running into errors when trying to access now-expired credentials and resources. The state file is called `terraform.tfstate` and will be inside the `infra` folder.
- Copy the sample `02_singer-taps.tf` file from the template project here into your `infra` folder.
- In the space provided for `tap_id` (line 3), enter the text "covid-19". This indicates the source plugin `tap-covid-19`, which maps to this repo.
- In the `secrets` section of the configuration, replace 'username' and 'password' with the names of the two secrets in our JSON config file: `user_agent` and `api_token`, respectively.
- Confirm your file against the sample here.
- Open a terminal in the `infra` folder and run `terraform init --upgrade`. The `--upgrade` flag tells Terraform to pull down the latest versions of all upstream modules.
- Run `terraform apply` to deploy the infrastructure.
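The full sequence from a terminal at the repo root looks like this (assuming `terraform` is on your PATH):

```bash
cd infra
terraform init --upgrade   # pulls the latest versions of upstream modules
terraform apply            # review the plan, then type 'yes' to deploy
```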
At this point, your infrastructure has deployed successfully to AWS using Terraform. According to the schedule defined, it will run automatically each day. However, instead of waiting until the next execution, we will now kick off the extract manually using ECS and the AWS command line ("awscli").
- Open a new terminal in the `infra` folder (or reuse an existing one).
- If you are on Windows and your terminal is a PowerShell terminal, type `cmd` into the terminal and press `<enter>` to switch back to Command Prompt. (PowerShell is not yet supported for the next "User Switch" step.)
- Run `terraform output` from the `infra` folder in order to print out the Terraform output variables.
- From the Terraform output, copy-paste and then run the provided `AWS User Switch` command. (This helps the AWS CLI locate our credentials.)
- Copy-paste and run the `Sync command` from the Terraform output. This will manually execute the sync (extract and load) in ECS.
- Click on the `Logging URL` link in the Terraform output to open CloudWatch logs in a web browser.
- Wait one or two minutes for the ECS task to start.
  - If the job kicks off successfully, you will see logs begin to appear after 1-2 minutes.
- Once logs are coming through into CloudWatch, navigate to the S3 Service in your AWS Web Console and open the data bucket. Explore the new data files and folders which are landing there in S3.
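If you prefer the command line over the web console, you can also list the landed files with the AWS CLI. The bucket name below is a placeholder; take the real name from your Terraform output or the S3 console:

```bash
# YOUR-DATA-BUCKET is a placeholder - substitute your actual bucket name.
aws s3 ls s3://YOUR-DATA-BUCKET/ --recursive --human-readable
```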
At this point, your infrastructure has deployed successfully to AWS using Terraform. In this optional exercise, you'll explore more deeply and review what has been deployed.
- Open a new terminal in the `infra` folder (or reuse an existing one).
- Run `terraform show` to print a full output of the tracked Terraform resources.
- Search (Ctrl+F) to find each of the following AWS resource types:
  - `aws_s3_bucket`
  - `aws_iam_role`
  - `aws_subnet`
- In the file explorer, navigate to the file `terraform.tfstate` in the `infra` folder and open it. Search again for the above components, this time from within the state file.
- Once you are done exploring, close the `terraform.tfstate` file and note that the file is greyed out in the file explorer. Just as with the secrets files we created, this file is likewise automatically excluded from git based upon rules in our project's `.gitignore` file.
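As an alternative to searching by hand, you can filter the `terraform show` output directly in the terminal (Unix-style shell shown; in PowerShell, pipe to `Select-String` instead of `grep`):

```bash
terraform show | grep -E "aws_s3_bucket|aws_iam_role|aws_subnet"
```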
Congratulations! You've created a fully automated data extract pipeline in an hour or less, which can now be instantly deployed to any AWS environment. If new columns and tables are added to our source in the future, they will automatically be included so long as they match our specified rules. If columns are dropped or renamed, the data may become out of sync with our plan file, but the extraction itself will continue running regardless of upstream changes. Similarly, we've avoided any hard-coding of data types, and our process will therefore be resilient to any upstream data type changes which may occur in the future.
For troubleshooting tips, please see the Lab Troubleshooting Guide.