(To perform this lab using Glue ETL Jobs and Workflows instead, go here)
A Glue Development Endpoint is an environment for you to develop and test your Glue scripts / jobs. Configuring a Development Endpoint spins up the necessary network and machines to simplify ETL scripting with AWS resources in a VPC.
In this lab, you will join two separate dataframes: the raw dataset
delivered by Firehose and a manually uploaded reference dataset.
The raw dataset contains a list of tracks, devices and activities from Firehose. The reference dataset contains a list of tracks, track titles and artist names.
You will be using Glue to perform basic transformations such as filtering and joining.
In this step, you will upload and crawl a new Glue dataset from a manual JSON file.
- Open your S3 bucket YOUR_USERNAME-datalake-demo-bucket: https://s3.console.aws.amazon.com/s3/home?region=us-east-1#
- Open the subfolder data, and create a subfolder called reference_data. Your bucket should look like this:

YOUR_USERNAME-datalake-demo-bucket
├── data/
│   ├── raw/
│   └── reference_data/
└── (other project assets: code etc.)

- Download the file tracks_list.json and upload it into the reference_data/ folder.
- Open the Glue crawler console. Select the crawler you created earlier, CrawlDataFromKDG, and click Run crawler.
- The crawler picks up new data in the S3 bucket and automatically creates new tables in the database.
- Notice how this creates two new Glue tables, raw and reference_data.
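The upload and crawl steps above can also be scripted. The following is a minimal sketch using boto3, assuming the folder layout shown earlier (so the file lands under data/reference_data/) and the crawler name CrawlDataFromKDG from this lab. The function name `upload_and_crawl` is hypothetical, and the clients are passed in so the logic can be exercised without AWS credentials.

```python
import os


def upload_and_crawl(s3_client, glue_client, bucket,
                     local_path="tracks_list.json",
                     crawler_name="CrawlDataFromKDG"):
    """Upload the reference file, then start the crawler.

    Returns the S3 key the file was uploaded to.
    """
    # Assumption: reference data lives under data/reference_data/ in the bucket.
    key = "data/reference_data/" + os.path.basename(local_path)
    s3_client.upload_file(local_path, bucket, key)
    # start_crawler returns immediately; the crawl itself runs asynchronously.
    glue_client.start_crawler(Name=crawler_name)
    return key
```

With credentials configured, it could be called as `upload_and_crawl(boto3.client("s3"), boto3.client("glue"), "YOUR_USERNAME-datalake-demo-bucket")` after `import boto3`. The new tables appear only once the crawler run has finished.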
In this step, you will create a Glue development endpoint to interactively develop Glue ETL scripts using PySpark.
- Go to: https://console.aws.amazon.com/glue/home?region=us-east-1#etl:tab=devEndpoints
- Click - Add endpoint
- Development endpoint name - DevEndpoint1
- IAM role - AWSGlueServiceRoleLab
- Expand - Security configuration.. parameters
- Data processing units (DPUs): 2 (this affects the cost of running this lab)
- A single Data Processing Unit (DPU) provides 4 vCPU and 16 GB of memory.
- Click - Next
- Networking screen :
- Choose - Skip networking information
- Add an SSH public key (Optional)
- Leave as defaults
- Click: Next
- Review the settings
- Click: Finish
It will take close to 10 minutes for the Dev Endpoint to reach the READY status. Wait for this step to complete before moving on to the next step.
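For reference, the same endpoint can be created and polled from code. This is a sketch only, assuming boto3's Glue client (`create_dev_endpoint` and `get_dev_endpoint`) and the settings used above: name DevEndpoint1 and 2 DPUs. The function name, the polling interval and the `role_arn` argument (the full ARN of AWSGlueServiceRoleLab in your account) are assumptions introduced for illustration.

```python
import time


def create_endpoint_and_wait(glue_client, role_arn, name="DevEndpoint1",
                             dpus=2, poll_seconds=30, max_polls=40,
                             sleep=time.sleep):
    """Create a Glue development endpoint and block until it is READY."""
    glue_client.create_dev_endpoint(
        EndpointName=name,
        RoleArn=role_arn,       # e.g. the ARN of AWSGlueServiceRoleLab
        NumberOfNodes=dpus,     # number of DPUs allocated to the endpoint
    )
    for _ in range(max_polls):
        response = glue_client.get_dev_endpoint(EndpointName=name)
        status = response["DevEndpoint"]["Status"]
        if status == "READY":
            return status
        if status == "FAILED":
            raise RuntimeError(f"Dev endpoint {name} failed to provision")
        sleep(poll_seconds)
    raise TimeoutError(f"Dev endpoint {name} not READY after {max_polls} polls")
```

With the defaults, the loop checks every 30 seconds for up to 20 minutes, which leaves headroom over the roughly 10 minutes the endpoint usually needs.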
In this step, we will launch a notebook instance to use as our workspace. We will be using SageMaker notebook instances in this lab.
- In the left navigation pane, under Dev endpoints, click Notebooks.
- Select tab: SageMaker notebooks
- Click: Create notebook
- Notebook name: aws-glue-notebook1
- Attach to development endpoint: DevEndpoint1
- Choose: Create an IAM role
- IAM Role: AWSGlueServiceSageMakerNotebookRole-default
- VPC (optional): Leave blank
- Encryption key (optional): Leave blank
- Click: Create Notebook
This will take a few minutes. Wait for the notebook instance to be Ready. In the meantime, check out the differences between SageMaker and Zeppelin instances.
- Download and save this file locally on your laptop: datalake-notebook.ipynb
- In the Notebooks console, click on the name of the notebook you have just created: aws-glue-notebook1
- Click on Open to launch the web interface for the notebook instance.
- On the SageMaker Jupyter notebook:
- Upload the sample datalake-notebook.ipynb you downloaded earlier.
- Click on datalake-notebook.ipynb to open the notebook.
- Make sure it says 'Sparkmagic (PySpark)' in the top-right corner of the notebook. This is the name of the kernel Jupyter will use to execute code blocks in this notebook.
The SageMaker notebook takes you through how to load, transform and write output data using Glue APIs and DynamicFrames. Read and understand the instructions as they explain important Glue concepts.
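As a rough orientation before opening the notebook, the load, filter, join and write flow might look like the sketch below. It is not the notebook's actual code. It assumes the tables are named raw and reference_data as created by the crawler, that both share a key column (the name `track_id` is a guess), and that the output is written as Parquet under data/processed_data. The `database` argument, the helper names and the output format are all assumptions.

```python
def processed_path(bucket):
    """S3 location for the transformed output."""
    return f"s3://{bucket}/data/processed_data/"


def join_and_write(glue_context, database, bucket, key="track_id"):
    """Join the raw and reference tables on `key` and write the result to S3."""
    # Imported lazily: awsglue is only available inside a Glue environment
    # such as the development endpoint behind this notebook.
    from awsglue.transforms import Filter, Join

    raw = glue_context.create_dynamic_frame.from_catalog(
        database=database, table_name="raw")
    ref = glue_context.create_dynamic_frame.from_catalog(
        database=database, table_name="reference_data")

    # Filtering: drop raw records that have no join key.
    raw = Filter.apply(frame=raw, f=lambda row: row[key] is not None)

    # Joining: equality join of the two DynamicFrames on the key column.
    joined = Join.apply(raw, ref, key, key)

    glue_context.write_dynamic_frame.from_options(
        frame=joined,
        connection_type="s3",
        connection_options={"path": processed_path(bucket)},
        format="parquet",  # assumption: the lab may use a different format
    )
    return joined
```

Inside the notebook, a context could be obtained with `GlueContext(SparkContext.getOrCreate())`, using `GlueContext` from `awsglue.context` and `SparkContext` from `pyspark.context`.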
Once the ETL script has run successfully, you can inspect the output of the SparkSQL transformations.
- Look into your S3 bucket: YOUR_USERNAME-datalake-demo-bucket/data/processed_data
- Inspect the new Glue table processed_data using Athena
- Explore more built-in transformations provided by Glue: Built-in Transforms
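The Athena inspection of processed_data mentioned above can also be started from code. The sketch below assumes boto3's Athena client. The `database` and `results_location` arguments (an S3 URI where Athena stores query results) are placeholders you would fill in for your own account, and the function name is hypothetical.

```python
def start_preview_query(athena_client, database, results_location,
                        table="processed_data", limit=10):
    """Start an Athena query that previews the first rows of the table.

    Returns the query execution ID. The query runs asynchronously.
    """
    sql = f'SELECT * FROM "{table}" LIMIT {int(limit)}'
    response = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": results_location},
    )
    return response["QueryExecutionId"]
```

The returned ID can then be passed to `get_query_execution` and `get_query_results` on the same client once the query has finished.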