Skip to content

Latest commit

 

History

History
454 lines (358 loc) · 17.1 KB

File metadata and controls

454 lines (358 loc) · 17.1 KB

Building Serverless Data Lakes on AWS

Author: Unni Pillai | Amazon Web Services | Twitter | Linkedin

Architecture Diagram

Learning outcomes from this workshop?

  • Design serverless data lake architecture
  • Build a data processing pipeline and Data Lake using Amazon S3 for storing data
  • Use Amazon Kinesis for real-time streaming data
  • Use AWS Glue to automatically catalog datasets
  • Run interactive ETL scripts in an Amazon SageMaker Jupyter notebook connected to an AWS Glue development endpoint
  • Query data using Amazon Athena & visualize it using Amazon QuickSight

Pre-requisites:

  • You need to have access to an AWS account with AdminstratorAccess
  • This lab should be executed in us-east-1 region
  • Best is to follow links from this guide & open them in new a tab
  • Run this lab in a modern browser

Part 1: Ingest and Storage

Create S3 Bucket

In this step, we will navigate to S3 Console and create the S3 bucket used throughout this demo.

Login to AWS Console: https://console.aws.amazon.com/console/home?region=us-east-1

Navigate to S3 Console & Create a new bucket in us-east-1 region :

  • GoTo : https://s3.console.aws.amazon.com/s3/home?region=us-east-1
  • Click - Create Bucket
    • Bucket Name : yourname-datalake-demo-bucket
    • Region : US EAST (N. Virginia)
    • Click Create (bottom left)
  • Adding reference data
    • Open - yourname-datalake-demo-bucket
      • Click - Create folder
        • New folder called : data
        • Click - Save
    • Open - data
      • Click - Create folder
        • New folder : reference_data
        • Click - Save
    • Open - reference_data
      • download this file locally : tracks_list.json
      • Click - Upload
        • Click Add files & upload the tracks_list.json file here
        • Click Upload (bottom left)

Create Kinesis Firehose

In this step we will create navigate to Kinesis Console & create a Kinesis Firehose delivery stream to ingest data & store in S3:

  • GoTo: https://console.aws.amazon.com/kinesis/home?region=us-east-1#/get-started
  • Click Create Delivery Stream
    • Step 1: Name and source
      • Delivery stream name : sg-summit-demo-stream
      • Source : Direct PUT or other sources
      • Click Next
    • Step 2: Transform records
      • Record transformation : Disabled
      • Record format conversion: Disabled
      • Click Next
    • Step 3: Choose destination
      • Destination : Amazon S3
      • S3 bucket : yourname-datalake-demo-bucket
      • Prefix : data/raw/ (slash / after raw is important, if you miss it Firehose will copy the data in an undesired location)
      • Error prefix : Leave Blank
      • Click Next
    • Step 4: Configure settings
      • S3 buffer conditions: Buffer size: 1 (MB)
      • S3 buffer conditions: Buffer interval: 60 (sec)
      • S3 compression : Disabled
      • S3 encryption : Disabled
      • Error logging : Enabled
      • Leave everything else default
      • IAM role : Click on Create new, or Choose
        • Open a new window, leave everything to default
        • Click - Allow
      • Click - Next
    • Step 5: Review
      • Review the configuration & make sure its as mentioned above
      • Click - Create delivery stream

Generate Dummy Data

In this step we will configure Kinesis Data Generator to produce fake data and ingest it into Kinesis Firehose

  • Configure Amazon Cognito for Kinesis Data Generator - In this step we will launch a cloud formation stack that will configure Cognito. This cloudformation scripts launches in Oregon region (No need to change this region)
  • On Amazon Kinesis Data Generator homepage
    • Login with your username & password from previous step
    • Region: us-east-1
    • Stream/delivery stream : sg-summit-demo-stream
    • Records per second : 2000
    • **Record template : **In the big text area, add the following json template
    • Click - Send Data - do not click without pasting the below bit of template in the big text area

Once the tools send ~ 100,000 messages, you can click on - Stop sending data to Kinesis


{
  "uuid": "{{random.uuid}}",
  "device_ts": "{{date.utc("YYYY-MM-DD HH:mm:ss.SSS")}}",
  "device_id": {{random.number(50)}},
  "device_temp": {{random.weightedArrayElement(
    {"weights":[0.30, 0.30, 0.20, 0.20],"data":[32, 34, 28, 40]}
  )}},
  "track_id": {{random.number(30)}},  
  "activity_type": {{random.weightedArrayElement(
        {
            "weights": [0.1, 0.2, 0.2, 0.3, 0.2],
            "data": ["\"Running\"", "\"Working\"", "\"Walking\"", "\"Traveling\"", "\"Sitting\""]
        }
    )}}
}

Validate that data has arrived in S3

After few moments GoTo S3 console:https://s3.console.aws.amazon.com/s3/home?region=us-east-1

  • Click - yourname-datalake-demo-bucket > Data
  • There should be a folder called raw created > Open it and keep navigating, you will notice that firehose has dumped the data in S3 using yyyy/mm/dd/hh partitioning

Part 2 : Catalog Data

Create IAM Role

In this step we will navigate to IAM Console & create a new Glue service role, this allows AWS Glue to access data sitting in S3 and create necessary entities in Glue catalog.

  • GoTo: https://console.aws.amazon.com/iam/home?region=us-east-1#/roles
  • Click - Create role
    • Choose the service that will use this role: Glue
    • Click - Next: Permissions
    • Search for - AmazonS3FullAccess
      • Select Checkbox
    • Search for - AWSGlueServiceRole
      • Select Checkbox
    • Click - Next: Tags
      • Leave as default, no changes here
    • Click - Next: Review
    • Role name: AWSGlueServiceRoleDefault
    • make sure that are two policies attached to this role (AmazonS3FullAccess, AWSGlueServiceRole)
    • Click - Create role

Create AWS Glue Crawlers

In this step, we will navigate to AWS Glue Console & create glue crawlers to discovery the newly ingested data in S3.

  • GoTo: https://console.aws.amazon.com/glue/home?region=us-east-1
  • On the left panel, click on Crawlers > Click on Add Crawler
    • Crawler info
      • Crawler name: summitcrawler
      • Click - Next
    • Data store
      • Data store: S3
      • Crawl data in: Specified path in my account
      • Include path: s3://yourname-datalake-demo-bucket/data/
      • Click - Next
      • Add another data store : No
      • Click - Next
    • IAM Role
      • Choose: Create an IAM role
      • Role Name: AWSGlueServiceRoleDefault
      • Click - Next
    • Schedule
      • Frequency: Run on demand
      • Click - Next
    • Output
      • Click - Add database
      • Database name: summitdb
      • Click - Create
      • Click - Next
    • Review all steps
      • Review the configuration & make sure its as mentioned above
      • Click - Finish
  • You should see a message : Crawler summitcrawler was created to run on demand.
    • Click - Run on demand This will run the crawler
    • Wait for few minutes

Verify newly created tables in catalog

Navigate to Glue Catalog & explore the crawled data:

Query ingested data using Amazon Athena

Lets query newly ingested data using Amazon Athena

SELECT activity_type,
         count(activity_type)
FROM raw
GROUP BY  activity_type
ORDER BY  activity_type

Part 3 : Transform Data

Create Glue Development Endpoint

In this step you will be creating a glue endpoint to interactively develop Glue ETL scripts using PySpark

  • GoTo : https://console.aws.amazon.com/glue/home?region=us-east-1#etl:tab=devEndpoints
  • Click - Add endpoint
    • Development endpoint name - devendpoint1
      • IAM role - AWSGlueServiceRoleDefault
      • Expand - Security configuration.. parameters
        • Data processing units (DPUs): 2 (this affects the cost of the running this lab)
    • Click - Next
    • Networking screen :
      • Choose - Skip networking information
    • Add an SSH public key (Optional)
      • Leave as defaults
      • Click: Next
    • Review the settings
      • Click: Finish

It will take close to 10 mins for the new Glue console to spin up.

You have to wait for this step to complete before moving to next step.

Create SageMaker Notebooks (Jupyter) for Glue Dev Endpoints

This will take few minutes, wait for this to finish

Launch Jupyter Notebook

  • Download and save this file locally on your laptop : summit-techfest-datalake-notebook.ipynb
  • GoTo: https://console.aws.amazon.com/glue/home?region=us-east-1#etl:tab=notebooks
  • Click - aws-glue-notebook1
  • Click - Open, This will open a new tab
  • On Sagemaker Jupyter Notebook
    • Click - Upload (right top part of screen)
    • Browse and upload summit-techfest-datalake-notebook.ipynb which you downloaded earlier
    • Click - Upload to confirm the download
    • Click on **summit-techfest-datalake-notebook.ipynb ** to open the notebook
    • Make sure it says 'Sparkmagic (PySpark)' on top right part of the notebook, this is the name of the kernel Jupyter will use to execute code blocks in this notebook

Follow the instructions on the notebook - Read and understand the instructions, they explain important Glue concepts

Validate - Transformed / Processed data has arrived in S3

Once the ETL script has ran successfully. console:https://s3.console.aws.amazon.com/s3/home?region=us-east-1

  • Click - yourname-datalake-demo-bucket > data
  • There should be a folder called processed-data created here > Open it & ensure that .parquet files are created in this folder.

Part 4 : Analyze

Explore transformed data using Athena

In this step we will analyze the transformed data using Athena

Login to the Amazon Athena Console.

SELECT artist_name,
         count(artist_name) AS count
FROM processed_data
GROUP BY  artist_name
ORDER BY  count desc
  • Explore the Athena UI and try running some queries
  • This query returns the list of tracks repeatedly played by devices , we will later visualize this using QuickSight
SELECT device_id,
         track_name,
         count(track_name) AS count
FROM processed_data
GROUP BY  device_id, track_name
ORDER BY  count desc

Part 4: Visualize

Setting Up QuickSight

In this step we will visualize it using QuickSight

Login to Amazon Quick Sight Console & complete the registration & sign-up

Setting QuickSight Permissions

Adding a New Dataset

  • GoTo: https://us-east-1.quicksight.aws.amazon.com/sn/start
  • On top right, Click - Manage Data
    • Click - New Data Set
    • Click - Athena
    • New Athena data source
      • Data source name: summitdemo
      • Click - Validate Connection
        • This will check if your QuickSight can access Athena
        • If validation fails - click on your username on top-right >
      • Click - Create data source
    • Choose your table:
      • Database: contain sets of tables: select - summitdb
      • Tables: contain the data you can visualize : select - processed_data
      • Click - Select
    • Finish data set creation:
      • Select - Directly query your data
      • Click Visualize

Using Amazon Quick Sight to Visualize Our Processed Data

Visualization 1: Heat map of users and tracks they are listening to

In this step, we will create a visualization that show us which users are listening to repetitive tracks

  • On the bottom-left panel - Visual types
    • Hover on icon there to see names of the visualizations
    • Click on - Heat Map
  • On top-left panel - Fields list
    • Click - device_id
    • Click - track_name
  • Just above the visualization you should see Field wells : [Rows - device_id] [Columns - track_name]

If you hover on dark blue patches on the heatmap you will see that those particular users are listening to same track repeatedly.

Visualization 2: Tree map of most played Artist Names

In this step we will create a visualization that shows who are the host played artists

  • On top-left - Click on '+ Add' > Add Visual, this will add a new panel to the right pane
  • On the bottom-left panel - Visual types
    • Hover on icon there to see names of the visualizations
    • Click on - Tree Map
  • On top-left panel - Fields list
    • Click - artist_name

Play around and explore Amazon QuickSight Console. Try out filters, other visualization types, etc.

Clean Up

Failing to do this will result in incuring AWS usage charges.

Make sure you bring down / delete all resources created as part of this lab

Resources to delete