Automated Bucketing of Streaming Data

This repository accompanies the Automating bucketing of streaming data using Amazon Athena and AWS Lambda blog post. It contains an AWS Serverless Application Model (AWS SAM) template that deploys two AWS Lambda functions: LoadPartition and Bucketing.

The LoadPartition function runs every hour, detects the new folder created under the /raw prefix, and loads that folder as a new partition of SourceTable.
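As a rough illustration, the partition-loading step can be expressed with boto3 and Athena as in the minimal sketch below. This is not the shipped function: the bucket, database, and table names (my-data-lake, my_database, source_table) are placeholders, and the real function under functions/ takes its configuration from the deployment parameters.

import boto3
from datetime import datetime, timedelta

athena = boto3.client("athena")

def handler(event, context):
    # Folders follow the flat dt=YYYY-mm-dd-HH layout, so the previous
    # hour's folder name can be computed from the clock.
    dt = (datetime.utcnow() - timedelta(hours=1)).strftime("%Y-%m-%d-%H")
    athena.start_query_execution(
        QueryString=(
            "ALTER TABLE source_table ADD IF NOT EXISTS "
            f"PARTITION (dt = '{dt}') "
            f"LOCATION 's3://my-data-lake/raw/dt={dt}/'"
        ),
        QueryExecutionContext={"Database": "my_database"},
        ResultConfiguration={"OutputLocation": "s3://my-data-lake/athena_results/"},
    )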

The Bucketing function runs every hour and copies the previous hour's data from /raw to /curated by executing a CREATE TABLE AS SELECT (CTAS) query on Amazon Athena. The copied data lands in a new sub-folder under /curated, which the function then loads as a partition of TargetTable.
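The CTAS pattern at the heart of this step looks roughly like the sketch below. Athena CTAS cannot write into an existing table, so the query creates a temporary table whose external_location is the new hourly folder under /curated; dropping the temporary table afterwards leaves the data files in place, ready to be registered as a partition of TargetTable. All names, the column list, and the bucket count here are placeholders.

import boto3
from datetime import datetime, timedelta

athena = boto3.client("athena")

def handler(event, context):
    dt = (datetime.utcnow() - timedelta(hours=1)).strftime("%Y-%m-%d-%H")
    # Bucket the previous hour's raw data into Parquet files under /curated.
    # The dt column is excluded from the SELECT because it is a partition
    # key of TargetTable, not a data column.
    ctas = f"""
    CREATE TABLE tmp_bucketing_{dt.replace('-', '_')}
    WITH (
        external_location = 's3://my-data-lake/curated/dt={dt}/',
        format = 'PARQUET',
        bucketed_by = ARRAY['customer_id'],
        bucket_count = 16
    ) AS
    SELECT customer_id, payload
    FROM source_table
    WHERE dt = '{dt}'
    """
    athena.start_query_execution(
        QueryString=ctas,
        QueryExecutionContext={"Database": "my_database"},
        ResultConfiguration={"OutputLocation": "s3://my-data-lake/athena_results/"},
    )
    # The real function then waits for the query to finish, drops the
    # temporary table, and adds the new folder as a partition of TargetTable.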

├── README.MD <-- These instructions

└── functions <-- The two Lambda functions used to bucket streaming data

Requirements

  • AWS CLI already configured with Administrator permission
  • Source and target tables created in Athena
  • Streaming data being written to an Amazon S3 bucket, partitioned hourly as dt=YYYY-mm-dd-HH (see the layout example after this list)
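For example, with the /raw prefix, an hourly layout looks like this (bucket name, dates, and file names are placeholders):

s3://<s3_bucket_name>/raw/dt=2024-06-01-09/data-file-1.json
s3://<s3_bucket_name>/raw/dt=2024-06-01-10/data-file-1.json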

Installation Instructions

  1. Install the AWS SAM CLI if you do not have it.

  2. Clone the repo onto your local development machine using git clone https://github.com/aws-samples/automated-bucketing-of-streaming-data.git.

  3. [OPTIONAL] The Lambda functions included here work on data that is partitioned on an hourly basis, using a flat partition layout of the form dt=YYYY-mm-dd-HH. If your data has a different structure, edit the Lambda functions accordingly.

  4. From the command line, change directory to the SAM template directory, then build and deploy:


sam build

sam deploy --guided

Follow the prompts in the deploy process to set the stack name, AWS Region, and other parameters.

Parameter Details

  • S3BucketName: The name of the data lake S3 bucket for this application
  • CuratedKeyPrefix: Prefix of the new bucketed files written by the Bucketing function. This is the Amazon S3 location of TargetTable without 's3://<s3_bucket_name>'. Do not add a trailing slash. For example, /curated
  • AthenaResultLocation: Full S3 location where Athena stores query results. For example, s3://<s3_bucket_name>/athena_results
  • DatabaseName: The Data Catalog database that holds SourceTable and TargetTable
  • SourceTableName: The name of the source table that points to the raw data
  • TargetTableName: The name of the target table that points to the curated data
  • BucketingKey: The column used as the bucketing key. The solution supports a single bucketing key; to add more, edit the Lambda function.
  • BucketCount: The number of Hive buckets to create within a partition. This must be the same number that was used when creating TargetTable.
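For illustration, a guided deployment over the example layout above might use values like these (every value is a placeholder):

S3BucketName: my-data-lake
CuratedKeyPrefix: /curated
AthenaResultLocation: s3://my-data-lake/athena_results
DatabaseName: my_database
SourceTableName: source_table
TargetTableName: target_table
BucketingKey: customer_id
BucketCount: 16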

How it works

  • Start writing streaming data to the S3 bucket.

  • Create SourceTable and TargetTable in Athena (a DDL sketch follows this list).

  • About an hour after the SAM deployment, you will see new data written under CuratedKeyPrefix. The data is bucketed and can be queried from TargetTable in Athena.
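The exact DDL depends on your schema, but a minimal sketch of the two tables might look like the following, assuming a hypothetical two-column schema (customer_id, payload), JSON raw data, and the placeholder names used elsewhere in this README. Note that TargetTable's CLUSTERED BY column and bucket count must match the BucketingKey and BucketCount parameters.

import boto3

athena = boto3.client("athena")

def run_ddl(query):
    athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": "my_database"},
        ResultConfiguration={"OutputLocation": "s3://my-data-lake/athena_results/"},
    )

# SourceTable: raw, hourly-partitioned streaming data (JSON assumed here).
run_ddl("""
CREATE EXTERNAL TABLE IF NOT EXISTS source_table (
    customer_id string,
    payload string
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-data-lake/raw/'
""")

# TargetTable: same schema, bucketed on the BucketingKey into BucketCount
# buckets (customer_id and 16 in this sketch), stored as Parquet.
run_ddl("""
CREATE EXTERNAL TABLE IF NOT EXISTS target_table (
    customer_id string,
    payload string
)
PARTITIONED BY (dt string)
CLUSTERED BY (customer_id) INTO 16 BUCKETS
STORED AS PARQUET
LOCATION 's3://my-data-lake/curated/'
""")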

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.
