Automate Bucketing of Streaming Data

This repository accompanies the blog post "Automate Bucketing of Streaming Data using Amazon Athena and AWS Lambda". It contains an AWS Serverless Application Model (AWS SAM) template that deploys two AWS Lambda functions: LoadPartiton and Bucketing.

The first function runs every hour, reads the new folder created under /raw, and loads that folder as a new partition of SourceTable.
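As a rough illustration of the partition-loading step, the sketch below builds the kind of Athena DDL statement such a function could submit. It assumes the partition column is named dt and that raw data sits under s3://<bucket>/raw/dt=.../; the helper name, table name and bucket name are hypothetical and are not taken from the repository's code.

```python
from datetime import datetime, timezone


def add_partition_statement(table: str, bucket: str, now: datetime) -> str:
    """Build Athena DDL that registers the current hour's /raw folder as a partition."""
    # Partition value follows the flat hourly layout dt=YYYY-mm-dd-HH.
    dt = now.strftime("%Y-%m-%d-%H")
    # Assumed folder layout: s3://<bucket>/raw/dt=<value>/
    location = f"s3://{bucket}/raw/dt={dt}/"
    # IF NOT EXISTS keeps the statement safe to re-run within the same hour.
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (dt = '{dt}') LOCATION '{location}'"
    )


stmt = add_partition_statement(
    "sourcetable", "example-bucket", datetime(2020, 5, 17, 9, tzinfo=timezone.utc)
)
print(stmt)
```

In a Lambda function, a string like this would typically be passed to Athena's StartQueryExecution API together with the database name and the query result location.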

The second function runs every hour and copies the previous hour's data from /raw to /curated using Create Table As Select (CTAS). The copied data lands in a new sub-folder under /curated, which the function then loads as a new partition of TargetTable.
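The CTAS step can be sketched as a query builder like the one below. This is an illustration under stated assumptions, not the repository's implementation: the Parquet output format, the temporary table name, the explicit column list and all example values are hypothetical. The WITH properties (format, external_location, bucketed_by, bucket_count) are standard Athena CTAS table properties.

```python
from datetime import datetime, timedelta, timezone
from typing import List


def ctas_statement(database: str, source_table: str, columns: List[str],
                   bucket: str, curated_prefix: str, bucketing_key: str,
                   bucket_count: int, now: datetime) -> str:
    """Build a CTAS query that rewrites the previous hour's raw data as bucketed files."""
    previous_hour = now - timedelta(hours=1)
    dt = previous_hour.strftime("%Y-%m-%d-%H")
    # Hypothetical temporary table name; CTAS needs a table that does not exist yet.
    temp_table = "ctas_" + previous_hour.strftime("%Y_%m_%d_%H")
    # New sub-folder under the curated prefix, e.g. s3://<bucket>/curated/dt=<value>/
    location = f"s3://{bucket}{curated_prefix}/dt={dt}/"
    # Select only data columns so dt stays a partition key rather than file content.
    select_list = ", ".join(columns)
    # The output format below is an assumption, not something the README states.
    return (
        f"CREATE TABLE {database}.{temp_table} "
        f"WITH (format = 'PARQUET', "
        f"external_location = '{location}', "
        f"bucketed_by = ARRAY['{bucketing_key}'], "
        f"bucket_count = {bucket_count}) "
        f"AS SELECT {select_list} FROM {database}.{source_table} "
        f"WHERE dt = '{dt}'"
    )


query = ctas_statement(
    "mydb", "sourcetable", ["id", "payload"], "example-bucket", "/curated",
    "id", 16, datetime(2020, 5, 17, 0, 15, tzinfo=timezone.utc),
)
print(query)
```

Note that the example runs at 00:15 and therefore targets the last hour of the previous day. After a query like this completes, the new folder still has to be added as a partition of TargetTable, as described above.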

├── README.MD <-- This instructions file

├── functions <-- Two lambda functions used to bucket streaming data

Requirements

AWS CLI already configured with Administrator permission
Source and target tables created in Athena
Streaming data being written to an Amazon S3 bucket, partitioned like this: dt=YYYY-mm-dd-HH
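To make the expected partition layout concrete, here is a small sketch that converts between a timestamp and a dt=YYYY-mm-dd-HH folder name. The function names are hypothetical and only illustrate the naming convention.

```python
from datetime import datetime

# strftime/strptime pattern matching dt=YYYY-mm-dd-HH
PARTITION_FORMAT = "%Y-%m-%d-%H"


def partition_prefix(ts: datetime) -> str:
    """Return the folder name a record written at ts should land in."""
    return f"dt={ts.strftime(PARTITION_FORMAT)}/"


def parse_partition(prefix: str) -> datetime:
    """Recover the hour a folder such as 'dt=2021-01-05-07/' represents."""
    value = prefix.rstrip("/").split("=", 1)[1]
    return datetime.strptime(value, PARTITION_FORMAT)


print(partition_prefix(datetime(2021, 1, 5, 7, 30)))
print(parse_partition("dt=2021-01-05-07/"))
```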

Installation Instructions

Install SAM CLI if you do not have it.
Clone the repo onto your local development machine using git clone.
The Lambda functions included here work on data that is partitioned hourly, using a flat partition layout of the form dt=YYYY-mm-dd-HH. If your data has a different structure, edit the Lambda functions accordingly.
From the command line, change directory to the SAM template directory and run:


sam build

sam deploy --guided

Follow the prompts in the deploy process to set the stack name, AWS Region and other parameters.

Parameter Details

S3BucketName: Name of the data lake S3 bucket for this application
CuratedKeyPrefix: Prefix of the new bucketed files written by Function2 (the Bucketing function). This is the Amazon S3 location of TargetTable without 's3://<s3_bucket_name>'. Do not add a trailing slash. For example, /curated
AthenaResultLocation: Full S3 location where Athena will store query results. For example, s3://<s3_bucket_name>/athena_results
DatabaseName: Data Catalog Database name that holds SourceTable and TargetTable
SourceTableName: Source Table Name that points to raw data
TargetTableName: Target Table name that points to curated data
BucketingKey: The column used as the bucketing key. The solution supports a single bucketing key; to add more, edit the Lambda function.
BucketCount: Number of hive buckets to create within a partition. This has to be the same number that was used when creating TargetTable.
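The constraints described above can be checked before running sam deploy. The sketch below is a hypothetical pre-flight check and is not part of the SAM template; the class and method names are assumptions, and only the parameter rules come from this README.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DeployParams:
    s3_bucket_name: str
    curated_key_prefix: str
    athena_result_location: str
    bucket_count: int

    def problems(self) -> List[str]:
        """Return a list of rule violations; an empty list means the values look valid."""
        issues = []
        if self.curated_key_prefix.endswith("/"):
            issues.append("CuratedKeyPrefix must not end with a slash")
        if self.curated_key_prefix.startswith("s3://"):
            issues.append("CuratedKeyPrefix must not include s3://<s3_bucket_name>")
        if not self.athena_result_location.startswith("s3://"):
            issues.append("AthenaResultLocation must be a full s3:// location")
        if self.bucket_count < 1:
            issues.append("BucketCount must be a positive integer")
        return issues

    def target_location(self) -> str:
        """Full S3 location of TargetTable built from the bucket name and prefix."""
        return f"s3://{self.s3_bucket_name}/{self.curated_key_prefix.strip('/')}"


good = DeployParams("example-bucket", "/curated",
                    "s3://example-bucket/athena_results", 16)
bad = DeployParams("example-bucket", "s3://example-bucket/curated/",
                   "example-bucket/athena_results", 0)
print(good.problems(), good.target_location())
print(bad.problems())
```

This check cannot confirm that BucketCount matches the value used when TargetTable was created; that still has to be verified against the table definition.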

How it works

Start writing streaming data to the S3 bucket
Create SourceTable and TargetTable in Athena
About an hour after the SAM deployment, you will see new data written under CuratedKeyPrefix. The data will be bucketed and can be queried from TargetTable in Athena.

==============================================

SPDX-License-Identifier: MIT-0
