This toolkit provides a framework to ingest, process, and store Twitter data. The toolkit leverages Twitter's new recent search API v2, allowing you to fetch Tweets from the last seven days that match a specific search query.
Follow the steps below to ingest Tweets into an AWS storage solution.
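For context, the toolkit's Lambda function fetches data from the `/2/tweets/search/recent` endpoint. As a rough illustration only (not the packaged code; the function name and use of the requests library are illustrative), a minimal recent search request in Python might look like this:

```python
import requests

# Recent search endpoint (Twitter API v2): returns matching Tweets
# from the last seven days.
SEARCH_URL = "https://api.twitter.com/2/tweets/search/recent"

def search_recent(query, bearer_token, max_results=10):
    """Fetch recent Tweets matching `query`, authenticated with a bearer token."""
    headers = {"Authorization": f"Bearer {bearer_token}"}
    params = {"query": query, "max_results": max_results}
    resp = requests.get(SEARCH_URL, headers=headers, params=params)
    resp.raise_for_status()
    return resp.json()

# Example: English-language Tweets about Apple devices, excluding retweets.
# tweets = search_recent("((ipad OR iphone) apple) -is:retweet lang:en", "YOUR_BEARER_TOKEN")
```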
This toolkit requires a Twitter developer account and access to the Twitter API. The following two levels of access are free of charge:
- Essential access gives you 500K Tweets/month.
- Elevated access gives you 2M Tweets/month.

This toolkit leverages AWS Lambda and RDS. Pricing information for AWS services can be found here.
- A Twitter Developer account: sign up here.
- A bearer token to authenticate your requests to the Twitter API: refer to this documentation.
- An AWS account (the free tier is sufficient): create an account here.
Start by cloning this repository locally.

Create an RDS MySQL database:
- Login to your AWS account and navigate to AWS CloudFormation.
- Select "Create stack".
- Select "Template is ready" and "Upload a template file".
- Using "Choose file", upload the file entitled
rds.yaml
(you can find this in the cloudformation directory of this GitHub repository, which you cloned locally). - Select "Next".
- Enter a stack name, for example "rds". Other parameter values will be automatically populated from the CloudFormation template you just uploaded. Select "Next".
- There is no need to change anything under "Configure stack options" or "Advanced options". Select "Next".
- Scroll to the bottom of the page and select "Create stack".
- The database is now being created. This will take several minutes to complete.
- Once the process is complete, make a note of your database endpoint. To find this:
- Navigate to the "Stacks" section of CloudFormation.
- Select the stack you just created: "rds".
- Select "Resources", then click on the Physical ID "mysqldbtweets".This will take you to the RDS Management Console.
- Scroll down to the "Connect" section and you will find the endpoint there. This will look something like:
mysqldbtweets.xxxxxxxxxxxx.us-east-1.rds.amazonaws.com
- Add a new inbound rule:
- Navigate to Amazon RDS, select "DB Instances", and click into your database "mysqldbtweets".
- In the "Connectivity & security" tab, under "Security", click on the default VPC security group.
- Select the "Inbound rules" tab.
- Click "Edit inbound rules" and "Add rule". Your new rule should have the following properties:
- Type: All traffic
- Source: IPv4 and 0.0.0.0/0
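If you prefer to script this step rather than click through the console, a boto3 sketch along the following lines should do the same thing. It assumes your AWS credentials are configured locally; the security group ID is a placeholder to replace with your own:

```python
import boto3

# Sketch: open the default VPC security group to all inbound traffic,
# mirroring the console steps above.
ec2 = boto3.client("ec2", region_name="us-east-1")
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # placeholder: your default VPC security group
    IpPermissions=[{
        "IpProtocol": "-1",                     # -1 = all traffic
        "IpRanges": [{"CidrIp": "0.0.0.0/0"}],  # any IPv4 source
    }],
)
```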
Before you move on to the next step, be sure to check that your DB instance has a Status of "Available". You can check this by navigating to RDS > DB Instances. Note that, once created, it can take up to 20 minutes for your database instance to become available for network connections.
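You can also check the status from code. A boto3 sketch (again assuming local AWS credentials) might look like this:

```python
import boto3

# Sketch: poll the instance status from code instead of the console.
rds = boto3.client("rds", region_name="us-east-1")
db = rds.describe_db_instances(DBInstanceIdentifier="mysqldbtweets")["DBInstances"][0]

print("Status:", db["DBInstanceStatus"])  # wait until this reads "available"
# The endpoint is only assigned once the instance is ready.
print("Endpoint:", db.get("Endpoint", {}).get("Address", "not yet assigned"))
```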
You're now ready to create tables in the database you just created:
- Navigate to your local version of the GitHub repository. Rename `event_data_creds_example.json` to `event_data_creds.json`.
- On line 3 of that file, add the endpoint URL for your database (the one you fetched in point 3 above). Make sure to add `event_data_creds.json` to your `.gitignore` file to avoid sharing your credentials.
- Install PyMySQL. You can install it with pip in your local command line:
$ python3 -m pip install PyMySQL
- In your local command line, navigate to the main aws-toolkit-recent-search directory, and run the following script:
$ python3 create_tables.py
- (Optional) You can download DBeaver Lite to connect to your database and view the tables you created.
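If you'd rather verify from Python that the tables exist, a small PyMySQL sketch like the one below should work. The key name read from `event_data_creds.json` is an assumption; adjust it to match your copy of the file. The user, password, and database name match the values used in the cURL command later in this guide:

```python
import json
import pymysql

# Sketch: connect to the database and list the tables created by create_tables.py.
with open("event_data_creds.json") as f:
    creds = json.load(f)

conn = pymysql.connect(
    host=creds["endpoint"],  # assumed key name: the RDS endpoint you added on line 3
    user="dbadmin",
    password="Test1230",
    database="searchtweetsdb",
)
with conn.cursor() as cur:
    cur.execute("SHOW TABLES")
    for (table,) in cur.fetchall():
        print(table)
conn.close()
```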
- Locate `lambda/script.zip` in your local version of the GitHub repository.
- Back in your AWS account, navigate to AWS Lambda.
- Select "Create function".
- Select "Author from scratch".
- Name the function "etl-recent-search".
- Under Runtime, select Python 3.9.
- Click on "Create Function".
- Select "Upload from" > ".zip file" and upload the `lambda/script.zip` file.
- Click "Save".
- Under "Configuration" > "General configuration", edit the Timeout to 15 minutes.
- Navigate to the AWS Lambda Functions page and select the "etl-recent-search" function you created in Step 2.
- Under "Configuration", select "Function URL" and create a new function URL.
- For auth type, select "NONE".
- Make a note of the function URL you just created. The URL format will be as follows: `https://<url-id>.lambda-url.<region>.on.aws`. This URL will be used in Step 4 to form a cURL command that can trigger the Lambda function to fetch Tweets and store them.
- Navigate to the AWS IAM Roles Console and create a new role with the following properties:
  - Trusted entity type – "AWS service"
  - Use case – Lambda
  - Permissions – AWSLambdaBasicExecutionRole
  - Role name – lambda-url-role
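The same role can be created from code. A boto3 sketch, assuming local AWS credentials with IAM permissions:

```python
import json
import boto3

# Sketch: create the "lambda-url-role" role from code instead of the console.
iam = boto3.client("iam")

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "lambda.amazonaws.com"},  # trusted entity: Lambda
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="lambda-url-role",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
iam.attach_role_policy(
    RoleName="lambda-url-role",
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole",
)
```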
- Build the following cURL command. Make sure to add your own details and credentials where relevant, specifically:
  - Replace `https://<url-id>.lambda-url.<region>.on.aws` with the function URL you generated in the step above.
  - The "query" line determines which Tweets are returned. You can edit this query to fetch Tweets of interest to you. Twitter's documentation contains details on how to build a search query.
  - Edit the start and end times to be within the last 7 days (if your start and end times are older than the last 7 days, the query will fail).
  - Add your Twitter bearer token. Twitter's documentation explains how to generate and find your bearer token.
  - Next to "endpoint", add the database endpoint you generated in Step 1.3 above.
  - If required, edit the region to reflect the region in which you created your database.
```bash
curl -X POST \
'https://<url-id>.lambda-url.<region>.on.aws' \
-H 'Content-Type: application/json' \
-d '{
    "query": "((ipad OR iphone) apple) -is:retweet lang:en",
    "max_results": 100,
    "start_time": "2022-08-22T13:00:00Z",
    "end_time": "2022-08-22T13:30:00Z",
    "bearer_token": "XXXXX",
    "endpoint": "XXXXX",
    "user": "dbadmin",
    "region": "us-east-1a",
    "dbname": "searchtweetsdb",
    "password": "Test1230"
}'
```
- In your local command line, run this cURL command. This will trigger the AWS Lambda function you deployed in Step 2, connect to the Twitter API to fetch Tweets of interest, and store these in the database you created in Step 1.
Please note: the cURL command might take a while to run if you are fetching large amounts of data. Anything that takes longer than 15 minutes will automatically time out. If this happens, try reducing the time period between your "start_time" and "end_time".
If you run into any errors, you may want to check the logs to troubleshoot the cause of the issue. You can find these under "Lambda" > "Functions" > "etl-recent-search" > "Monitor" > "logs". There you’ll find a more verbose description of the error.
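If you'd rather trigger the function from Python than from the shell, a roughly equivalent call using the requests library might look like this. It is a sketch using the same placeholder values as the cURL command above; replace the URL and the XXXXX fields with your own:

```python
import requests

# Sketch: the same invocation as the cURL command above, from Python.
payload = {
    "query": "((ipad OR iphone) apple) -is:retweet lang:en",
    "max_results": 100,
    "start_time": "2022-08-22T13:00:00Z",
    "end_time": "2022-08-22T13:30:00Z",
    "bearer_token": "XXXXX",   # your Twitter bearer token
    "endpoint": "XXXXX",       # your RDS endpoint from Step 1.3
    "user": "dbadmin",
    "region": "us-east-1a",
    "dbname": "searchtweetsdb",
    "password": "Test1230",
}
# Replace with your own function URL before running.
resp = requests.post("https://<url-id>.lambda-url.<region>.on.aws", json=payload)
print(resp.status_code, resp.text)
```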
Connect to your database to view the stored data. You can use DBeaver Lite to do this.
This toolkit is intended as an example framework that quickly fetches, parses, and stores Twitter data.
The following data objects are extracted and persisted:
- Tweets (including hashtags, cashtags, annotations, mentions, URLs)
- Users
The following data objects will not be persisted:
- Media
- Polls
- Places
- Spaces
- Lists