This toolkit provides a framework to ingest, process, and store Twitter data. The toolkit leverages Twitter's new recent search API v2, allowing you to fetch Tweets from the last seven days that match a specific search query.
Follow the steps below to ingest Tweets into an AWS storage solution.
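For context, the toolkit's Lambda function fetches data from the `/2/tweets/search/recent` endpoint. As a rough illustration only (not the packaged code; the function name and use of the requests library are illustrative), a minimal recent search request in Python might look like this:

```python
import requests

# Recent search endpoint (Twitter API v2): returns matching Tweets
# from the last seven days.
SEARCH_URL = "https://api.twitter.com/2/tweets/search/recent"

def search_recent(query, bearer_token, max_results=10):
    """Fetch recent Tweets matching `query`, authenticated with a bearer token."""
    headers = {"Authorization": f"Bearer {bearer_token}"}
    params = {"query": query, "max_results": max_results}
    resp = requests.get(SEARCH_URL, headers=headers, params=params)
    resp.raise_for_status()
    return resp.json()

# Example: English-language Tweets about Apple devices, excluding retweets.
# tweets = search_recent("((ipad OR iphone) apple) -is:retweet lang:en", "YOUR_BEARER_TOKEN")
```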
This toolkit requires a Twitter developer account and access to the Twitter API. The following two levels of access are free of charge:
- Essential access gives you 500K Tweets/month.
- Elevated access gives you 2M Tweets/month.

This toolkit leverages AWS Lambda and RDS. Pricing information for AWS services can be found here.
- A Twitter Developer account: sign up here.
- A bearer token to authenticate your requests to the Twitter API: refer to this documentation.
- An AWS account (the free tier is sufficient): create an account here.
Start by cloning this repository locally.

Create an RDS MySQL database:
- Login to your AWS account and navigate to AWS CloudFormation.
- Select "Create stack".
- Select "Template is ready" and "Upload a template file".
- Using "Choose file", upload the file entitled
rds.yaml
(you can find this in the cloudformation directory of this GitHub repository, which you cloned locally). - Select "Next".
- Enter a stack name, for example "rds". Other parameter values will be automatically populated from the CloudFormation template you just uploaded. Select "Next".
- There is no need to change anything under "Configure stack options" or "Advanced options". Select "Next".
- Scroll to the bottom of the page and select "Create stack".
- The database is now being created. This will take several minutes to complete.
- Once the process is complete, make a note of your database endpoint. To find this:
- Navigate to the "Stacks" section of CloudFormation.
- Select the stack you just created: "rds".
- Select "Resources", then click on the Physical ID "mysqldbtweets".This will take you to the RDS Management Console.
- Scroll down to the "Connect" section and you will find the endpoint there. This will look something like:
mysqldbtweets.xxxxxxxxxxxx.us-east-1.rds.amazonaws.com
- Add a new inbound rule:
- Navigate to Amazon RDS, select "DB Instances", and click into your database "mysqldbtweets".
- In the "Connectivity & security" tab, under "Security", click on the default VPC security group.
- Select the "Inbound rules" tab.
- Click "Edit inbound rules" and "Add rule". Your new rule should have the following properties:
- Type: All traffic
- Source: IPv4 and 0.0.0.0/0
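If you prefer to script this step rather than click through the console, a boto3 sketch along the following lines should do the same thing. It assumes your AWS credentials are configured locally; the security group ID is a placeholder to replace with your own:

```python
import boto3

# Sketch: open the default VPC security group to all inbound traffic,
# mirroring the console steps above.
ec2 = boto3.client("ec2", region_name="us-east-1")
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # placeholder: your default VPC security group
    IpPermissions=[{
        "IpProtocol": "-1",                     # -1 = all traffic
        "IpRanges": [{"CidrIp": "0.0.0.0/0"}],  # any IPv4 source
    }],
)
```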
Before you move on to the next step, be sure to check that your DB instance has a Status of "Available". You can check this by navigating to RDS > DB Instances. Note that, once created, it can take up to 20 minutes for your database instance to become available for network connections.
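You can also check the status from code. A boto3 sketch (again assuming local AWS credentials) might look like this:

```python
import boto3

# Sketch: poll the instance status from code instead of the console.
rds = boto3.client("rds", region_name="us-east-1")
db = rds.describe_db_instances(DBInstanceIdentifier="mysqldbtweets")["DBInstances"][0]

print("Status:", db["DBInstanceStatus"])  # wait until this reads "available"
# The endpoint is only assigned once the instance is ready.
print("Endpoint:", db.get("Endpoint", {}).get("Address", "not yet assigned"))
```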
You're now ready to create tables in the database you just created:
- Navigate to your local version of the GitHub repository. Rename `event_data_creds_example.json` to `event_data_creds.json`.
- On line 3 of that file, add the endpoint URL for your database (the one you fetched in point 3 above). Make sure to add `event_data_creds.json` to your `.gitignore` file to avoid sharing your credentials.
- Install PyMySQL. You can install it with pip in your local command line:
$ python3 -m pip install PyMySQL
- In your local command line, navigate to the main aws-toolkit-recent-search directory, and run the following script:
$ python3 create_tables.py
- (Optional) You can download DBeaver Lite to connect to your database and view the tables you created.
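If you'd rather verify from Python that the tables exist, a small PyMySQL sketch like the one below should work. The key name read from `event_data_creds.json` is an assumption; adjust it to match your copy of the file. The user, password, and database name match the values used in the cURL command later in this guide:

```python
import json
import pymysql

# Sketch: connect to the database and list the tables created by create_tables.py.
with open("event_data_creds.json") as f:
    creds = json.load(f)

conn = pymysql.connect(
    host=creds["endpoint"],  # assumed key name: the RDS endpoint you added on line 3
    user="dbadmin",
    password="Test1230",
    database="searchtweetsdb",
)
with conn.cursor() as cur:
    cur.execute("SHOW TABLES")
    for (table,) in cur.fetchall():
        print(table)
conn.close()
```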
- Locate `lambda/script.zip` in your local version of the GitHub repository.
- Back in your AWS account, navigate to AWS Lambda.
- Select "Create function".
- Select "Author from scratch".
- Name the function "etl-recent-search".
- Under Runtime, select Python 3.9.
- Click on "Create Function".
- Select "Upload from" > ".zip file" and upload the `lambda/script.zip` file.
- Click "Save".
- Under "Configuration" > "General configuration", edit the Timeout to 15 minutes.
- Navigate to the AWS Lambda Functions page and select the "etl-recent-search" function you created in Step 2.
- Under "Configuration", select "Function URL" and create a new function URL.
- For auth type, select "NONE".
- Make a note of the function URL you just created. The URL format will be as follows: `https://<url-id>.lambda-url.<region>.on.aws`. This URL will be used in Step 4 to form a cURL command that can trigger the Lambda function to fetch Tweets and store them.
- Navigate to the AWS IAM Roles Console and create a new role with the following properties:
  - Trusted entity type – "AWS service"
  - Use case – Lambda
  - Permissions – AWSLambdaBasicExecutionRole
  - Role name – lambda-url-role
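The same role can be created from code. A boto3 sketch, assuming local AWS credentials with IAM permissions:

```python
import json
import boto3

# Sketch: create the "lambda-url-role" role from code instead of the console.
iam = boto3.client("iam")

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "lambda.amazonaws.com"},  # trusted entity: Lambda
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="lambda-url-role",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
iam.attach_role_policy(
    RoleName="lambda-url-role",
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole",
)
```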
- Build the following cURL command. Make sure to add your own details and credentials where relevant, specifically:
  - Replace `https://<url-id>.lambda-url.<region>.on.aws` with the function URL you generated in the step above.
  - The "query" line determines which Tweets are returned. You can edit this query to fetch Tweets of interest to you. Twitter's documentation contains details on how to build a search query.
  - Edit the start and end times to be within the last 7 days (if your start and end times are older than the last 7 days, the query will fail).
  - Add your Twitter bearer token. Twitter's documentation explains how to generate and find your bearer token.
  - Next to "endpoint", add the database endpoint you generated in Step 1.3 above.
  - If required, edit the region to reflect the region in which you created your database.
```bash
curl -X POST \
'https://<url-id>.lambda-url.<region>.on.aws' \
-H 'Content-Type: application/json' \
-d '{
    "query": "((ipad OR iphone) apple) -is:retweet lang:en",
    "max_results": 100,
    "start_time": "2022-08-22T13:00:00Z",
    "end_time": "2022-08-22T13:30:00Z",
    "bearer_token": "XXXXX",
    "endpoint": "XXXXX",
    "user": "dbadmin",
    "region": "us-east-1a",
    "dbname": "searchtweetsdb",
    "password": "Test1230"
}'
```
- In your local command line, run this cURL command. This will trigger the AWS Lambda function you deployed in Step 2, connect to the Twitter API to fetch Tweets of interest, and store these in the database you created in Step 1.
Please note: the cURL command might take a while to run if you are fetching large amounts of data. Anything that takes longer than 15 minutes will automatically time out. If this happens, try reducing the time period between your "start_time" and "end_time".
If you run into any errors, you may want to check the logs to troubleshoot the cause of the issue. You can find these under "Lambda" > "Functions" > "etl-recent-search" > "Monitor" > "logs". There you’ll find a more verbose description of the error.
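If you'd rather trigger the function from Python than from the shell, a roughly equivalent call using the requests library might look like this. It is a sketch using the same placeholder values as the cURL command above; replace the URL and the XXXXX fields with your own:

```python
import requests

# Sketch: the same invocation as the cURL command above, from Python.
payload = {
    "query": "((ipad OR iphone) apple) -is:retweet lang:en",
    "max_results": 100,
    "start_time": "2022-08-22T13:00:00Z",
    "end_time": "2022-08-22T13:30:00Z",
    "bearer_token": "XXXXX",   # your Twitter bearer token
    "endpoint": "XXXXX",       # your RDS endpoint from Step 1.3
    "user": "dbadmin",
    "region": "us-east-1a",
    "dbname": "searchtweetsdb",
    "password": "Test1230",
}
# Replace with your own function URL before running.
resp = requests.post("https://<url-id>.lambda-url.<region>.on.aws", json=payload)
print(resp.status_code, resp.text)
```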
Connect to your database to view the stored data. You can use DBeaver Lite to do this.
This toolkit is intended as an example framework that quickly fetches, parses, and stores Twitter data.
The following data objects are extracted and persisted:
- Tweets (including hashtags, cashtags, annotations, mentions, URLs)
- Users
The following data objects will not be persisted:
- Media
- Polls
- Places
- Spaces
- Lists