Twitter Data Stream

In this project, we will follow the workflow below.

workflow-diagram

Version information for the modules we used:

  • Tweepy v4.4.0
  • PyMongo v4.0.1
  • Pandas v1.3.4
  • Jupyter Notebook v6.4.5

You should install these modules with the corresponding versions. Secondly, in order to fetch tweets, we need a developer account. You can apply for one on this page. I applied as a student because I’m trying to learn how this service works, and after the application they sent me an email asking about my intended usage. After I explained this project, they activated my account within 3–4 days. Lastly, to write the data we collect, we need a database on MongoDB. Go to this page and create a new account if you don’t have one yet. You can follow the first 11 minutes of this video to create it.
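For example, you can pin those versions with pip (assuming the standard package names; Jupyter Notebook is installed via the `notebook` package):

```bash
pip install tweepy==4.4.0 pymongo==4.0.1 pandas==1.3.4 notebook==6.4.5
```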

Getting Connection Keys

Let’s start by preparing the tokens for the Twitter and MongoDB connections.

Getting Twitter Connection Keys

  • Go to the Projects & Apps section of the Twitter Developer Portal.
  • Create a new application; you need to give it a unique name.
  • After creating your application, it will give you four different tokens: API Key, API Secret, Access Token, and Access Secret. Note them down somewhere.

Getting MongoDB Connection Keys

  • Create a project and a cluster inside it. Wait a little bit; creating a new cluster takes a while.
  • Click the Connect button of the cluster and select ‘Connect your application’.
  • Select your Python version (you can use the python --version command in the terminal) and copy the connection string in the second section. Note it in the same place you used before, and don’t forget to replace the <password> placeholder in the string with your actual MongoDB password.

Explaining the Code

Now we can start coding. Begin with the basic imports:

code-1
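The screenshot isn’t reproduced here, but based on the modules listed above, the imports are presumably along these lines:

```python
import json

import pandas as pd
import pymongo
import tweepy
```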

Then set the tokens we will use; don’t share your secret keys with anyone:

code-2
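Something like the following, using the four tokens you noted earlier (the variable names here are placeholders of my own choosing):

```python
# Paste the four tokens from your Twitter application here.
API_KEY = "YOUR_API_KEY"
API_SECRET = "YOUR_API_SECRET"
ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"
ACCESS_SECRET = "YOUR_ACCESS_SECRET"
```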

Now we need a client connected to MongoDB; we create the connection like this:

code-3

You need to replace ‘connection string’ with your own. In the second line, we create a new database named ‘demo’, and in the third line a collection named tweet_collection. Lastly, we create an index named id on our tweet collection.
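Put together, the cell described above would look roughly like this:

```python
client = pymongo.MongoClient("connection string")           # paste your own connection string
db = client.demo                                            # create/select the 'demo' database
tweet_collection = db.tweet_collection                      # create/select the collection
tweet_collection.create_index([("id", pymongo.ASCENDING)])  # index on the tweet 'id' field
```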

code-4

Now we define a listener class for listening to the Twitter API and fetching data. This stream can be improved by adding more functionality and preprocessing steps. Tweepy’s Stream class has its own built-in methods such as on_closed, on_connect, on_connection_error, on_data, etc., so it is worth knowing these states of the stream object.
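As a sketch, a minimal listener that just inserts each incoming tweet into the collection could look like this (in Tweepy v4 you subclass `Stream` directly, and `on_data` receives the raw JSON payload; the class name is my own):

```python
class TweetListener(tweepy.Stream):
    """Minimal stream listener: store every incoming tweet in MongoDB."""

    def on_data(self, raw_data):
        tweet = json.loads(raw_data)        # raw_data is one tweet as a JSON string
        tweet_collection.insert_one(tweet)  # write it to the collection

    def on_connection_error(self):
        self.disconnect()                   # stop streaming on connection problems
```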

code-5

When you run this cell, the stream will start fetching tweets immediately and will keep fetching until you stop it. MongoDB’s free tier gives us limited storage, so it is better to stop the stream after 25–30 seconds. In the second line, we filter the data: as you can see, we are looking for the ‘bbc’ keyword in tweets.
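A sketch consistent with that description, using the token variables from earlier:

```python
stream = TweetListener(API_KEY, API_SECRET, ACCESS_TOKEN, ACCESS_SECRET)
stream.filter(track=["bbc"])  # blocks and streams tweets matching the 'bbc' keyword
```

You can stop it by interrupting the notebook cell or by calling stream.disconnect().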

code-6

These two queries can be used to see the number of tweets and the number of distinct users.
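For example (assuming the v1.1 tweet format, where the author sits under the `user` key):

```python
print(tweet_collection.estimated_document_count())  # total number of tweets stored
print(len(tweet_collection.distinct("user.id")))    # number of distinct users
```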

code-7

We create an index on our collection to make searches faster; the ‘text’ parameter is the key name coming from the Twitter API. You can check the incoming keys here.
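Creating a text index on that field might look like this:

```python
tweet_collection.create_index([("text", pymongo.TEXT)])  # full-text index on tweet text
```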

code-8

Now we build a cursor list: this finds the documents whose text includes the ‘covid’ string and appends them to the list.
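One way to write that, assuming the search goes through the text index created above:

```python
cursor_list = []
for tweet in tweet_collection.find({"$text": {"$search": "covid"}}):
    cursor_list.append(tweet)  # collect every matching document
```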

code-9

We can briefly print the cursor data like this. This is the endpoint of the streaming pipeline, so you can use the data anywhere you want.
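For instance, printing just the text of the first few matches:

```python
for tweet in cursor_list[:5]:
    print(tweet["text"])  # show the tweet text only
```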

End of Stream

Now we have our tweet data. You can use any tool or processing method you want. We won’t do anything detailed with Pandas, but I wanted to show you one more thing: you can use the data directly through Pandas, as you see below.

code-10
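The Pandas step presumably looks something like this (the column names assume the v1.1 tweet format):

```python
df = pd.DataFrame(list(tweet_collection.find()))  # load the whole collection
df[["created_at", "text"]].head()                 # peek at a few fields
```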

Conclusion

Collecting data and preparing data pipelines are really important jobs these days. For a data engineer, data streaming is one of the most in-demand skills for getting data ready to work on. There are many data APIs other than Twitter’s, such as CoinMarketCap for cryptocurrency, Facebook for FB posts, and YouTube for YT videos. I’ll try to use and explain different APIs in the coming weeks, so see you all next time 🤞

Medium: https://cagriyonca.medium.com/fetching-data-from-twitter-to-mongodb-9cbf55a24361
