This project focuses on Natural Language Processing (NLP), an area of machine learning in which computers learn to analyze human language. The Kaggle competition can be found here: and the data was downloaded from
Twitter has become an important communication channel in times of emergency. The ubiquity of smartphones enables people to announce an emergency they're observing in real time. Because of this, more agencies (e.g. disaster relief organizations and news agencies) are interested in programmatically monitoring Twitter. For this challenge, we are building a machine learning model that predicts which Tweets are about real disasters and which ones aren't.
- We can see from the initial training data EDA that there are 5 columns and 7,613 observations
- It looks like 'text' (the input) and 'target' (the label) are the columns we'll focus on, so we can probably drop the other 3. There were also several null values in both 'keyword' and 'location'.
- Of the 7,613 'text' observations, 7,503 are unique, so 110 rows repeat text that already appears elsewhere. We can remove those duplicates
- More than half of the training tweets are NOT true disaster tweets
- We also have a test set of 4 columns (the same as the training set, minus the 'target' column) and 3,263 observations
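
The checks behind the bullets above can be sketched with pandas roughly as follows. This is a minimal illustration, not the exact notebook code: the helper name `summarize_tweets`, the `id` column, the file name `train.csv`, and the four-row sample are all assumptions made for the example.

```python
import pandas as pd


def summarize_tweets(df: pd.DataFrame) -> dict:
    """Collect basic EDA figures: shape, nulls, duplicate texts, class balance."""
    summary = {
        "shape": df.shape,                                      # (rows, columns)
        "nulls": df.isnull().sum().to_dict(),                   # null count per column
        "unique_text": int(df["text"].nunique()),               # distinct tweet texts
        "duplicate_text": int(df["text"].duplicated().sum()),   # repeats after the first occurrence
    }
    # The test set has no 'target' column, so only report class balance when present.
    if "target" in df.columns:
        summary["target_share"] = df["target"].value_counts(normalize=True).to_dict()
    return summary


# Tiny made-up sample standing in for the real training data (hypothetical rows).
sample = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "keyword": [None, "fire", "flood", None],
    "location": [None, "Texas", None, None],
    "text": [
        "Forest fire near town",
        "I love this song",
        "Flood warning issued",
        "I love this song",
    ],
    "target": [1, 0, 1, 0],
})

summary = summarize_tweets(sample)
print(summary)

# On the real data (file name is an assumption):
# train = pd.read_csv("train.csv")
# print(summarize_tweets(train))
# train = train.drop_duplicates(subset="text")  # keep the first copy of each tweet
```

Note that `drop_duplicates(subset="text")` keeps the first copy of each repeated tweet. If duplicated texts carry conflicting 'target' labels, that choice may deserve a closer look before dropping rows.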