This project focuses on Natural Language Processing (NLP), an area of machine learning concerned with getting computers to learn from and analyze human language. The Kaggle competition can be found here: https://www.kaggle.com/c/nlp-getting-started/overview and the data was downloaded from https://www.kaggle.com/c/nlp-getting-started/data
Twitter has become an important communication channel in times of emergency. The ubiquity of smartphones enables people to announce an emergency they’re observing in real time. Because of this, more agencies (e.g. disaster relief organizations and news agencies) are interested in programmatically monitoring Twitter. For this challenge, we are building a machine learning model that predicts which Tweets are about real disasters and which ones aren’t.
- We can see from the initial training data EDA that there are 5 columns and 7,613 observations
- It looks like 'text' (the input) and 'target' (the label) are the columns we'll focus on, so we can probably drop the other 3. Both 'keyword' and 'location' also contain null values.
- Of the 7,613 'text' observations, 7,503 are unique, so 110 are duplicates that we can remove
- More than half of the training tweets are NOT about real disasters, so non-disaster is the majority class
- We also have a test set with 4 columns (the same as the training set, minus the 'target' label) and 3,263 observations
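
The checks behind these observations can be sketched with pandas roughly as follows. This is a minimal illustration, not necessarily the exact code used for the EDA: the helper name `summarize_tweets` and the toy DataFrame are hypothetical stand-ins, and the column names (`id`, `keyword`, `location`, `text`, `target`) are assumed to match the competition's training file.

```python
import pandas as pd


def summarize_tweets(df: pd.DataFrame, text_col: str = "text",
                     target_col: str = "target") -> dict:
    """Collect basic EDA numbers (hypothetical helper, for illustration only)."""
    summary = {
        "n_rows": len(df),
        "n_cols": df.shape[1],
        # Null count per column, e.g. for 'keyword' and 'location'
        "nulls": df.isnull().sum().to_dict(),
        # Rows whose text repeats an earlier row (total rows minus unique texts)
        "n_duplicate_text": int(df[text_col].duplicated().sum()),
    }
    if target_col in df.columns:
        # Fraction of rows in each class; the test set has no target column
        summary["class_share"] = df[target_col].value_counts(normalize=True).to_dict()
    return summary


# Toy stand-in for the training data (assumed column layout, made-up rows)
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "keyword": [None, "fire", "flood", None, "fire"],
    "location": ["NYC", None, None, None, "LA"],
    "text": [
        "Forest fire near town",
        "I love this song",
        "Flood warning issued",
        "I love this song",  # duplicate of the second row's text
        "Best day ever",
    ],
    "target": [1, 0, 1, 0, 0],
})

toy_summary = summarize_tweets(toy)

# Drop repeated texts, keeping the first occurrence of each
deduped = toy.drop_duplicates(subset="text").reset_index(drop=True)
```

On the real data, the same helper could be applied to something like `pd.read_csv("train.csv")`, assuming the file was downloaded under that name. Dropping duplicates on 'text' alone keeps the first occurrence of each tweet, which is one reasonable choice among several.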