This project was made possible by the datasets available on www.kaggle.com. A huge thank you to the creators of both datasets. Find out more about the datasets and their authors:
See my slideshow presentation -> https://docs.google.com/presentation/d/1jZAaJ9fcv64-ndfwdH1Vw5dqB6IYyTe3vrMo1pGYMp0/edit?usp=sharing
Limitations and Challenges
- Data Imbalance
- Scripts written by multiple writers
- Dialogue not significant to any one character
Metrics
- Accuracy
- Recall
- Precision
Multinomial Naive Bayes
- Lemmatization
- English and Custom Stop Words
I explored a variety of characters and their lines within the film, but there was a very noticeable imbalance in the number of lines each character had. Tony Stark came in first with a total of ~1700 lines. Below is a bar chart of the 5 characters I chose for my initial analysis; in my plots folder you'll find other examples as well.
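For reference, here's a rough sketch of how that line-count comparison could be produced with pandas; the file name, column names, and character spellings are assumptions about the script dataset, not the project's exact code.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed layout: one row per spoken line, with a 'character' column.
script = pd.read_csv("avengers_script.csv")

chosen = ["TONY STARK", "STEVE ROGERS", "THOR", "NATASHA ROMANOFF", "BRUCE BANNER"]
line_counts = script[script["character"].isin(chosen)]["character"].value_counts()

line_counts.plot(kind="bar", title="Lines per character")
plt.ylabel("Number of lines")
plt.tight_layout()
plt.savefig("plots/line_counts.png")
```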
The complications of using this script for machine learning include the fact that it was written by multiple writers and that it contains plenty of dialogue that isn't significant to any character. Many of the lines the model cannot correctly classify are words with no significance to a particular character, such as "Go get him" or "thanks": phrases that have no owner because they could, in theory, be said by anyone. Using custom stop words instead of just the standard English stop words lets us counter this issue a bit. Below you'll see the unique words for characters within the film. But beware: some characters have so few significant lines, and so few lines overall, that their "significant" words are also just regular common words.
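As a sketch of the kind of pipeline described above (lemmatization, then a bag of words with custom stop words, then Multinomial Naive Bayes); the lemmatizer choice and the extra stop words here are illustrative assumptions, not the project's exact setup.

```python
import nltk
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

nltk.download("wordnet", quiet=True)  # one-time download for the lemmatizer
lemmatizer = WordNetLemmatizer()

def lemmatize(text):
    # Lowercase and reduce each word to its lemma ("weapons" -> "weapon").
    return " ".join(lemmatizer.lemmatize(word) for word in text.lower().split())

# Custom stop words: the standard English list plus owner-less filler words.
custom_stop_words = list(ENGLISH_STOP_WORDS) + ["thanks", "yeah", "okay", "hey"]

nb_model = Pipeline([
    ("vectorizer", CountVectorizer(preprocessor=lemmatize, stop_words=custom_stop_words)),
    ("classifier", MultinomialNB()),
])

# Tiny illustrative sample; the real data is the film's dialogue lines and speakers.
train_lines = ["I am Iron Man", "Bring me Loki", "I could do this all day"]
train_characters = ["TONY STARK", "THOR", "STEVE ROGERS"]
nb_model.fit(train_lines, train_characters)
```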
Accuracy does not always mean a model is good. The model using English stop words had an accuracy of 0.44 on never-before-seen phrases but had 0 recall on two different characters, meaning it never predicted those characters at all, completely disregarding that they even exist. See the confusion matrix below.
When Hulk, Natasha, and Tony Stark are removed from the set of characters, you'll see a positive change in accuracy and recall. The confusion matrix now shows what we'd like to see: a brighter diagonal, where bright colors represent correct guesses. See the confusion matrix below.
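A sketch of how these per-character metrics and the confusion matrix can be computed with scikit-learn, assuming the `nb_model` pipeline from the sketch above and a held-out test split (`test_lines`, `test_characters`):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report, ConfusionMatrixDisplay

predictions = nb_model.predict(test_lines)

# Per-character precision and recall expose the "0 recall" problem
# even when overall accuracy looks acceptable.
print(classification_report(test_characters, predictions, zero_division=0))

# A brighter diagonal means more lines attributed to the correct character.
ConfusionMatrixDisplay.from_predictions(test_characters, predictions)
plt.show()
```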
With this reduced character set, the model reached 70.21% accuracy. The important-word distribution also looked very different, including genuinely unique words such as "loki" for Thor.
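One way to inspect each character's most important words is to read them off the fitted Naive Bayes feature log probabilities; this sketch assumes the `nb_model` pipeline from earlier:

```python
import numpy as np

vectorizer = nb_model.named_steps["vectorizer"]
classifier = nb_model.named_steps["classifier"]
vocabulary = np.array(vectorizer.get_feature_names_out())

# For each character, print the ten words with the highest log probability.
for index, character in enumerate(classifier.classes_):
    top = np.argsort(classifier.feature_log_prob_[index])[-10:][::-1]
    print(character, "->", ", ".join(vocabulary[top]))
```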
Some example phrases the model still misclassifies:

| Phrase | Model Predicted | Actually Said By |
|---|---|---|
| "thank you" | TONY STARK | STEVE ROGERS |
| "so the question is who in shield could launch a domestic missile strike" | TONY STARK | NATASHA ROMANOFF |
| "pregnant" | TONY STARK | STEVE ROGERS |
Exploratory Data Analysis
- Image Distribution
  - Chris Evans - 50
  - Chris Hemsworth - 53
  - Mark Ruffalo - 63
  - Robert Downey Jr - 51
  - Scarlett Johansson - 54
Metrics
- Accuracy
Neural Networks
- CNNs
Logistic Regression
The Kaggle dataset contains images of the actors who play the same five characters I used for NLP: Chris Evans, Chris Hemsworth, Robert Downey Jr, Scarlett Johansson, and Mark Ruffalo.
Here's what the original images look like:
To keep the image classification straightforward and avoid the extra dimensionality of color channels, we grayscale the images. To add more data for the model, we also flip every image left to right, so the model learns a second way of seeing these faces and features.
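A minimal sketch of that preprocessing step with Pillow; the folder layout and target image size are assumptions, not the project's exact setup.

```python
from pathlib import Path
from PIL import Image, ImageOps

def preprocess(image_path, size=(128, 128)):
    """Grayscale and resize an image, returning it along with a mirrored copy."""
    image = Image.open(image_path).convert("L")  # "L" = single grayscale channel
    image = image.resize(size)
    mirrored = ImageOps.mirror(image)            # flip left to right for augmentation
    return image, mirrored

# Example: build an augmented set from one actor's folder of images.
augmented = []
for path in Path("images/chris_evans").glob("*.jpg"):
    augmented.extend(preprocess(path))
```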
| Neural Network | Simple CNN | Custom Deeper CNN | Logistic Regression |
|---|---|---|---|
| Total params: 15,813,125 | Total params: 127,269 | Total params: 6,685,573 | Solver: 'saga', Tol: 0.1 |
| 3 Layers | 7 Layers | 13 Layers | |
| 40.1% Accuracy | 54.7% Accuracy | 92% Accuracy | 67.88% Accuracy |
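For reference, here is a minimal Keras sketch of what the "Simple CNN" column could look like; the layer sizes and input resolution are assumptions and will not reproduce the exact parameter counts in the table.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Grayscale 128x128 inputs, 5 actor classes; sizes are illustrative.
simple_cnn = keras.Sequential([
    layers.Input(shape=(128, 128, 1)),
    layers.Conv2D(16, kernel_size=3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(32, kernel_size=3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(5, activation="softmax"),
])

simple_cnn.compile(optimizer="adam",
                   loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])
simple_cnn.summary()  # prints the total parameter count, like the figures above
```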
Logistic Regression allows for better interpretability. One of the best parts of that interpretability is being able to see what the model relies on when predicting each class. Through pixel importance we can see which pixels positively impact the model's prediction for a given class.
Check out the pixel importance maps for the different classes below. Blue marks the pixels that positively impact the prediction probability and red marks the pixels that negatively impact it.
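A sketch of how those maps can be drawn from the fitted logistic regression coefficients; the image resolution and variable names are assumptions (the solver and tolerance match the table above):

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

# X_train: flattened grayscale images (one row per image); y_train: actor labels.
log_reg = LogisticRegression(solver="saga", tol=0.1, max_iter=1000)
log_reg.fit(X_train, y_train)

image_shape = (128, 128)  # assumed grayscale resolution
fig, axes = plt.subplots(1, len(log_reg.classes_), figsize=(15, 3))
for axis, actor, coefficients in zip(axes, log_reg.classes_, log_reg.coef_):
    limit = abs(coefficients).max()
    # "RdBu" maps positive coefficients (helpful pixels) to blue, negative to red.
    axis.imshow(coefficients.reshape(image_shape), cmap="RdBu", vmin=-limit, vmax=limit)
    axis.set_title(actor)
    axis.axis("off")
plt.show()
```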
Do they look like the people they represent? And does that give you an idea of why our model can struggle a bit to correctly identify each class?
Hope you enjoyed my project! Feel free to check out my GitHub for other projects I've done.