This project was made possible by the datasets available on www.kaggle.com. A huge thank you to the creators of both datasets. Find out more about the datasets and their authors:
See my slideshow presentation -> https://docs.google.com/presentation/d/1jZAaJ9fcv64-ndfwdH1Vw5dqB6IYyTe3vrMo1pGYMp0/edit?usp=sharing
Limitations and Challenges
- Data Imbalance
- Scripts written by multiple writers
- Dialogue not significant to any one character
Metrics
- Accuracy
- Recall
- Precision
Multinomial Naive Bayes
- Lemmatization
- English and Custom Stop Words
I explored a variety of characters and their lines within the film, but there was a very noticeable imbalance in the number of lines each character had. Tony Stark came in first with a total of ~1700 lines. Below is a bar chart of the 5 characters I chose for my initial analysis; in my plots folder you'll find other examples as well.
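For reference, here's a rough sketch of how that line-count comparison could be produced with pandas; the file name, column names, and character spellings are assumptions about the script dataset, not the project's exact code.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed layout: one row per spoken line, with a 'character' column.
script = pd.read_csv("avengers_script.csv")

chosen = ["TONY STARK", "STEVE ROGERS", "THOR", "NATASHA ROMANOFF", "BRUCE BANNER"]
line_counts = script[script["character"].isin(chosen)]["character"].value_counts()

line_counts.plot(kind="bar", title="Lines per character")
plt.ylabel("Number of lines")
plt.tight_layout()
plt.savefig("plots/line_counts.png")
```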
The complications of using this script for machine learning include the fact that it was written by multiple writers and that it contains plenty of dialogue that isn't significant to any character. Many of the lines the model cannot correctly classify are words with no significance to a particular character, such as "Go get him" or "thanks": phrases that have no owner because they could, in theory, be said by anyone. Using custom stop words instead of just the standard English stop words lets us counter this issue a bit. Below you'll see the unique words for characters within the film. But beware: some characters have so few significant lines, and so few lines overall, that their "significant" words are also just regular common words.
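As a sketch of the kind of pipeline described above (lemmatization, then a bag of words with custom stop words, then Multinomial Naive Bayes); the lemmatizer choice and the extra stop words here are illustrative assumptions, not the project's exact setup.

```python
import nltk
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

nltk.download("wordnet", quiet=True)  # one-time download for the lemmatizer
lemmatizer = WordNetLemmatizer()

def lemmatize(text):
    # Lowercase and reduce each word to its lemma ("weapons" -> "weapon").
    return " ".join(lemmatizer.lemmatize(word) for word in text.lower().split())

# Custom stop words: the standard English list plus owner-less filler words.
custom_stop_words = list(ENGLISH_STOP_WORDS) + ["thanks", "yeah", "okay", "hey"]

nb_model = Pipeline([
    ("vectorizer", CountVectorizer(preprocessor=lemmatize, stop_words=custom_stop_words)),
    ("classifier", MultinomialNB()),
])

# Tiny illustrative sample; the real data is the film's dialogue lines and speakers.
train_lines = ["I am Iron Man", "Bring me Loki", "I could do this all day"]
train_characters = ["TONY STARK", "THOR", "STEVE ROGERS"]
nb_model.fit(train_lines, train_characters)
```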
Accuracy does not always mean a model is good. The model using English stop words had an accuracy of 0.44 on never-before-seen phrases but had 0 recall on two different characters, meaning it never predicted those characters at all, completely disregarding that they even exist. See the confusion matrix below.
When Hulk, Natasha, and Tony Stark are removed from the set of characters, you'll see a positive change in accuracy and recall. The confusion matrix now shows what we'd like to see: a brighter diagonal, where bright colors represent correct guesses. See the confusion matrix below.
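A sketch of how these per-character metrics and the confusion matrix can be computed with scikit-learn, assuming the `nb_model` pipeline from the sketch above and a held-out test split (`test_lines`, `test_characters`):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report, ConfusionMatrixDisplay

predictions = nb_model.predict(test_lines)

# Per-character precision and recall expose the "0 recall" problem
# even when overall accuracy looks acceptable.
print(classification_report(test_characters, predictions, zero_division=0))

# A brighter diagonal means more lines attributed to the correct character.
ConfusionMatrixDisplay.from_predictions(test_characters, predictions)
plt.show()
```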
With this reduced character set, the model reached 70.21% accuracy. The important-word distribution also looked very different, including genuinely unique words such as "loki" for Thor.
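One way to inspect each character's most important words is to read them off the fitted Naive Bayes feature log probabilities; this sketch assumes the `nb_model` pipeline from earlier:

```python
import numpy as np

vectorizer = nb_model.named_steps["vectorizer"]
classifier = nb_model.named_steps["classifier"]
vocabulary = np.array(vectorizer.get_feature_names_out())

# For each character, print the ten words with the highest log probability.
for index, character in enumerate(classifier.classes_):
    top = np.argsort(classifier.feature_log_prob_[index])[-10:][::-1]
    print(character, "->", ", ".join(vocabulary[top]))
```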
Some example phrases the model still misclassifies:

| Phrase | Model Predicted | Actually Said By |
|---|---|---|
| "thank you" | TONY STARK | STEVE ROGERS |
| "so the question is who in shield could launch a domestic missile strike" | TONY STARK | NATASHA ROMANOFF |
| "pregnant" | TONY STARK | STEVE ROGERS |
Exploratory Data Analysis
- Image Distribution
  - Chris Evans - 50
  - Chris Hemsworth - 53
  - Mark Ruffalo - 63
  - Robert Downey Jr - 51
  - Scarlett Johansson - 54
Metrics
- Accuracy
Neural Networks
- CNNs
Logistic Regression
The Kaggle dataset contains images of the actors who play the same five characters I used for NLP: Chris Evans, Chris Hemsworth, Robert Downey Jr, Scarlett Johansson, and Mark Ruffalo.
Here's what the original images look like:
To keep the image classification straightforward and avoid the extra dimensionality of color channels, we grayscale the images. To add more data for the model, we also flip every image left to right, so the model learns a second way of seeing these faces and features.
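A minimal sketch of that preprocessing step with Pillow; the folder layout and target image size are assumptions, not the project's exact setup.

```python
from pathlib import Path
from PIL import Image, ImageOps

def preprocess(image_path, size=(128, 128)):
    """Grayscale and resize an image, returning it along with a mirrored copy."""
    image = Image.open(image_path).convert("L")  # "L" = single grayscale channel
    image = image.resize(size)
    mirrored = ImageOps.mirror(image)            # flip left to right for augmentation
    return image, mirrored

# Example: build an augmented set from one actor's folder of images.
augmented = []
for path in Path("images/chris_evans").glob("*.jpg"):
    augmented.extend(preprocess(path))
```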
| Neural Network | Simple CNN | Custom Deeper CNN | Logistic Regression |
|---|---|---|---|
| Total params: 15,813,125 | Total params: 127,269 | Total params: 6,685,573 | Solver: 'saga', Tol: 0.1 |
| 3 Layers | 7 Layers | 13 Layers | |
| 40.1% Accuracy | 54.7% Accuracy | 92% Accuracy | 67.88% Accuracy |
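For reference, here is a minimal Keras sketch of what the "Simple CNN" column could look like; the layer sizes and input resolution are assumptions and will not reproduce the exact parameter counts in the table.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Grayscale 128x128 inputs, 5 actor classes; sizes are illustrative.
simple_cnn = keras.Sequential([
    layers.Input(shape=(128, 128, 1)),
    layers.Conv2D(16, kernel_size=3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(32, kernel_size=3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(5, activation="softmax"),
])

simple_cnn.compile(optimizer="adam",
                   loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])
simple_cnn.summary()  # prints the total parameter count, like the figures above
```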
Logistic Regression allows for better interpretability. One of the best parts of that interpretability is being able to see what the model relies on when predicting each class. Through pixel importance we can see which pixels positively impact the model's prediction for a given class.
Check out the pixel importance maps for the different classes below. Blue marks the pixels that positively impact the prediction probability and red marks the pixels that negatively impact it.
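A sketch of how those maps can be drawn from the fitted logistic regression coefficients; the image resolution and variable names are assumptions (the solver and tolerance match the table above):

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

# X_train: flattened grayscale images (one row per image); y_train: actor labels.
log_reg = LogisticRegression(solver="saga", tol=0.1, max_iter=1000)
log_reg.fit(X_train, y_train)

image_shape = (128, 128)  # assumed grayscale resolution
fig, axes = plt.subplots(1, len(log_reg.classes_), figsize=(15, 3))
for axis, actor, coefficients in zip(axes, log_reg.classes_, log_reg.coef_):
    limit = abs(coefficients).max()
    # "RdBu" maps positive coefficients (helpful pixels) to blue, negative to red.
    axis.imshow(coefficients.reshape(image_shape), cmap="RdBu", vmin=-limit, vmax=limit)
    axis.set_title(actor)
    axis.axis("off")
plt.show()
```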
Do they look like the people they represent? And does that give you an idea of why our model can struggle a bit to correctly identify each class?
Hope you enjoyed my project! Feel free to check out my GitHub for other projects I've done.