dannyyy-jimenez/CapstoneTwo

ML on the MCU

This project was made possible by the datasets available on www.kaggle.com. A huge thank you to the creators of both datasets. Find out more about the datasets and their authors:

TO RUN THIS PROJECT, YOU MUST ADD THE DATA INTO THE DATA FOLDER
DATA > IMAGES > ALL [FOR THE IMAGES]
DATA > MCU.CSV [FOR THE SCRIPTS]

See my slideshow presentation -> https://docs.google.com/presentation/d/1jZAaJ9fcv64-ndfwdH1Vw5dqB6IYyTe3vrMo1pGYMp0/edit?usp=sharing


Natural Language Processing

  • Limitations and Challenges

    • Data Imbalance
    • Scripts written by multiple writers
    • Dialogue not specific to any character
  • Metrics

    • Accuracy
    • Recall
    • Precision
  • Multinomial Naive Bayes

    • Lemmatization
    • English and Custom Stop Words

Exploratory Data Analysis

I explored a variety of characters and their lines within the film, but there was a very noticeable imbalance in the number of lines each character had. Tony Stark came in first with a total of ~1700 lines. Below is a bar chart of the 5 characters I chose for my initial analysis; in my plots folder you'll find other examples as well.

# of Lines

Word Importance Within Characters

Using these scripts for machine learning comes with complications: they were written by multiple writers, and they contain plenty of dialogue that is not specific to any character. Many of the lines the model cannot classify correctly, such as "Go get him" or "thanks", say nothing about who is speaking. These phrases have no owner because, in theory, anyone could say them. By using custom stop words rather than only the English stop words, we can try to counter this issue a bit. Below you'll see the unique words for characters within the film. But BEWARE: some characters don't have enough lines, or enough distinctive lines, so their most significant words are just regular common words.
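To make the stop-word idea concrete, here is a minimal from-scratch sketch of a Multinomial Naive Bayes classifier in plain Python. It is not the project's code: the stop-word sets, the toy lines, and the helper names (`tokenize`, `train`, `predict`) are all made up for illustration, lemmatization is left out, and combining the English and custom lists into one set is an assumption.

```python
import math
from collections import Counter, defaultdict

# Hypothetical stop-word lists for illustration; not the project's actual lists.
ENGLISH_STOP_WORDS = {"the", "a", "is", "to", "you", "i", "me", "my", "of", "and"}
CUSTOM_STOP_WORDS = {"thank", "thanks", "go", "get", "him"}


def tokenize(line, stop_words):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in line.lower().split() if w not in stop_words]


def train(lines, labels, stop_words):
    """Count words per class and lines per class."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for line, label in zip(lines, labels):
        word_counts[label].update(tokenize(line, stop_words))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab


def predict(line, model, stop_words, alpha=1.0):
    """Return the class with the highest log posterior (Laplace smoothing)."""
    word_counts, class_counts, vocab = model
    total_lines = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        # Start from the class prior, then add one log-likelihood per word.
        score = math.log(class_counts[label] / total_lines)
        total_words = sum(word_counts[label].values())
        for word in tokenize(line, stop_words):
            if word not in vocab:
                continue  # words never seen in training carry no evidence
            count = word_counts[label][word]
            score += math.log((count + alpha) / (total_words + alpha * len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy lines invented for this sketch, not taken from the dataset.
lines = [
    "loki is my brother",
    "asgard will answer",
    "the serum made me a soldier",
    "i am a soldier from brooklyn",
]
labels = ["THOR", "THOR", "STEVE ROGERS", "STEVE ROGERS"]

stop_words = ENGLISH_STOP_WORDS | CUSTOM_STOP_WORDS
model = train(lines, labels, stop_words)
print(predict("loki took asgard", model, stop_words))
print(tokenize("thank you", stop_words))
```

A generic phrase such as "thank you" ends up with no tokens at all once the stop words are removed, so the prediction falls back on the class prior alone. With imbalanced data, that prior favors whoever has the most lines.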

Using English Stop Words

English Stop Words

Using Custom Stop Words

Custom Stop Words

Accuracy vs Recall vs Precision

High accuracy does not always mean a model is good. The model using English stop words had an accuracy of 0.44 (44%) on never-before-seen phrases but had 0 recall on two different characters, meaning it predicted those characters 0 times, completely disregarding that they even exist. See the confusion matrix below.

CM for All 5
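The gap between accuracy and recall can be reproduced with a few lines of plain Python. The labels below are made up for illustration (they are not the project's predictions), and the helper names are hypothetical; the point is only that a model can score reasonably on accuracy while never predicting one class.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall_and_precision(y_true, y_pred, label):
    """Per-class recall and precision; 0.0 when the denominator is empty."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    actual = sum(t == label for t in y_true)
    predicted = sum(p == label for p in y_pred)
    recall = tp / actual if actual else 0.0
    precision = tp / predicted if predicted else 0.0
    return recall, precision


# Made-up, imbalanced labels: the majority class dominates the predictions.
y_true = ["TONY STARK"] * 6 + ["THOR"] * 2 + ["HULK"] * 2
y_pred = ["TONY STARK"] * 6 + ["THOR", "TONY STARK"] + ["TONY STARK"] * 2

print(accuracy(y_true, y_pred))
for name in ("TONY STARK", "THOR", "HULK"):
    print(name, recall_and_precision(y_true, y_pred, name))
```

In this toy case 7 of 10 predictions are correct, yet HULK is never predicted, so its recall is 0.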

After removing Hulk, Natasha, and Tony Stark from the set of characters, you'll see a positive change in accuracy and recall. The confusion matrix shows what we'd like to see: a brighter diagonal (bright colors represent correct guesses). See the confusion matrix below.

CM for Steve and Thor

With only Steve and Thor, the model reached 70.21% accuracy. The distribution of important words was quite different too, including genuinely unique words such as "loki" for Thor.

Line dist thor and steve

Check out some of the phrases my model got wrong. Would you have been able to guess them?

thank you

Model Predicted: TONY STARK

Actually Said By: STEVE ROGERS


so the question is who in shield could launch a domestic missile strike

Model Predicted: TONY STARK

Actually Said By: NATASHA ROMANOFF


pregnant

Model Predicted: TONY STARK

Actually Said By: STEVE ROGERS


Image Classification

  • Exploratory Data Analysis

    • Image Distribution
      • Chris Evans - 50
      • Chris Hemsworth - 53
      • Mark Ruffalo - 63
      • Robert Downey Jr. - 51
      • Scarlett Johansson - 54
  • Metrics

    • Accuracy
  • Neural Networks

    • CNNs
  • Logistic Regression

Exploratory Data Analysis

The Kaggle dataset features the actors who play the same five characters I used for NLP: Chris Evans, Chris Hemsworth, Robert Downey Jr., Scarlett Johansson, and Mark Ruffalo.

Here's what the original images look like

Mark

Robert

Chris

To keep image classification manageable and avoid the extra dimensions that color channels add, we convert the images to grayscale. To give the model more data, we also flip every image left to right, so it sees each face and its features from a second orientation.

Gray
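The two preprocessing steps above can be sketched in plain Python on nested lists. This is only an illustration: the luminance weights (0.299, 0.587, 0.114) are a common convention and an assumption here, the function names are hypothetical, and the project itself may well have used an image library instead.

```python
def to_grayscale(image):
    """Convert rows of (r, g, b) tuples into rows of single luminance values."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in image
    ]


def flip_left_right(image):
    """Mirror each row so the image is flipped horizontally."""
    return [row[::-1] for row in image]


def augment(images, labels):
    """Return each grayscale image plus its mirrored copy, with matching labels."""
    out_images, out_labels = [], []
    for image, label in zip(images, labels):
        gray = to_grayscale(image)
        out_images.extend([gray, flip_left_right(gray)])
        out_labels.extend([label, label])
    return out_images, out_labels


# A made-up 2x2 RGB "image" standing in for a real photo.
tiny = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]
images, labels = augment([tiny], ["Chris Evans"])
print(images)
print(labels)
```

Flipping doubles the number of training examples without collecting new photos; the mirrored copy keeps the same label as the original.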

Image Classification Models

Neural Network

  • Total params: 15,813,125
  • 3 Layers
    • Flatten
    • Dense
    • Dense
    • Output (Softmax)
  • 40.1% Accuracy

Simple CNN

  • Total params: 127,269
  • 7 Layers
    • Avg Pooling
    • 2 Conv
    • Max Pooling
    • Dropout
    • Flatten
    • Output (Softmax)
  • 54.7% Accuracy

Custom Deeper CNN

  • Total params: 6,685,573
  • 13 Layers
    • Max Pooling
    • 2 Conv
    • Max Pooling
    • 2 Conv
    • Dropout
    • Max Pooling
    • Conv
    • Flatten
    • Dense
    • Dropout
    • Output (Softmax)
  • 92% Accuracy

Logistic Regression

  • Solver: 'Saga'
  • Tol: 0.1
  • Pixel Importance
  • Better than first 2 models
  • More data
  • 67.88% Accuracy

Models Loss

Default Neural Network

Default Neural Network

Simple Convolutional Neural Network

Simple Conv Neural Network

Custom Convolutional Neural Network

Custom Conv Neural Network

Models Confusion Matrix

Default Neural Network

Default Neural Network

Simple Convolutional Neural Network

Simple Conv Neural Network

Custom Convolutional Neural Network

Custom Conv Neural Network


The Beauty of Logistic Regression Classifiers

Logistic Regression is easier to interpret than the neural networks. One of the best parts of that interpretability is being able to see what the model uses to predict one class or another. Through pixel importance we can see which pixels positively impact the model's prediction for a given class.

Check out the pixel importance for our different classes. Blue marks the pixels that positively impact the prediction probability, and red marks the pixels that negatively impact it.

Do they look like the people they represent? And does it give you an idea of why our model can struggle a bit to correctly identify each class?

pixel importance evans

pixel importance stark

pixel importance hems

pixel importance scarlett

pixel importance mark
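One way to build maps like the ones above: a logistic regression trained on flattened pixels learns one weight per pixel for each class, so a class's weight vector can be reshaped back into the image's shape. The sketch below assumes that setup and uses made-up coefficients and hypothetical helper names; with scikit-learn, the per-class weights would come from the fitted model's `coef_` attribute.

```python
def coefficients_to_grid(coefs, height, width):
    """Reshape a flat coefficient vector into rows matching the image layout."""
    if len(coefs) != height * width:
        raise ValueError("coefficient count must equal height * width")
    return [coefs[r * width:(r + 1) * width] for r in range(height)]


def sign_map(grid):
    """Mark each pixel as raising (+), lowering (-), or not affecting (0) the class score."""
    return [
        ["+" if c > 0 else "-" if c < 0 else "0" for c in row]
        for row in grid
    ]


# Made-up coefficients for one class of a 2x3 image, for illustration only.
coefs = [0.8, -0.2, 0.0, -0.5, 0.3, 0.1]
grid = coefficients_to_grid(coefs, height=2, width=3)
for row in sign_map(grid):
    print(" ".join(row))
```

In the plots, the "+" pixels would correspond to blue and the "-" pixels to red.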


Hope you enjoyed my project! Feel free to check out my GitHub for other projects I've done.

About

Repo for my second Galvanize Capstone
