sydney-machine-learning/sentimentanalysis-Hollywood

Sentiment analysis of movie scripts from Hollywood

This repository provides code and supplementary materials for the paper titled 'Longitudinal Abuse and Sentiment Analysis of Hollywood Oscar and Blockbuster Movie Dialogues using LLMs'.

Seminar

Publication

Task Description

This project explores the trends in abusive language and sentiment in Hollywood movies from 1950 to 2024, with a focus on Oscar-nominated films and top 10 box-office hits. We utilize modern NLP models such as RoBERTa to conduct multi-label classification on movie subtitles, analyzing shifts in emotions and the use of abusive language across time and genres.

The study examines how sentiment and abusive language in movie dialogues changed over 75 years, with attention to the influence of social and cultural shifts. We apply fine-tuned large language models (LLMs) to movie subtitles and compare genres and decades to identify emotional trends in Hollywood.

Datasets

Movies Subtitles: Subtitles from over 1,000 films, including Oscar-nominated films and the top 10 box-office hits, were collected. These films were categorized into four genres: Action, Comedy, Drama, and Thriller.

SenWave Dataset: A dataset of sentiment-labeled tweets from the COVID-19 pandemic period, obtained from the SenWave GitHub repository. We use it to fine-tune our sentiment model for multi-label classification across emotions such as optimism, anxiety, and anger.

RAL-E Dataset: A Reddit-based dataset for detecting abusive language, covering offensive, hateful, and violent content. It comes from Tommaso Caselli's HateBERT paper and was central to fine-tuning our abuse detection models.

Models

N-Gram Analysis: We conducted an N-Gram analysis (bigrams, trigrams) to visualize the most frequent word sequences in movie dialogues over time. This helped identify thematic trends and shifts in sentiment.
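The counting step behind an N-gram analysis can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the repository's actual code: the tokenizer regex, the function names `ngrams` and `top_ngrams`, and the sample dialogue lines are all assumptions made for the example.

```python
from collections import Counter
import re


def ngrams(tokens, n):
    """Return the n-token tuples of a token list, in reading order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(lines, n=2, k=3):
    """Count n-grams within each dialogue line and return the k most common.

    N-grams never span two lines, so unrelated subtitle lines do not
    produce spurious word sequences.
    """
    counts = Counter()
    for line in lines:
        # Hypothetical tokenizer: lowercase words, apostrophes kept.
        tokens = re.findall(r"[a-z']+", line.lower())
        counts.update(ngrams(tokens, n))
    return counts.most_common(k)


# Made-up sample lines, not taken from the subtitle corpus.
sample_lines = [
    "I love you, I love you so much.",
    "You know I love you.",
]
print(top_ngrams(sample_lines, n=2, k=2))
```

Running the same function once per decade, on that decade's subtitles only, gives the per-period rankings that a trend visualization would be built from.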

BERT-based Models: We used pre-trained RoBERTa and HateBERT models for sentiment analysis and abuse detection. RoBERTa was fine-tuned using the SenWave dataset for sentiment analysis, while HateBERT was used to detect abusive language in movie dialogues.
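In multi-label classification, each emotion is scored independently, so one line of dialogue can carry several labels at once. The sketch below shows only the decoding step, under stated assumptions: the logits are invented numbers standing in for the output of a fine-tuned classifier, the label list is an illustrative subset, and the 0.5 threshold is a common default rather than a value taken from the paper.

```python
import math

# Illustrative subset of emotion labels; the full label set is an assumption.
LABELS = ["optimism", "anxiety", "anger", "humor"]


def sigmoid(x):
    """Map a raw logit to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_multilabel(logits, labels=LABELS, threshold=0.5):
    """Return every label whose sigmoid probability reaches the threshold.

    Unlike softmax decoding, zero, one, or several labels may be returned.
    """
    probs = [sigmoid(z) for z in logits]
    return [label for label, p in zip(labels, probs) if p >= threshold]


# Hypothetical logits for one subtitle line.
example_logits = [2.0, -1.5, 0.3, -0.2]
print(decode_multilabel(example_logits))
```

Using a sigmoid per label instead of a softmax over all labels is what allows combinations such as humor together with anger to be detected on the same line.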

Results

  1. Sentiment Analysis Over Time

We performed sentiment analysis on movie dialogues from 1950 to 2024, identifying significant changes in emotional expression.

Sentiment Polarity Trends (1950-2024): The graph shows the trend of sentiment polarity in movie dialogues over time. Polarity scores range from -1 to 1, where positive values represent positive emotions and negative values represent negative emotions.
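One simple way to turn per-film polarity scores into a trend over time is to average them by decade. The following is a sketch under assumptions, not the paper's procedure: the `(year, score)` record format, the function name `mean_polarity_by_decade`, and the sample values are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_polarity_by_decade(records):
    """Average polarity scores per decade.

    `records` is an iterable of (release_year, polarity) pairs, where
    polarity is expected to lie in [-1, 1].
    """
    buckets = defaultdict(list)
    for year, score in records:
        if not -1.0 <= score <= 1.0:
            raise ValueError(f"polarity out of range for {year}: {score}")
        decade = year // 10 * 10  # e.g. 1954 -> 1950
        buckets[decade].append(score)
    return {decade: mean(scores) for decade, scores in sorted(buckets.items())}


# Made-up scores for illustration only.
records = [(1954, 0.5), (1958, 0.25), (2003, -0.5), (2007, 0.0)]
print(mean_polarity_by_decade(records))
```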

Sentiment Weights by Decade: The sentiment weights chart highlights the relative contribution of different emotions over the decades. Emotions like optimism, anger, and humor fluctuate in prominence across different time periods.

  2. Abusive Language Detection

Abusive Word Frequency by Decade: Abusive language frequency peaked in the 2000s and has since declined, reflecting changing societal norms.

Abusive Content Across Genres: Action films show a low level of abusive content, while thrillers in the 1950s had the highest abusive word count.

  3. Emotional Sentiment Co-occurrence

The heatmap shows frequent co-occurrences of humor with anger, especially in comedies, reflecting the use of satire and conflict.
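The counts behind a co-occurrence heatmap can be derived directly from multi-label predictions: every pair of emotions assigned to the same line adds one to that pair's cell. The sketch below assumes each prediction is a set of label strings. The function name and sample predictions are hypothetical, and the repository may compute the matrix differently.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(label_sets):
    """Count how often each unordered pair of labels appears together.

    Pairs are stored as alphabetically sorted tuples, so
    ("anger", "humor") and ("humor", "anger") share one entry.
    """
    counts = Counter()
    for labels in label_sets:
        for pair in combinations(sorted(set(labels)), 2):
            counts[pair] += 1
    return counts


# Made-up predictions for three dialogue lines.
predictions = [
    {"humor", "anger"},
    {"humor", "anger", "optimism"},
    {"optimism"},
]
counts = cooccurrence_counts(predictions)
print(counts[("anger", "humor")])
```

Filtering the predictions by genre before counting gives a separate matrix per genre, which is one way to compare comedies against other categories.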
