dannyyy-jimenez/CapstoneOne

Spotify Tracks 1922-2020

CHECK OUT MY SLIDESHOW PRESENTATION ON MY DATA

NOTE: The data folder is not in this repo due to GitHub's size limit. Here is where you can find the tracks.csv file needed for this project to run.

NOTE: The Spotify API auth key is also omitted for security reasons. To get your own Spotify API key, follow these instructions.


EDA

A Look at the Data ~ 600k Tracks

  • ID ◼
  • Name ◼
  • Popularity ★
  • Explicit ♦
  • Artists (◼,◼)
  • Release Date ◼
  • Danceability ♣
  • Energy ♣
  • Key ★
  • Loudness ♣
  • Speechiness ♣
  • Acousticness ♣
  • Instrumentalness ♣
  • Liveness ♣
  • Valence ♣
  • Mode ★
  • Tempo ♣
  • Time Signature ★
  • Duration ♣
  • Release Year ★
◼ string, ★ int, ♦ boolean, ♣ float

Key Metrics Used

  • Name ◼
  • Popularity ★
  • Danceability ♣
  • Duration ♣
  • Release Year ★
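As a rough illustration of how these key metrics could be pulled out of a CSV row, here is a minimal sketch using only the standard library. The column names (`duration_ms`, `release_date`, and so on), the millisecond unit for duration, and the two sample rows are assumptions made for this example; they are not verified against tracks.csv.

```python
import csv
import io

# Hypothetical two-row excerpt; the column names are assumptions
# modeled on the field list above, not verified against tracks.csv.
SAMPLE_CSV = """id,name,popularity,explicit,release_date,danceability,duration_ms
a1,Song A,45,0,1998-07-14,0.71,215000
b2,Song B,12,1,1965,0.43,342000
"""


def parse_track(row):
    """Convert one CSV row (all strings) into typed key metrics."""
    return {
        "name": row["name"],
        "popularity": int(row["popularity"]),
        "danceability": float(row["danceability"]),
        # Duration is converted from milliseconds to minutes (assumed unit).
        "duration_min": int(row["duration_ms"]) / 60000,
        # Dates may be "YYYY" or "YYYY-MM-DD"; the year is the first 4 characters.
        "release_year": int(row["release_date"][:4]),
    }


tracks = [parse_track(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
for track in tracks:
    print(track)
```

Deriving the release year from the first four characters keeps the sketch tolerant of rows that only carry a year.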

Visualizing the Data

Release Year Amounts

Boxplot Duration

Boxplot Danceability

Note: The boxplots exclude outliers

Change in Popularity Over Time

Change in Explicit Over Time

Change in Danceability Over Time

Change in Duration Over Time


Hypothesis Testing

What Changes The Popularity?

  • Explicit
  • Duration

Popularity vs Explicitness

Scatter Between Popularity and Explicitness

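  • H0: Mean popularity of explicit songs = mean popularity of non-explicit songs

  • HA: Mean popularity of explicit songs > mean popularity of non-explicit songs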

  • Using bootstrapping, we repeatedly resample our data with replacement and record the mean of each resample, which simulates drawing many sample means from the population

Histogram of Bootstraps

  • 95% Confidence Intervals

    • Non Explicit: (26.72, 26.74)

    • Explicit: (45.68, 45.69)

  • Since the two confidence intervals do not overlap, I reject the null hypothesis. There is enough evidence to conclude that the population mean popularity of explicit songs is greater than that of non-explicit songs.
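The bootstrap comparison described in this section can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper names `bootstrap_means` and `percentile_interval` are hypothetical, the interval uses the simple percentile method, and the popularity values are synthetic stand-ins rather than the real tracks.csv data.

```python
import random
import statistics


def bootstrap_means(values, n_resamples=1000, seed=0):
    """Resample `values` with replacement and return the mean of each resample."""
    rng = random.Random(seed)
    n = len(values)
    return [statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)]


def percentile_interval(means, confidence=0.95):
    """Return the central `confidence` interval of the bootstrap means."""
    ordered = sorted(means)
    tail = (1 - confidence) / 2
    lower = ordered[round(tail * len(ordered))]
    upper = ordered[round((1 - tail) * len(ordered)) - 1]
    return lower, upper


# Synthetic stand-ins for the popularity column (hypothetical values, not the real data).
rng = random.Random(42)
non_explicit = [min(100, max(0, rng.gauss(27, 17))) for _ in range(2000)]
explicit = [min(100, max(0, rng.gauss(46, 18))) for _ in range(2000)]

non_explicit_ci = percentile_interval(bootstrap_means(non_explicit))
explicit_ci = percentile_interval(bootstrap_means(explicit))

# Non-overlapping intervals, with the explicit interval higher, support HA.
reject_null = explicit_ci[0] > non_explicit_ci[1]
print(non_explicit_ci, explicit_ci, reject_null)
```

Fixing the seed keeps the resampling reproducible, which makes it easier to compare runs while exploring the data.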

Popularity vs Duration

  • H0: Popularity of songs with length less than or equal to 5 minutes is equal to the Popularity of songs with length greater than 5 minutes

  • HA: Popularity of songs with length less than or equal to 5 minutes is greater than the Popularity of songs with length greater than 5 minutes

  • Using the Central Limit Theorem, I will build a normal model of the sample means, and using Welch's t-test I will calculate my p-value.

Histogram of Dist Duration

  • At an α level of 0.05 and with a p-value of 1.57e-34, I reject the null hypothesis. There is enough evidence to show that the population mean popularity of songs 5 minutes or shorter is greater than that of songs longer than 5 minutes.

CLT for Duration
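For readers who want to see the mechanics of the duration test, here is a minimal standard-library sketch of Welch's t statistic with the Welch-Satterthwaite degrees of freedom. Two assumptions to flag: the one-sided p-value below uses a standard normal approximation to the t distribution (reasonable only for large samples), and the two samples are synthetic. In practice, `scipy.stats.ttest_ind` with `equal_var=False` is the usual library route; the function names here are hypothetical.

```python
import math
import random
import statistics


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    mean_a, mean_b = statistics.fmean(sample_a), statistics.fmean(sample_b)
    # Squared standard error of each sample mean (sample variance / n).
    se2_a = statistics.variance(sample_a) / len(sample_a)
    se2_b = statistics.variance(sample_b) / len(sample_b)
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(sample_a) - 1) + se2_b ** 2 / (len(sample_b) - 1)
    )
    return t, df


def one_sided_p_normal_approx(t):
    """Upper-tail probability P(Z > t), a large-sample stand-in for the t distribution."""
    return 0.5 * math.erfc(t / math.sqrt(2))


# Synthetic popularity samples (hypothetical values, not the real data).
rng = random.Random(7)
short_songs = [rng.gauss(30, 18) for _ in range(3000)]  # <= 5 minutes
long_songs = [rng.gauss(24, 18) for _ in range(600)]    # > 5 minutes

t_stat, dof = welch_t(short_songs, long_songs)
p_value = one_sided_p_normal_approx(t_stat)
print(t_stat, dof, p_value, p_value < 0.05)
```

Welch's version is used rather than the pooled-variance t-test because the two groups differ in size and need not share a variance.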

Spotify API

Using The Spotify API to Compare Artists

I created a couple of helper functions that simplify comparing two artists on a given metric using the Spotify API. With them, hypothesis tests can be run much faster.

Using the Central Limit Theorem, we can conduct Welch's t-test on the chosen metric and obtain a p-value from it.

We can assume our Null and Alternative Hypothesis go as follows:

  • H0: ArtistOneMetric = ArtistTwoMetric

  • HA: ArtistOneMetric > ArtistTwoMetric

Below you'll find the docstrings for the functions that make this process simple for you to use.

def GetTwoArtists(artist_one, artist_two, years=None, nofeatures=True, metric='popularity', save=False, nameAppend=""):
    """Get the test for two different artists according to a metric.

    NOTE: CompareArtistsCLT() is called from within this function

    Parameters
    ----------
    artist_one : list<track>
        a list of analysis track objects
    artist_two : list<track>
        a list of analysis track objects
    years : string
        years to look for in spotify data
    nofeatures : boolean
        do we want data with features
    metric : string
        the metric to look for
    save : boolean
        save figure?
    nameAppend : string
        text to append to filename

    Returns
    -------
    artist_one
        artist_one track analysis
    artist_two
        artist_two track analysis
    """

def CompareArtistsCLT(self, artists, metric='popularity', labels=[], save=False, nameAppend=""):
      """Compare artists metrics using central limit theorem and t testing

      Parameters
      ----------
      artists : list<track data>
          artists data to use
      metric : string
          metric to measure
      labels : list<string>
          list of strings to label our normal models
      save : boolean
          save figure?
      nameAppend : string
          text to append to data

      Returns
      -------
      float
          p-value of our t-test
      """
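To give a feel for the "normal models" these docstrings refer to, here is a small standalone sketch. It does not call the Spotify API or the functions above; `clt_normal_model` is a hypothetical helper, and the popularity scores are made-up numbers. Under the Central Limit Theorem, the sample mean of a metric is modeled as a normal distribution centered on the sample mean, with the standard error as its spread.

```python
import math
import statistics


def clt_normal_model(values):
    """Normal model of the sample mean: N(mean, s / sqrt(n))."""
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    return statistics.NormalDist(statistics.fmean(values), standard_error)


# Hypothetical popularity scores for two artists (made-up numbers).
artist_one = [62, 70, 75, 81, 68, 77, 73, 79]
artist_two = [55, 61, 58, 66, 60, 63, 57, 64]

model_one = clt_normal_model(artist_one)
model_two = clt_normal_model(artist_two)

# Overlap of the two curves ranges from 0 (disjoint) to 1 (identical).
print(model_one, model_two, model_one.overlap(model_two))
```

With so few tracks per artist the normal model is only a rough approximation; larger samples make it more trustworthy.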

Below you'll find some sample tests I've conducted, along with their results.

XXXTentacion vs Juice WRLD (Popularity)

X vs Juice

Reject Null Hypothesis

Kanye West vs J Cole (Popularity)

Kanye vs J Cole

Fail to Reject Null Hypothesis

Bad Bunny vs J Balvin (Danceability)

Bad Bunny vs J Balvin

Fail to Reject Null Hypothesis

About

Galvanize Capstone One
